seanh commented 1 year ago

Question

It might make the implementation easier (and the design different) if we can implement a first version that is only capable of sending a much smaller number of emails, verify that the feature actually increases engagement, then re-implement it as a larger-scale version we can roll out to everyone.

Answer

We'll design the first version of this to be able to operate at a big enough scale to enable notifications for all our institutions, but we'll roll it out to institutions slowly so we can monitor how it performs in case we need to adjust as it scales. Yes, we want to get something (a first minimal version of the email template) released to some users (this can be a test cohort of as few users as we want: tens, hundreds) as soon as possible so that we can start getting feedback and verifying the feature.

seanh commented 1 year ago

While I think we can probably design the code/architecture of the first version of this such that we shouldn't have to completely rewrite it in order to scale it up to all our institutions, I think there are some constraints here that actually require us to roll it out slowly:

Constraint: generating the data for the emails.

How quickly can we actually generate all the data for each night's emails? How many resources (CPU, Elasticsearch, Postgres) does it take? Can our systems handle it at full scale?

I don't think we can answer these questions ahead of time. We just have to implement it, roll it out to a small number of institutions, and monitor how it performs.

For CPU and memory usage I think we can easily horizontally scale the number of celery workers and EC2 instances that we're using for this (as well as vertically scaling the size of each individual instance), so I don't anticipate any problems there.

Elasticsearch and Postgres are not so easy: I think we can vertically scale these if necessary, but only within limits. Are we going to be hitting these services too hard, trying to generate too many emails in too short a time?

My guess is that we'll be fine but I think we'll have to see how it performs as we roll this out.

Constraint: actually sending the emails.

How fast can Mailchimp send emails? What is our account quota? Rate limit?

I think we're probably going to be fine here. Our current hourly rate limit in Mailchimp/Mandrill is 13,000 emails per hour which I think is roughly 5-10x more than we actually need.
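
As a back-of-the-envelope check on that headroom, here is a rough sketch. Only the 13,000/hour limit comes from the figures above; the nightly email count and the sending window are made-up illustrative numbers, and `headroom` is a hypothetical helper.

```python
HOURLY_RATE_LIMIT = 13_000  # current Mailchimp/Mandrill limit quoted above


def headroom(emails_per_night: int, window_hours: float) -> float:
    """Return how many times over the hourly limit covers the required rate."""
    required_per_hour = emails_per_night / window_hours
    return HOURLY_RATE_LIMIT / required_per_hour


# Hypothetical figures: 5,200 emails sent within a 2-hour window
# need 2,600 emails/hour, so the limit is 5x what is required.
print(headroom(5_200, 2))
```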

Constraint: our Mailchimp reputation score.

This determines our hourly rate limit with Mailchimp. The score is based on how many emails we send and how many bounces, spam complaints and Mailchimp complaints our emails get. We may need to ramp up this reputation score and hourly rate limit slowly over time, and if our emails bounce or users consider them to be spam, we might get throttled.

Constraint: getting the data into Mailchimp.

How fast can the Mailchimp API accept emails from us? And how fast can we send them into Mailchimp?

As far as I can tell we need to make one Mailchimp API call for each individual email we want to send, so we're going to have to make thousands of these API calls each night. Specifically, I think we'll be calling their send-template API with the async=True option (which enables optimised batch sending).
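
For illustration, a sketch of what the request body for one such call might look like. This is my reading of the Mandrill `messages/send-template` request format and should be checked against their API docs before use. The function name, template name, and merge variable names are hypothetical, and nothing here actually calls the API.

```python
import json


def build_send_template_payload(
    api_key: str, template_name: str, to_email: str, merge_vars: dict
) -> dict:
    """Build the JSON body for one send-template call (one email)."""
    return {
        "key": api_key,
        "template_name": template_name,
        # No editable template regions are filled in this sketch.
        "template_content": [],
        "message": {
            "to": [{"email": to_email, "type": "to"}],
            "global_merge_vars": [
                {"name": name, "content": content}
                for name, content in merge_vars.items()
            ],
        },
        # Ask for background (batch) sending rather than waiting on delivery.
        "async": True,
    }


payload = build_send_template_payload(
    "EXAMPLE_KEY",                 # placeholder, not a real key
    "instructor-digest",           # hypothetical template name
    "instructor@example.com",
    {"num_annotations": 12},       # hypothetical merge variable
)
print(json.dumps(payload, indent=2))
```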

I think we'll again be fine here. If our Mailchimp sending rate limit is 13,000 per hour then presumably their API can accept at least that many emails per hour from us. That doesn't mean that a single process trying to call the API thousands of times synchronously is going to be able to send the emails fast enough. We may have to parallelize this. But we can easily distribute the work among as many celery workers on as many instances as we want, so I think we should be able to hit their API as fast as we need to.
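
To get a feel for how much parallelism that might take, a rough estimate. The per-call latencies below are guesses for illustration, not measurements, and `workers_needed` is a hypothetical helper that assumes each worker makes one synchronous API call at a time.

```python
import math


def workers_needed(emails_per_hour: int, seconds_per_call: float) -> int:
    """Estimate how many concurrent synchronous callers sustain a given rate."""
    calls_per_worker_per_hour = 3600 / seconds_per_call
    return math.ceil(emails_per_hour / calls_per_worker_per_hour)


# Saturating the 13,000/hour limit at an assumed 0.5 s per API call:
print(workers_needed(13_000, 0.5))  # 2
# The same rate at an assumed 2 s per API call:
print(workers_needed(13_000, 2.0))  # 8
```

So even with pessimistic latency a handful of workers should be enough, if those guesses are anywhere near right.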

Constraint: getting tasks onto the celery queue.

I think we very likely want to have a celery task that generates and sends the nightly email for a user, and each night we want to enqueue one instance of this task for each active instructor. So each night a scheduled celery task will need to run that finds out who all the active instructors are and enqueues a send-email task for each instructor. This "bootstrap" celery task will need an efficient way to find out who all the active instructors are each night, and then it will need to enqueue thousands of celery tasks. Is that going to work?
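
If enqueueing one task per instructor turns out to be too slow, one alternative (a variation on the design above, not what it describes) would be to enqueue one task per batch of instructors. A minimal sketch of the bootstrap step under that assumption: the function names and batch size are hypothetical, and `enqueue` stands in for whatever puts a task on the queue (in celery that could be something like a task's `.delay`).

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` IDs without loading them all."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def enqueue_digests(
    instructor_ids: Iterable[int],
    enqueue: Callable[[List[int]], None],
    batch_size: int = 100,
) -> int:
    """Enqueue one send-email task per batch; return the number enqueued."""
    count = 0
    for batch in batched(instructor_ids, batch_size):
        enqueue(batch)
        count += 1
    return count


# Stand-in queue for demonstration: just collect the batches in a list.
queued: List[List[int]] = []
print(enqueue_digests(range(250), queued.append))  # 3
```

With a batch size of 1 this reduces to the one-task-per-instructor design.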

seanh commented 1 year ago

Slack thread

Question: can we roll instructor email digests out to just a few institutions at first? #4897