pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0
3.55k stars 956 forks source link

Support 'debounce-able' tasks #12582

Open di opened 1 year ago

di commented 1 year ago

Right now, our task queue supports 1:1 events - one task fired results in one 'action' occuring.

For a number of reasons, it would be interesting to explore 'debouncing' these events: either queuing them up and aggregating them into a single 'action', or ensuring that a given 'action' doesn't happen multiple times within a given time period.

Previous chat history between maintainers on the subject included below is specific to emails, but debouncing tasks like purges may be a way to improve https://github.com/pypi/warehouse/issues/12214.


@di:

so we sort of have a general problem with email sending, in that we occasionally want to send an email if some action occurs (like an upload to a project with 2FA required with a non-2FA enabled account) but we don't want to send a flurry of the same email over a given period if the action is repeated (like uploading multiple files).

I'm thinking about working on a way for us to 'debounce' certain emails so they are only sent once in a period, but wanted to see if y'all have thoughts on the approach before I dig into it

I think it's going to require us to maintain some state on what emails have been sent to what users in a given period

which, we already kinda do, but not sure if we should shoehorn this into that or not

@dstufft:

The state we save right now is specific to SES, in theory our backends support other mail sending (which we use in local dev, but not elsewhere), we'd probably want debouncing to be independent of the SES backend I think

I could think of some really easy ways to do it with temporal :anguished: but I think we need to decide between two overall behaviors: We just rate limit emails by some key (type? does other context play into the rate limit?), so if you do the same thing X times, the rate limit debounces future emails. We queue email sends for a period of time (5 minutes? 10 minutes?), and sending repeat emails actually just appends more data to the queued email. The first of those is easier to implement, you can probably just use the ratelimiting stuff we have in the email sending task, and just treat a rate limit failure as a successful task send. The second of those is harder to implement, but I think provides a much better user experience (imagine for instance we send emails say anytime someone publishes a file, rate limiting would mean you get a single email even if you pushes 5 files, but it's context would only contain the data for the first file, not the 2nd-5th, the queued + collapsed context you'd get a single email, that mentioned all 5 files) you'll also probably need to add a distributed lock into the mix either way well maybe not in the scond option since the "queue" acts as a defacto distributed lock as long as it has ACID transactions (so like a pending emails table)

@di:

I guess there might be two types of emails we'd want to debounce, those that could be combined into a single email with a short debounce limit like 5 minutes, and those that we want to limit to avoid spam, with a debounce limit of like 24 hours, or multiple days.

@dstufft:

Yea that's a good point too

like the 2FA warning email like you mentioned

@di:

I could also see something like a "daily summary of account activity" email being popular though

di commented 1 year ago

Pretty rough attempt at this here: https://github.com/pypi/warehouse/compare/main...di:warehouse:fix-12582?expand=1

One of the challenges is that it's difficult to pass information between the task creation and the time when the task is actually run.

@ewdurbin found a clever workaround to avoid passing the call counter around: https://stackoverflow.com/a/34290849, but we still need a good way to configure the "debounce key" in code, and then also have that available to the task at runtime.