pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Test using redis as the celery broker #13702

Open miketheman opened 1 year ago

miketheman commented 1 year ago

We're using Amazon SQS as our message broker today.

Celery's support for SQS does not afford Monitoring or Remote Control from within the context of Celery.

> Missing monitor support means that the transport doesn't implement events, and as such Flower, `celery events`, celerymon and other event-based monitoring tools won't work. Remote control means the ability to inspect and manage workers at runtime using the `celery inspect` and `celery control` commands (and other tools using the remote control API).
>
> — https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/index.html

We can monitor items like queue depth for SQS via CloudWatch Metrics (and possibly in Datadog), but we have little visibility into what is in the queue itself.

@ewdurbin and I recently chatted about the notion of queue debouncing/superseding, and I raised the question of whether we might want to use Redis as the broker instead. We already use Redis as part of the stack, and we could leverage the same cluster/instance for the Celery broker. This would do a couple of things:

  1. Remove reliance on a provider's secret sauce service in favor of a portable technology
  2. Increase visibility via tools like Flower
  3. ~~Remove some extra dependencies needed for SQS like kombu~~
  4. Increase usage of our current Redis cluster - both in terms of bandwidth and memory. TBD on how to look at this.

We also discussed using SQLAlchemy and Postgres as a backend, and rejected it. The docs kinda support that decision.

The change "seems" easy, since most of our setup is already governed by a config var of `BROKER_URL`:

https://github.com/pypi/warehouse/blob/67d5b04228bd401eb75d29c3efec92383e51c9ad/warehouse/config.py#L180
https://github.com/pypi/warehouse/blob/38e0e0400f8585e382aa5d48836ef08fcfde742a/warehouse/tasks.py#L184
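To make the shape of that change concrete, here's a minimal, hypothetical sketch of the settings swap; the URLs are illustrative placeholders, not warehouse's real values:

```python
# Hypothetical sketch: the broker swap is a one-line settings change.
# "celery.broker_url" is the setting named elsewhere in this issue;
# the URL values below are illustrative, not warehouse's actual config.
settings = {
    # "celery.broker_url": "sqs://",                # current: Amazon SQS
    "celery.broker_url": "redis://redis:6379/0",    # proposed: Redis
}

assert settings["celery.broker_url"].startswith("redis://")
```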

We already use Redis in aspects of our celery lifecycle: https://github.com/pypi/warehouse/blob/67d5b04228bd401eb75d29c3efec92383e51c9ad/warehouse/config.py#L181-L182

We could set `celery.broker_url` to `REDIS_URL` - but I'm noticing that this persists a pattern of dumping all of the keys into Redis's default database `db:0` - which would make selecting specific keys harder, since everything would be intermingled. `KEYS *` can be an expensive operation (kinda like `SELECT * FROM <<EVERYTHING>>`).

We also share the same `db:0` with `oidc.jwk_cache_url`, `sessions.url`, `ratelimit.url`, and `warehouse.xmlrpc.cache.url` - so that's not 100% awesome either.
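One way to get that isolation without a separate cluster is to give the broker its own Redis database number in the URL path. A small sketch, assuming we derive the broker URL from the existing `REDIS_URL` (the helper name is hypothetical):

```python
# Hypothetical helper: point a Redis URL at a different database number,
# so the Celery broker keys live in (say) db 1 instead of the shared db 0.
from urllib.parse import urlparse, urlunparse


def with_db(redis_url: str, db: int) -> str:
    """Return the same Redis URL, but targeting database `db`."""
    parts = urlparse(redis_url)
    return urlunparse(parts._replace(path=f"/{db}"))


broker_url = with_db("redis://localhost:6379/0", 1)
# broker_url -> "redis://localhost:6379/1"
```

Redis databases share the same memory and process, so this only buys keyspace isolation (no more `KEYS *` intermingling), not resource isolation.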

Here's some semi-structured thoughts, totally open to more.

Obviously, I'm probably overlooking something, so would definitely want to hear other reasonings, opinions, thoughts, mistakes, etc.

dstufft commented 1 year ago

I think Redis had limitations that RabbitMQ and SQS didn't have, but those might have been fixed in the intervening years (or we may not care about them anymore). I don't recall exactly, but I think it was something like: Redis relied on Celery requeueing failed tasks, whereas with SQS and RabbitMQ the task would automatically be requeued if the worker didn't report success within some amount of time.

miketheman commented 1 year ago

> Redis relied on Celery requeueing failed tasks, whereas with SQS and RabbitMQ the task would automatically be requeued if the worker didn't report success within some amount of time.

That might have been the concept of `visibility_timeout`, which is only supported by the SQS and Redis transports.

https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html#visibility-timeout
https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html#id1

RabbitMQ had more "smarts" and the broker would decide how to handle when workers left and never came back.

The default is now 1 hour, and we don't override it for SQS as far as I can tell either - so we should be on par for that behavior.
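For reference, a hedged sketch of how that timeout is configured in Celery via `broker_transport_options` (per the Redis transport docs linked above); the value shown just spells out the one-hour default, so leaving it unset should behave the same:

```python
# Sketch: visibility_timeout for Celery's Redis (and SQS) transports.
# 3600 seconds is the documented default, made explicit here for clarity.
broker_transport_options = {
    "visibility_timeout": 3600,  # seconds before an unacked task is redelivered
}

assert broker_transport_options["visibility_timeout"] == 60 * 60
```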

dstufft commented 1 year ago

That might have been it, seems likely!

To be clear I don't have any real opinion on what broker implementation we use, I was just trying to remember the context the original decision was made in, to see if it still applied.

miketheman commented 1 year ago

An additional datapoint - we include some packages that are specific to the SQS implementation, such as pycurl, as part of kombu's `sqs` extra: https://github.com/celery/kombu/blob/1dfe4f3c86ab0fd2587a6fe8566bb3cef8c4a5d7/requirements/extras/sqs.txt#L2

We'd be able to drop that if we converted to using Redis.

miketheman commented 3 months ago

Closing. We have no plans to replace SQS for now, but the investigation proved useful in showing that it's possible, should we want to pick it up again. Before we do so, we should probably split the Redis clients apart to use distinct databases for isolation.

ewdurbin commented 2 months ago

I went back in Slack history of #pypi-admin and didn't see any particular motivator, mostly just that SQS was available 🤷🏼.

I think this is worth pursuing and will put some effort in.