hypothesis / product-backlog

Where new feature ideas and current bugs for the Hypothesis product live

Investigate viability of SQS as an alternative to RabbitMQ for the message broker in h #652

Closed by robertknight 6 years ago

robertknight commented 6 years ago

A partner we are working with would prefer to use Amazon SQS rather than RabbitMQ for the message queue if possible, to avoid adding infrastructure for their ops team to manage (whether they run it themselves or use a hosted service, e.g. CloudAMQP).

Celery supports SQS, as does the underlying Kombu library, which we also use directly. This issue is to research whether h could support SQS and what problems doing so would raise.

robertknight commented 6 years ago

The way that the realtime messaging is implemented by h.realtime is not going to work with Amazon SQS.

h.realtime relies on a feature of AMQP where "direct exchanges" can, despite their name, be used to send one message to multiple consumers by creating multiple queues which are associated with the same routing key. When a producer publishes a message with that routing key, the exchange will deliver it to all queues bound to this key.

In our context, the web process publishes a message to a routing key which is a category of event (eg. realtime-annotation) and each websocket process creates one queue per category with the same routing keys. When the web process publishes a realtime update message, each websocket process will receive it.
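To make the multicast behaviour concrete, here is a minimal toy model in plain Python. It is not Kombu or RabbitMQ code; the `DirectExchange` class and the `ws-N` queue names are made up for illustration. It only shows the idea that the broker keeps a binding table and copies a message into every queue bound to the routing key.

```python
from collections import defaultdict, deque


class DirectExchange:
    """Toy model of a broker-side direct exchange (not Kombu or RabbitMQ code)."""

    def __init__(self):
        # routing key -> names of every queue bound to that key
        self.bindings = defaultdict(list)
        self.queues = {}

    def bind(self, queue_name, routing_key):
        """Create a queue and bind it to a routing key."""
        self.queues[queue_name] = deque()
        self.bindings[routing_key].append(queue_name)

    def publish(self, routing_key, message):
        """Copy the message into every queue bound to the routing key."""
        for queue_name in self.bindings[routing_key]:
            self.queues[queue_name].append(message)


exchange = DirectExchange()

# Each websocket process declares its own queue with the shared routing key.
for worker in ("ws-1", "ws-2", "ws-3"):
    exchange.bind(f"{worker}.realtime-annotation", "realtime-annotation")

# The web process publishes once ...
exchange.publish("realtime-annotation", {"action": "create", "id": "abc"})

# ... and every websocket process finds a copy in its own queue.
received = {name: list(queue) for name, queue in exchange.queues.items()}
```

The important part is that the binding table lives in one central place, so a single publish can reach all three queues.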

Making this multicasting happen relies on a central entity (the RabbitMQ message broker) knowing about all the queues and their routing keys. When using Kombu's direct messaging with Amazon SQS however, the "exchange" exists purely in memory in each client and the routing key is used as the SQS queue name. If my understanding is correct, this means that if multiple consumers try to pull from the "exchange" with the same routing key, then one of the consumers (eg. one of the websocket processes) will win and the others will never see the message.
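The failure mode can be sketched with another toy model, again in plain Python rather than real Kombu or SQS code. It assumes, as described above, that the routing key is simply used as the queue name. The `SharedQueueBroker` class and the worker names are hypothetical, and details such as SQS visibility timeouts are left out.

```python
from collections import deque


class SharedQueueBroker:
    """Toy model of a transport where the routing key is just the queue name."""

    def __init__(self):
        self.queues = {}

    def publish(self, routing_key, message):
        # No central binding table: the message lands in the single queue
        # named after the routing key.
        self.queues.setdefault(routing_key, deque()).append(message)

    def receive(self, routing_key):
        """Pop the next message, or return None if the queue is empty."""
        queue = self.queues.get(routing_key)
        return queue.popleft() if queue else None


broker = SharedQueueBroker()
broker.publish("realtime-annotation", {"action": "create", "id": "abc"})

# Three websocket processes poll the same queue. In this model the first
# poller wins; the other two find the queue already empty.
deliveries = {
    worker: broker.receive("realtime-annotation")
    for worker in ("ws-1", "ws-2", "ws-3")
}
```

In this model the first poller always wins. With a real queue, which consumer wins would depend on timing, but the outcome is the same: one consumer gets the message and the rest get nothing.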

Kombu also supports topic and fanout exchanges, which are designed more directly for this kind of multicast. The documentation states that the Amazon SQS transport supports fanout exchanges by storing the routing table in SimpleDB, but it looks like this support was removed some time ago.

robertknight commented 6 years ago

I've investigated our use of Celery for background tasks. Fortunately the way we use Celery for tasks conforms to the various limitations that come with SQS. Namely:

By adding the necessary runtime dependencies (boto3, pycurl Python deps and the curl-dev Alpine package) I was able to deploy a Docker image configured to use SQS for the broker instead of RabbitMQ and trigger a successful background task execution on the deployed container.
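For reference, a minimal sketch of how the broker settings might be chosen from the environment. This is not the code from the h PR: the `celery_broker_settings` helper, the `AWS_REGION` variable, the default AMQP URL and the option values are assumptions for illustration. `broker_url`, `broker_transport_options` and the `region`, `visibility_timeout` and `polling_interval` keys are the setting names documented by Celery for the SQS transport.

```python
import os


def celery_broker_settings(environ=None):
    """Build Celery broker settings from environment variables.

    Hypothetical helper, not taken from h. With a bare ``sqs://`` URL the
    AWS credentials are expected to come from the environment or an IAM role.
    """
    environ = os.environ if environ is None else environ
    broker_url = environ.get("BROKER_URL", "amqp://guest:guest@localhost:5672//")
    settings = {"broker_url": broker_url}

    if broker_url.startswith("sqs://"):
        settings["broker_transport_options"] = {
            # Assumed values; tune them for the actual deployment.
            "region": environ.get("AWS_REGION", "us-east-1"),
            "visibility_timeout": 3600,  # seconds before an unacked task is redelivered
            "polling_interval": 1,  # seconds between polls of an empty queue
        }
    return settings


sqs_settings = celery_broker_settings({"BROKER_URL": "sqs://"})
amqp_settings = celery_broker_settings({})
```

The resulting dictionary could then be applied with something like `app.conf.update(sqs_settings)` on a Celery app.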

I still need to verify that periodic tasks function correctly when using SQS.

robertknight commented 6 years ago

The Celery docs state that the SQS broker support is stable. The note in the README appears to be out of date. Fixed by https://github.com/celery/celery/pull/4756

robertknight commented 6 years ago

Our periodic tasks are executed in production by a separate "h-periodic" service. That will need some additional changes to support SQS.

robertknight commented 6 years ago

PR for h - https://github.com/hypothesis/h/pull/5035

robertknight commented 6 years ago

I've verified that SQS is working with Celery's background tasks executed via h-periodic as well. So to wrap this up:

  1. https://github.com/hypothesis/h/pull/5035 enables use of Amazon SQS for h
  2. https://github.com/hypothesis/h-periodic/pull/8 enables use of Amazon SQS for triggering periodic tasks in h
  3. The realtime server is not going to work with SQS. Fortunately it is not an essential part of h. https://github.com/hypothesis/h/pull/5035 disables realtime updates when BROKER_URL is set to an sqs:// URL.
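As a rough illustration of the third point, a check along these lines could decide whether realtime updates should be enabled. This is a sketch only, not the code from the PR; the `realtime_supported` function name is hypothetical.

```python
from urllib.parse import urlparse


def realtime_supported(broker_url):
    """Return False for brokers that cannot multicast realtime messages.

    Hypothetical helper: it treats any URL with the ``sqs`` scheme as
    unsupported and every other broker URL as supported.
    """
    return urlparse(broker_url).scheme != "sqs"


sqs_ok = realtime_supported("sqs://")
amqp_ok = realtime_supported("amqp://guest:guest@localhost:5672//")
```

Here `sqs_ok` is `False` and `amqp_ok` is `True`, so the web and websocket processes would skip publishing and consuming realtime messages only in the SQS case.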

robertknight commented 6 years ago

I'm going to wrap this up for now. The Elasticsearch migration is taking priority this week but we'll get the SQS PRs merged when people have sufficient bandwidth free.