Message broker preferences

Context

We work in an environment where we don't want any data to be lost. (By contrast, if we were just tracking user activity, we wouldn't mind losing a few click events.)

As such, we care about:

persistence (e.g. messages survive broker shutdowns)
durability (e.g. a message sits in a durable queue while the subscriber is down)
acknowledgements (a message is not deleted until it is acknowledged by a subscriber)
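
To make the three properties concrete, here is a toy sketch, not how RabbitMQ or any real broker is implemented. It assumes a hypothetical `ToyQueue` class backed by SQLite: messages are written to disk (persistence), wait in the table while no consumer is running (durability), and are deleted only on `ack` (acknowledgements).

```python
import sqlite3


class ToyQueue:
    """Hypothetical stand-in for a broker queue, stored in an SQLite file."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY, "
            "body TEXT NOT NULL, "
            "delivered INTEGER NOT NULL DEFAULT 0)"
        )
        # On (re)start nothing is in flight, so any message that was
        # delivered but never acknowledged becomes deliverable again.
        self.db.execute("UPDATE messages SET delivered = 0")
        self.db.commit()

    def publish(self, body):
        """Store the message on disk before returning."""
        self.db.execute("INSERT INTO messages (body) VALUES (?)", (body,))
        self.db.commit()

    def get(self):
        """Return the oldest undelivered (id, body), or None if there is none."""
        row = self.db.execute(
            "SELECT id, body FROM messages WHERE delivered = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            self.db.execute("UPDATE messages SET delivered = 1 WHERE id = ?", (row[0],))
            self.db.commit()
        return row

    def ack(self, message_id):
        """Delete the message only once the consumer confirms it was processed."""
        self.db.execute("DELETE FROM messages WHERE id = ?", (message_id,))
        self.db.commit()
```

If a consumer calls `get()` and then crashes before calling `ack()`, the message is still in the file and is delivered again after a restart. That is the behaviour we want from a broker.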

Choices

The initial choice between Redis and RabbitMQ for Kingfisher Process is described at https://github.com/open-contracting/kingfisher-process/issues/232#issuecomment-571687803. Redis with high durability is still "Very very slow", according to the docs. Also, to re-process unacknowledged messages, it looks like a worker would need to run XAUTOCLAIM on restart (the broker doesn't do it automatically).
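
As a rough sketch of what that restart step could look like, the hypothetical helper below pages through a consumer group's pending entries with XAUTOCLAIM. It assumes `client` is a redis-py client whose `xautoclaim()` returns the next cursor as the first element of its reply and the claimed messages as the second; older redis-py releases may return a different shape, so check this against the installed version. The stream, group and consumer names are placeholders.

```python
def reclaim_pending(client, stream, group, consumer, min_idle_ms=60_000, count=100):
    """Claim messages that were delivered to some consumer but never acknowledged.

    Assumption: `client` behaves like a redis-py client, e.g. redis.Redis().
    Returns the list of claimed messages, for the worker to re-process.
    """
    claimed = []
    cursor = "0-0"
    while True:
        reply = client.xautoclaim(
            stream, group, consumer, min_idle_ms, start_id=cursor, count=count
        )
        # Assumed reply shape: (next_id, messages) on Redis 6.2;
        # Redis 7 appends a third element listing deleted IDs.
        cursor, messages = reply[0], reply[1]
        claimed.extend(messages)
        # A cursor of 0-0 means the whole pending list has been scanned.
        if cursor in ("0-0", b"0-0"):
            return claimed
```

A worker would call this once at startup, process and XACK each returned message, and only then start reading new messages.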

RabbitMQ is working out okay, but it seems to shut down due to lack of memory and/or have heartbeat timeouts (despite having a thread dedicated to the heartbeat) at least once every few weeks. It's perhaps a problem with any distributed processing, but it would be nice not to have this noise. (We'll try upgrading from 3.8 to 3.11.)

Some other options by Apache include ActiveMQ, Kafka and Pulsar.

Kafka doesn't implement AMQP like ActiveMQ or RabbitMQ but uses its own protocol, in which messages aren't acknowledged on the broker; instead, consumers record their position in a partition. Messages need to be deleted by some other policy (e.g. max count or max age). Kafka partitions a "topic" such that one consumer (per consumer group) reads a partition. This can cause a bad/slow message or a stuck consumer to block later messages in that partition (unlike RabbitMQ). Kafka doesn't have a built-in UI. Kafka is designed for high-volume streams rather than long-running tasks. This paper compares RabbitMQ 3.5 to Kafka 0.10. From this 2022 post:

As Kafka and RabbitMQ operators, we feel that it's a bit more complicated to handle failures in Kafka. The process to recover or fix something is usually more time consuming and bit more messy.
ActiveMQ implements AMQP 1.0 whereas RabbitMQ uses 0.9.1 (an official plugin is available for 1.0). It's written in Java so it comes with the overhead of deploying Java software.
Pulsar is much less popular, so might not be a good choice just for that reason.
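
The blocking behaviour mentioned for Kafka can be illustrated with a deliberately simplified model that uses neither Kafka nor RabbitMQ. The function names and the `"bad"` message are made up for illustration: one function consumes by position in an ordered log, the other treats each message independently, as with per-message acknowledgements.

```python
def consume_partition(log, handle, position=0):
    """Position-based consumption: read in order and stop at the first failure.

    Returns the new position, i.e. the index of the next message to read.
    """
    while position < len(log) and handle(log[position]):
        position += 1
    return position


def consume_queue(messages, handle):
    """Acknowledgement-based consumption: each message succeeds or fails alone.

    Returns the messages left unacknowledged, which a broker would redeliver.
    """
    return [message for message in messages if not handle(message)]


processed = []


def handle(message):
    """Hypothetical handler: 'bad' stands for a message the consumer gets stuck on."""
    if message == "bad":
        return False
    processed.append(message)
    return True


messages = ["a", "bad", "c"]
position = consume_partition(messages, handle)  # stops at index 1; "c" is never reached
unacked = consume_queue(messages, handle)       # "a" and "c" are handled; "bad" remains
```

In the first case, the consumer cannot move past the bad message without skipping it or parking it elsewhere. In the second, only the bad message is left waiting.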

Redpanda is similar to Kafka. I don't know if it'll go the way of Elasticsearch (i.e. a need to commercialize leading to proprietary licensing). That said, it apparently has less deployment overhead, if we wanted Kafka.

Non-brokers

ZeroMQ has no persistence, durability (message retention), or acknowledgements, so it's not an option. (Such features can perhaps be built on top of ZeroMQ, but that's extra overhead for us.)

I looked briefly into Apache Avro, which provides RPC and a fast row-based format.

The RPC requires the receiver to be running (like ZeroMQ): if Process goes down, then messages sent from Collect can be lost. The format's "fast writes" are not needed between Collect and Process, because the source's latency is the bottleneck.
