Low-latency consumer polling

zidik commented 1 month ago

Scenario: I have a low-volume topic (0-0.5 msg/sec) for which I'd like to have a low propagation delay (up to 30ms) from producer to consumer.

Issue: With brod's default settings, the propagation latency hovers close to 1000ms because of sleep_timeout: 1000ms. Every time brod receives an empty response from the broker, it sleeps for 1000ms. In our low-volume topic, this happens almost every time.

Attempted solution: I turned off sleep_timeout (sleep_timeout: 0) and, to prevent excessive polls, configured the consumer to work in a "long-poll-like" manner:

min_bytes=1 - ensures that the broker would wait and return only if there is at least something in the topic
max_wait_time=10 000ms - if no messages arrive within 10 seconds, just return, and start a new request

This works perfectly with 1 partition. The latency is super low, and the request rate is also low: a new request is only sent when a message is received or 10 seconds pass.

Problem: As soon as I add another partition to the topic, it fails. The latency skyrockets to 10 000 - 20 000ms, as brod polls each partition one by one, 10 seconds each. 😞 This is because brod makes a separate request for each partition, and it sends all of these requests over a single connection. The Kafka broker handles only a single in-flight request per connection, so one running "long-poll" prevents the others from starting.
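To make the head-of-line blocking concrete, here is a rough Python model of it (not brod code). It assumes the polls rotate round-robin over the partitions on one connection, every empty poll blocks for the full max_wait_time, and only a single message is ever produced. The function name and its parameters are made up for illustration.

```python
def delivery_delay_ms(arrival_ms, partition, num_partitions, max_wait_ms):
    """Delay between a lone message arriving and a fetch returning it.

    Simplified model: one connection, long polls issued round-robin over
    the partitions starting at partition 0, each empty poll blocking for
    the full max_wait_ms before the next one can start.
    """
    now = 0.0      # start time of the poll currently in flight
    current = 0    # partition that the in-flight poll is waiting on
    while True:
        poll_end = now + max_wait_ms
        if current == partition:
            if arrival_ms <= now:
                # Message was already waiting when this poll started.
                return now - arrival_ms
            if arrival_ms < poll_end:
                # Message arrives while its own partition is being polled,
                # so the broker answers right away.
                return 0.0
        # Empty poll: the connection stays blocked until the timeout.
        now = poll_end
        current = (current + 1) % num_partitions


# One partition: the long poll is always on the right partition.
print(delivery_delay_ms(2_500, partition=0, num_partitions=1, max_wait_ms=10_000))
# Two partitions: the message lands on partition 0 just after the poll
# for partition 1 has started, so it waits for that poll to time out.
print(delivery_delay_ms(10_001, partition=0, num_partitions=2, max_wait_ms=10_000))
```

Under these assumptions the worst case grows by roughly one max_wait_time for every additional partition sharing the connection.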

I would have expected brod to make a single poll, but for all partitions.

Question: Is there any way to achieve the low-latency "long-poll" I described?

Considered alternative solutions:

Only reducing sleep_timeout to close to 0ms - This would result in all consumers constantly hammering the Kafka broker.
Using only one partition - it would work, but I will have bursts of high traffic, and I'd like to spread this load across multiple partitions.

zmstone commented 1 month ago

Thank you for the report. Batching fetch requests across partitions makes error handling more complicated. I have a plan to implement per-partition connections, WDYT?

zidik commented 1 month ago

Thanks for the quick reply! Yes, a separate connection per partition would help here, as each partition could be long-polled independently.
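For illustration only, a toy asyncio sketch of why independent long polls help. This is neither brod nor Kafka code: each asyncio.Queue stands in for a partition reached over its own connection, timings are scaled down to fractions of a second, and all names (long_poll, demo, and their parameters) are hypothetical.

```python
import asyncio
import time


async def long_poll(queue, max_wait_s):
    """Return the next message, or None if max_wait_s elapses first."""
    try:
        return await asyncio.wait_for(queue.get(), timeout=max_wait_s)
    except asyncio.TimeoutError:
        return None


async def demo(num_partitions=2, max_wait_s=0.5, send_after_s=0.05):
    # One queue per partition; each is polled by its own task, which
    # plays the role of a dedicated connection.
    queues = [asyncio.Queue() for _ in range(num_partitions)]

    async def produce():
        # Publish one message to the last partition after a short pause.
        await asyncio.sleep(send_after_s)
        await queues[-1].put("hello")
        return time.monotonic()

    async def consume(queue):
        msg = await long_poll(queue, max_wait_s)
        return msg, time.monotonic()

    # All long polls are in flight at the same time, so an idle
    # partition cannot delay a busy one.
    results = await asyncio.gather(produce(), *(consume(q) for q in queues))
    sent_at = results[0]
    msg, received_at = results[-1]
    return msg, received_at - sent_at


msg, delay_s = asyncio.run(demo())
print(msg, f"delivered {delay_s * 1000:.1f}ms after it was produced")
```

The poll on the idle partition still runs until its timeout, but the message on the other partition is handed over as soon as it is produced.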

kafka4beam / brod

Low-latency consumer polling #577