keathley opened this issue 6 years ago
It is possible to lose messages on node death/crash.
We try to avoid this scenario with the back-pressure mechanism. The GenStage pipeline will only consume events if there is available demand. When max_demand (default 50) is reached, KaufmannEx.Stages.Producer withholds its reply to the Kafka consumer, which keeps the consumer from committing its offset or requesting more events until there is spare demand.
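For illustration, here is a minimal sketch of a producer that provides that kind of back-pressure by withholding its reply until downstream demand is available. The module name and the notify/1 function are hypothetical stand-ins, not the actual KaufmannEx.Stages.Producer code:

```elixir
defmodule MyApp.BlockingProducer do
  # Illustrative sketch only -- not KaufmannEx's actual Stages.Producer.
  use GenStage

  def start_link(_opts), do: GenStage.start_link(__MODULE__, :ok, name: __MODULE__)

  # The Kafka consumer calls this and blocks until the producer replies,
  # which only happens once downstream demand can absorb the batch.
  def notify(events), do: GenStage.call(__MODULE__, {:notify, events}, :infinity)

  @impl true
  def init(:ok), do: {:producer, %{pending: nil, demand: 0}}

  @impl true
  def handle_call({:notify, events}, _from, %{demand: demand} = state)
      when demand >= length(events) do
    # Spare demand: dispatch immediately and unblock the consumer.
    {:reply, :ok, events, %{state | demand: demand - length(events)}}
  end

  def handle_call({:notify, events}, from, state) do
    # No spare demand: hold the reply, which blocks the Kafka consumer
    # (and therefore its offset commits and fetches) until demand arrives.
    {:noreply, [], %{state | pending: {events, from}}}
  end

  @impl true
  def handle_demand(incoming, %{pending: {events, from}} = state) do
    # Demand arrived: release the buffered batch and unblock the consumer.
    GenStage.reply(from, :ok)
    new_demand = max(state.demand + incoming - length(events), 0)
    {:noreply, events, %{state | pending: nil, demand: new_demand}}
  end

  def handle_demand(incoming, state) do
    {:noreply, [], %{state | demand: state.demand + incoming}}
  end
end
```

The unanswered GenStage.call/3 is what provides the back-pressure: as long as the reply is withheld, the consumer process is stuck in that call and cannot commit or fetch.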
Also, by using the KafkaEx GenConsumer's :async_commit, we defer committing the offset until a batch of events has been handled. By default it commits every 100 events or every 5 seconds.
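A sketch of what that looks like on the KafkaEx side; the module name and the hand-off to the producer sketched above are hypothetical:

```elixir
defmodule MyApp.Consumer do
  use KafkaEx.GenConsumer

  # message_set is a list of KafkaEx.Protocol.Fetch.Message structs.
  def handle_message_set(message_set, state) do
    # Hand the batch to the GenStage pipeline (the blocking call from the sketch above).
    :ok = MyApp.BlockingProducer.notify(message_set)

    # :async_commit defers the offset commit; KafkaEx flushes commits once
    # commit_threshold messages (default 100) or commit_interval
    # milliseconds (default 5_000) have accumulated.
    {:async_commit, state}
  end
end
```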
If event processing time is heterogeneous, some slow events may not complete until well after the committed offset has moved past them; for example, the committed offset could already be at 150 while a handler is still working on the message at offset 90. In that scenario those in-flight events may be lost on crash, since the consumer resumes from the committed offset.
Thanks for the further clarity @gerbal. To be clear, I think we both agree that you can drop messages under certain failure scenarios. At the very least it seems like that's worth calling out in some documentation.
More generally, I can't help but feel that GenStage is causing a lot more problems than benefits here. Your Kafka consumers already provide back-pressure. If you remove GenStage from the mix, then the worst-case scenario is that you re-process a message after a crash or failure, but you won't drop the message completely.
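A sketch of the alternative being suggested here: do the work inline in the consumer so offsets are only ever committed for messages that have already been processed. The handler function is hypothetical:

```elixir
defmodule MyApp.SynchronousConsumer do
  use KafkaEx.GenConsumer

  def handle_message_set(message_set, state) do
    # Process the batch before returning, so nothing beyond it can be
    # committed until every message in it has been handled.
    Enum.each(message_set, &MyApp.EventHandler.handle_event/1)

    # Commits are still batched, but now they only cover processed messages:
    # a crash means re-processing, never silent loss.
    {:async_commit, state}
  end
end
```

This gives at-least-once delivery, so downstream handlers would need to tolerate the occasional duplicate.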
I could see the benefit of a GenStage pipeline if you needed to compose many long-running, stateful processes, essentially if you were trying to replace something like Storm. But in that case I'd expect to see more partitioning of events across consumers, similar to what I described in #16, as well as exposing GenStage's producer/consumer API. Based on the current API, the description, and y'all's blog post, it seems like the general use case is consuming events and producing events with a few side effects in the mix. Given that use case I really don't think you need GenStage. Just my opinion though :).
I'm still working on a test that can prove this, but I believe there is potential for losing messages inside a consumer. Apologies if my understanding of the code is incorrect.
Here's the issue as I see it. The Kafka consumer pulls messages from Kafka, notifies the GenStage pipeline, and then commits offsets asynchronously. Because processing happens asynchronously, it's possible for the consumer to commit offsets before any downstream work is done. If the processing crashes for any reason (deployment, crash, node failure, etc.), those messages won't be re-processed since the offset has already been committed.