karafka / waterdrop

Standalone Karafka library for producing Kafka messages
https://karafka.io
MIT License
252 stars 39 forks source link

Dispatched collection in the ProduceManyError contains DeliveryHandle and DeliveryReport #518

Closed ujasur closed 1 day ago

ujasur commented 1 month ago

According the documentation, dispatched collection in the ProduceManyError must contain only instances of DeliveryHandle but at some error cases it contains DeliveryReport.

The reason is that the produce_many_sync method of the Sync class uses in place map! when waiting for messages to be delivered and when one of the messages in the collection fails to be delivered then final dispatched collection passed to the ProduceManyError has an instances of the DeliveryReport and DeliveryHandle. Link to the loop: https://github.com/karafka/waterdrop/blob/master/lib/waterdrop/producer/sync.rb#L72

An user code may run into the bug when iterating over the elements of dispatched collection in the reported error ProduceManyError

mensfeld commented 1 month ago

I think we need to cover three cases:

produce may raise an error before all messages are dispatched despite error retries - in this case a bubbled up error WILL cause dispatched to contain handlers

if all produce are successful in sync, we wait. With sync we wait for all. Even if one is errored, we wait on all and return delivery reports that MAY contain errors. The #cause will be first error that occurred during dispatch

the question is, should we wait in handlers in case of an inline error that bubbles up in the middle of #produce array

because something like buffer overflow in the internal queue will cause an inline error breaking the initial produce iteration

in such case should we also wait on those that were successfully enqueued :thinking_face:

technically for sync (as async will stay as it is because it always returns handles) we should wait on them

so technically we should also wait in the rescue flow

mensfeld commented 1 month ago

there is one more case: when transactional inline errors, the dispatched should be empty because transaction is rolled back

mensfeld commented 1 month ago

so let me write the cases for sync

  1. if there is a transaction it should either contain ALL reports or nothing in case of error because technically nothing is dispatched because transaction is rolled back
  2. if there is an inline error in the middle of dispatching we should wait on those that were dispatched to have handlers. Errored or not but final verdict handlers and the rest will not be there
  3. in case there is no inline error, we should wait on ALL handler regardless of their delivery status (failure or not) and reraise in case ANY is failed but with ALL of them being materialized

@ujasur does it make sense?

mensfeld commented 5 days ago

@ujasur you mind checking https://github.com/karafka/waterdrop/pull/525?

ujasur commented 3 days ago

Maciej , sorry for long delay. I think it make sense to me too. One thing I am not sure whether it makes sense to give an option to an user not to wait on ALL handles if an inline error occurs. Because an user may not care waiting delivery reports on other messages if there is one failed message. Alternatively, the error class may contain 2 collection, one for the deliver reports and other for the handlers.

Having said that, your outlined logic still support our use case, thank you!

mensfeld commented 2 days ago

Because an user may not care waiting delivery reports on other messages if there is one failed message. Alternatively, the error class may contain 2 collection, one for the deliver reports and other for the handlers.

Can we agree on shipping this as a feature when need comes? Technically with all timeouts set correctly (defaults are correct) the wait will handle any extra messages failed deliveries. Without waiting you don't get the whole picture.

If you are ok with leaving this for now I will merge and start crafting new release.

ujasur commented 2 days ago

Thank you, I am totally fine with that

mensfeld commented 1 day ago

Ok. I'm merging this to master and will need like a week or so to catch up on things and craft a release since it's a breaking one.