segmentio / kafka-go

Kafka library in Go
MIT License
7.61k stars 790 forks source link

Expose error channel to clients when using Writer in async mode #405

Open Evanjt1 opened 4 years ago

Evanjt1 commented 4 years ago

Currently, if Writer is in async mode, ie: WriterConfig.Async == true, any errors that occur when writing the messages to kafka are ignored. It would be nice for a variety of use cases to have visibility of these errors while in async mode. I'd specifically like visibility so I can process failed messages and possibly send to dead-letter queue without losing the niceties of sending asynchronously.

As a possible implementation, I was thinking of adding an errors channel (chan WriterError) which a client can read from retrieve the errors. The channel would be buffered and the size would be controlled by the WriterConfig. Sarama has something similar, see https://github.com/Shopify/sarama/blob/6e063e619ea7e2bd19a18782a3c727cf0d6b0c3e/async_producer.go#L47

Clients will need to read from the error channel otherwise it will block the partitionWriters. This functionality would need to be completely optional.

401 will provide the error type needed to support this so it feels like a natural enhancement to make. Interested to hear feedback

dmarkhas commented 4 years ago

+1 We really like this library, but this is a deal breaker for us - if our services are failing to produce to Kafka, we MUST know about it. Confluent's lib gives us the ability to build logging, metrics and alerts around failures, and while I find its API quite cumbersome we really can't give up that visibility (and we have to use Async writes due to our scale).

Perhaps it would be easier to add an optional callback method to WriteMessages, that would be invoked when the result is ready, instead of exposing a channel, WDYT?

achille-roussel commented 4 years ago

if our services are failing to produce to Kafka, we MUST know about it

kafka-go exposes metrics and logs about the errors as well via WriterStats and WriterConfig.ErrorLogger.

Could you explain what limitations you are running into with these solutions?

I'm also interested to hear why using a synchronous writer (which returns the errors) does not work in your case.

dmarkhas commented 4 years ago

They simply don't align with our existing logging and metrics infrastructure. For example, we use https://github.com/rcrowley/go-metrics - I don't really see how WriterStats would integrate with it, whereas in an event driven approach we simply increment a meter whenever an error occurs. Furthermore, these metrics are somewhat limited because they are without context - for instance, what is errors? We can't really derive any meaningful information from these metrics...

Letting the user plug in his own metrics system via any kind of API (channel, callback, etc.) would be amazing.

Evanjt1 commented 4 years ago

Perhaps it would be easier to add an optional callback method to WriteMessages, that would be invoked when the result is ready, instead of exposing a channel, WDYT?

I would probably prefer channels over callbacks since it gives clients more control over resource allocation, but it also adds more client development overhead since error readers will have to be created and managed to consume from the channel.

TLDR tradeoffs either way

berupp commented 4 years ago

@Evanjt1 Channels in public APIs are discouraged over callbacks due to a few significant drawbacks: https://dave.cheney.net/practical-go/presentations/qcon-china.html#_leave_concurrency_to_the_caller

I'd also love to have a callback that allows me to decide how I want to proceed with the message that failed to be produced (after internal retries).

dmarkhas commented 4 years ago

I took a stab at a naive implementation via #420

RO-29 commented 4 years ago

Hello, are there any updates on #420 ??

I would really like to use async mode with ability to log/receive error

Thx a lot for this 😄

achille-roussel commented 3 years ago

Logging errors still happens in async mode, the errors are sent to the logger configured on the writer.

I am curious to understand what type of logic would be implemented when capturing errors on async writer; it seems that anything other than just logging the error would be better served by using the writer in sync mode so the application receives back-pressure when normal operations cannot be performed.

dmarkhas commented 3 years ago

Mostly metrics in our case, which serve as the basis for monitors and alerts. A high rate of async produce failures should start ringing bells, but we shouldn't be sacrificing throughput if we don't need to.

ghstahl commented 1 year ago

The reason I want to stick with async==true: I don't want to write retry logic when this one is very good.

What happens when in async==true mode I flood the queue?
I would think that knowing these errors I could have a circuit breaker that would stop the writes.