Open Evanjt1 opened 4 years ago
+1 We really like this library, but this is a deal breaker for us - if our services are failing to produce to Kafka, we MUST know about it. Confluent's lib gives us the ability to build logging, metrics and alerts around failures, and while I find its API quite cumbersome we really can't give up that visibility (and we have to use Async writes due to our scale).
Perhaps it would be easier to add an optional callback method to WriteMessages, that would be invoked when the result is ready, instead of exposing a channel, WDYT?
if our services are failing to produce to Kafka, we MUST know about it
kafka-go exposes metrics and logs about the errors as well via WriterStats
and WriterConfig.ErrorLogger
.
Could you explain what limitations you are running into with these solutions?
I'm also interested to hear why using a synchronous writer (which returns the errors) does not work in your case.
They simply don't align with our existing logging and metrics infrastructure. For example, we use https://github.com/rcrowley/go-metrics - I don't really see how WriterStats
would integrate with it, whereas in an event driven approach we simply increment a meter whenever an error occurs.
Furthermore, these metrics are somewhat limited because they are without context - for instance, what is errors
? We can't really derive any meaningful information from these metrics...
Letting the user plug in his own metrics system via any kind of API (channel, callback, etc.) would be amazing.
Perhaps it would be easier to add an optional callback method to WriteMessages, that would be invoked when the result is ready, instead of exposing a channel, WDYT?
I would probably prefer channels over callbacks since it gives clients more control over resource allocation, but it also adds more client development overhead since error readers will have to be created and managed to consume from the channel.
TLDR tradeoffs either way
@Evanjt1 Channels in public APIs are discouraged over callbacks due to a few significant drawbacks: https://dave.cheney.net/practical-go/presentations/qcon-china.html#_leave_concurrency_to_the_caller
I'd also love to have a callback that allows me to decide how I want to proceed with the message that failed to be produced (after internal retries).
I took a stab at a naive implementation via #420
Hello, are there any updates on #420 ??
I would really like to use async mode with ability to log/receive error
Thx a lot for this 😄
Logging errors still happens in async mode, the errors are sent to the logger configured on the writer.
I am curious to understand what type of logic would be implemented when capturing errors on async writer; it seems that anything other than just logging the error would be better served by using the writer in sync mode so the application receives back-pressure when normal operations cannot be performed.
Mostly metrics in our case, which serve as the basis for monitors and alerts. A high rate of async produce failures should start ringing bells, but we shouldn't be sacrificing throughput if we don't need to.
The reason I want to stick with async==true
: I don't want to write retry logic when this one is very good.
What happens when in async==true
mode I flood the queue?
I would think that knowing these errors I could have a circuit breaker that would stop the writes.
Currently, if
Writer
is in async mode, ie:WriterConfig.Async == true
, any errors that occur when writing the messages to kafka are ignored. It would be nice for a variety of use cases to have visibility of these errors while in async mode. I'd specifically like visibility so I can process failed messages and possibly send to dead-letter queue without losing the niceties of sending asynchronously.As a possible implementation, I was thinking of adding an errors channel (
chan WriterError
) which a client can read from retrieve the errors. The channel would be buffered and the size would be controlled by the WriterConfig. Sarama has something similar, see https://github.com/Shopify/sarama/blob/6e063e619ea7e2bd19a18782a3c727cf0d6b0c3e/async_producer.go#L47Clients will need to read from the error channel otherwise it will block the partitionWriters. This functionality would need to be completely optional.
401 will provide the error type needed to support this so it feels like a natural enhancement to make. Interested to hear feedback