Producer ignorable events

fragoulis commented 3 years ago

// * Producing is an asynchronous operation so the client notifies the application
// of per-message produce success or failure through something called delivery reports.
// Delivery reports are by default emitted on the `.Events()` channel as `*kafka.Message`
// and you should check `msg.TopicPartition.Error` for `nil` to find out if the message
// was succesfully delivered or not.

However, what they do not mention is that the .Events() channel does not always consist of *kafka.Messages. This is not so obvious, however they somewhat admit it in the how to setup the producer example:

switch ev := e.(type) {
case *kafka.Message:
  ...
  return

default:
  fmt.Printf("Ignored event: %s\n", ev)

Not only do they say "not all objects in the events channel are kafka messages", but they are also ignorrable (you can optionally log them).

What we present as producer errors are actually these ignorable events, and not delivery reports about delivery failures. We do this in the producer and this was most likely done because of a misunderstanding of what these non-kafka-messages were.

The message is:

Sep  3 01:07:50 foo rafka[9730]: [producer-49582] 2020/09/03 01:07:50 Error consuming delivery event: Unknown event type (broker:9092/4: Receive failed: Connection reset by peer (after 1199998ms in state UP))

The Error consuming delivery event: Unknown event type part is from us, from rafka. The Receive failed part is from librdkafka. The Connection reset by peer part is from tcp, from when recv returns ECONNRESET . This means that the connection with the broker (peer) has been lost. The (after 1199998ms in state UP) part is also from librdkafka.

Q1 Why/how do we get this string in the `.Events()` channel?

.Events() consist of many types of structures as we can see from the poller and the eventPoll method.

Q2 Why are we getting those disconnections?

The librdkafka devs have already made a list of things that could be happening. While the first bullet on that list sounds like the most likely scenario, it is worthy of mention, that they have somehow tried to patch up that behavior with this commit:

The broker will disconnect clients that have not sent any protocol requests
within `connections.max.idle.ms` (broker configuration propertion, defaults
to 10 minutes), but there is no fool proof way for the client to know that it
was a deliberate close by the broker and not an error. To avoid logging these
deliberate idle disconnects as errors the client employs some logic to try to
classify a disconnect as an idle disconnect if no requests have been sent in
the last `socket.timeout.ms` or there are no outstanding, or
queued, requests waiting to be sent. In this case the standard "Disconnect"
error log is silenced (will only be seen with debug enabled).

It seems to me that if what we are seeing is indeed idle connection reaping, it should be handled more gracefully. However, even the developers actually suggest that we do not take that behavior for granted in the comments.

My suggestion is to look into the options provided and maybe adjust, after having more data regading whether it is the broker that disconnects because of idling etc (for example increase the socket.timeout.ms).

This is most probably something we can ignore and no messages are being lost.

agis commented 3 years ago

Great write-up! 💯

IIRC the reason we decided to treat non-delivery events as errors was that we had not much experience with librdkafka back then and we observed that no such events were indeed reporter. So we (wrongly) assumed that this would be an exceptional case that generally shouldn't happen and therefore that we should be notified about it.

Apparently our assumptions were wrong. As you suggested, the situation might be improved with https://github.com/edenhill/librdkafka/commit/bea2d634459a18d970fea29e69329efaa294101b (note that we already set log.connection.close=false.

This looks good to me!

P.S. The CI failures are due to the changes in docker-compose.yml. I know, it works on some systems and breaks in others, depending on the installed Docker version. I'd suggest to leave this unchanged and make any potential improvements in a separate PR.

agis commented 3 years ago

On a second thought, I actually think we should only ignore "Connection reset by peer" errors, not all errors.

I just realized that we recently received other delivery errors that we would not like to ignore (i.e. we should be notified of those):

No route to host (after 1079ms in state CONNECT))
Connection refused (after 24ms in state CONNECT))

fragoulis commented 3 years ago

With https://github.com/skroutz/rafka/pull/90/commits/39e8515971788f47eab1b8bc4ffbe659fc13f1fa I am exposing a way to handle connection errors.

This could also be driven in several directions (for example ignore broken connections with peers) etc.

I suggest, at first we leave this the way it see and observer the error rate and type.

The important thing about this PR is that we will not mark connection errors are delivery errors.

This also replies to https://github.com/skroutz/rafka/pull/90#discussion_r509938375

fragoulis commented 3 years ago

I suggest that you also check for errors

This is already being done https://github.com/skroutz/rafka/pull/90/files#diff-86f23d296531dee41d5e2cf9ef90844267173280c625dbe2738cd163d2278168R95-R97

fragoulis commented 3 years ago

A note on the commits: The final PR will consist of a single commit. The https://github.com/skroutz/rafka/pull/90/commits/eaabe0630c5b060a6bd221d7a4f88d060c1ae299 will be cherry-picked to master separately since it has nothing to do with the PR, I just wanted to be shown.

fragoulis commented 3 years ago

Τα tests σπανε με librdkafka 0.11.6 και στο μαστερ, δεν ξερω γιατί, δε μπορώ να το βρω σε αυτή τη φάση.

skroutz / rafka