Closed fragoulis closed 3 years ago
Great write-up! 💯
IIRC the reason we decided to treat non-delivery events as errors was that we did not have much experience with librdkafka back then, and we observed that no such events were being reported. So we (wrongly) assumed that this would be an exceptional case that generally shouldn't happen, and therefore that we should be notified about it.
Apparently our assumptions were wrong. As you suggested, the situation might be improved with https://github.com/edenhill/librdkafka/commit/bea2d634459a18d970fea29e69329efaa294101b (note that we already set log.connection.close=false).
This looks good to me!
P.S. The CI failures are due to the changes in docker-compose.yml. I know, it works on some systems and breaks on others, depending on the installed Docker version. I'd suggest leaving this unchanged and making any potential improvements in a separate PR.
On second thought, I actually think we should only ignore "Connection reset by peer" errors, not all errors.
I just realized that we recently received other delivery errors that we would not like to ignore (i.e. we should be notified of those):
No route to host (after 1079ms in state CONNECT))
Connection refused (after 24ms in state CONNECT))
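To make the distinction concrete, here is a minimal sketch of a filter that ignores only "Connection reset by peer" and still surfaces the other two errors. It assumes the only thing available to tell them apart is the error text; shouldIgnore is a hypothetical helper and not part of rafka, and the first sample string is pieced together for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldIgnore is a hypothetical helper: it reports whether an error
// string describes a connection reset by peer, the only case proposed
// to be ignored. Matching on the text is an assumption of this sketch.
func shouldIgnore(msg string) bool {
	return strings.Contains(msg, "Connection reset by peer")
}

func main() {
	// Illustrative samples based on the errors quoted in this thread.
	samples := []string{
		"Receive failed: Connection reset by peer (after 1199998ms in state UP)",
		"No route to host (after 1079ms in state CONNECT)",
		"Connection refused (after 24ms in state CONNECT)",
	}
	for _, s := range samples {
		fmt.Printf("ignore=%v: %s\n", shouldIgnore(s), s)
	}
}
```

String matching is brittle if librdkafka ever changes its wording, so this would probably be best treated as a stopgap while we observe which errors actually show up.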
With https://github.com/skroutz/rafka/pull/90/commits/39e8515971788f47eab1b8bc4ffbe659fc13f1fa I am exposing a way to handle connection errors.
This could also be taken in several directions (for example, ignoring broken connections with peers).
I suggest that at first we leave this the way it is and observe the error rate and type.
The important thing about this PR is that we will no longer mark connection errors as delivery errors.
This also replies to https://github.com/skroutz/rafka/pull/90#discussion_r509938375
I suggest that you also check for errors
This is already being done https://github.com/skroutz/rafka/pull/90/files#diff-86f23d296531dee41d5e2cf9ef90844267173280c625dbe2738cd163d2278168R95-R97
A note on the commits: The final PR will consist of a single commit. https://github.com/skroutz/rafka/pull/90/commits/eaabe0630c5b060a6bd221d7a4f88d060c1ae299 will be cherry-picked to master separately since it has nothing to do with this PR; I just wanted it to be visible.
The tests break with librdkafka 0.11.6 on master as well. I don't know why, and I can't track it down at this stage.
From confluent-kafka-go documentation:
However, what they do not mention is that the .Events() channel does not always consist of *kafka.Message values. This is not so obvious, although they somewhat admit it in the "how to set up the producer" example:
Not only do they say "not all objects in the events channel are kafka messages", but these objects are also ignorable (you can optionally log them).
What we present as producer errors are actually these ignorable events, and not delivery reports about delivery failures. We do this in the producer and this was most likely done because of a misunderstanding of what these non-kafka-messages were.
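The distinction can be sketched with a type switch over the events channel. To keep the sketch compilable without confluent-kafka-go, Message and Error below are hypothetical stand-ins for *kafka.Message and kafka.Error, and the Err field is a simplification of how a real delivery report carries its error; classify is likewise an illustrative name, not rafka code.

```go
package main

import "fmt"

// Error is a hypothetical stand-in for kafka.Error (a client-level
// event such as a broken broker connection).
type Error struct{ Text string }

func (e Error) Error() string { return e.Text }

// Message is a hypothetical stand-in for *kafka.Message, i.e. an
// actual delivery report. Err is non-nil when delivery failed.
type Message struct {
	Value []byte
	Err   error
}

// classify sorts a single event from the events channel into a category.
func classify(ev interface{}) string {
	switch e := ev.(type) {
	case *Message:
		if e.Err != nil {
			return "delivery-failed" // a real delivery error
		}
		return "delivered"
	case Error:
		return "client-error" // not a delivery report
	default:
		return "ignored" // any other event type
	}
}

func main() {
	events := make(chan interface{}, 4)
	events <- &Message{Value: []byte("ok")}
	events <- &Message{Err: Error{Text: "message timed out"}}
	events <- Error{Text: "Receive failed: Connection reset by peer"}
	events <- "some other event"
	close(events)

	for ev := range events {
		fmt.Println(classify(ev))
	}
}
```

Only the "delivery-failed" branch corresponds to a message that did not reach the broker; the other non-message events say nothing about delivery.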
The message is:
The "Error consuming delivery event: Unknown event type" part is from us, from rafka.
The "Receive failed" part is from librdkafka.
The "Connection reset by peer" part is from TCP, from when recv returns ECONNRESET. This means that the connection with the broker (peer) has been lost.
The "(after 1199998ms in state UP)" part is also from librdkafka.
Q1 Why/how do we get this string in the .Events() channel?
.Events() consists of many types of structures, as we can see from the poller and the eventPoll method.
Q2 Why are we getting those disconnections?
The librdkafka devs have already made a list of things that could be happening. While the first bullet on that list sounds like the most likely scenario, it is worth mentioning that they have tried to patch up that behavior with this commit:
It seems to me that if what we are seeing is indeed idle connection reaping, it should be handled more gracefully. However, in the comments even the developers suggest that we should not take that behavior for granted.
My suggestion is to look into the options provided and maybe adjust them (for example, increase socket.timeout.ms) once we have more data on whether it is the broker that disconnects because of idling.
This is most probably something we can ignore, and no messages are being lost.
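For reference, a sketch of where such an adjustment would live. The properties are shown as a plain Go map, which has the same shape as the kafka.ConfigMap passed to the producer; the broker address and the timeout value are placeholders for illustration, not recommendations, and producerConfig is a hypothetical helper.

```go
package main

import "fmt"

// producerConfig returns librdkafka properties as a plain map.
// The values are illustrative assumptions, not tuned settings.
func producerConfig() map[string]interface{} {
	return map[string]interface{}{
		"bootstrap.servers":    "localhost:9092", // placeholder broker address
		"log.connection.close": false,            // already set, as noted in this thread
		"socket.timeout.ms":    120000,           // example of a raised value
	}
}

func main() {
	cfg := producerConfig()
	// Print in a fixed order so the output is stable.
	for _, key := range []string{"bootstrap.servers", "log.connection.close", "socket.timeout.ms"} {
		fmt.Printf("%s=%v\n", key, cfg[key])
	}
}
```

Whether raising socket.timeout.ms actually reduces the disconnections is exactly what the extra data should tell us.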