linkedin / Burrow

Kafka Consumer Lag Checking
Apache License 2.0

Burrow stops emitting metric after Kafka upgrade #827

Open arushi315 opened 3 months ago

arushi315 commented 3 months ago

Burrow Version: 1.8.0

Issue: After upgrading Kafka from version 3.6.x to 3.7.x, we observed that the Burrow service stopped emitting the consumer lag metric. Restarting the Burrow service temporarily resolved the issue.

Logs: The following warnings and errors were observed in the Burrow logs:

{"level":"error","ts":1720927697.84136,"msg":"failed to fetch offsets from broker","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"EOF","broker":7}
.......
{"level":"warn","ts":1720927137.8406005,"msg":"error in OffsetResponse","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"kafka server: Tried to send a message to a replica that is not the leader for some partition. Your metadata is out of date","broker":3,"topic":"kafka-connect-offsets.internal","partition":4}

The Kafka upgrade was performed in a rolling fashion, one broker at a time. Communication disruptions were expected with the broker being upgraded, but the other brokers should have remained available.

Burrow Configuration: Here is the configuration we are using:

[client-profile.profile]
kafka-version = "3.6.1"

[cluster.local-cluster]
client-profile = "profile"
class-name = "kafka"
topic-refresh = 60
offset-refresh = 10
groups-reaper-refresh = 10

[consumer.local-kafka]
class-name = "kafka"
cluster = "local-cluster"

[consumer.local-kafka-zk]
class-name = "kafka_zk"
cluster = "local-cluster"

[httpserver.default]
address = "{{ $http_address }}"

[logging]
level = "{{ $log_level }}"

Note: The kafka-version is set to 3.6.1, but as mentioned earlier, Burrow works fine with Kafka 3.7.x after a restart, so this does not seem to be the root cause.

Request:

Please let me know if additional information is required. Thank you!

arushi315 commented 3 months ago

It looks like the issue is intermittent, because I am not able to reproduce it when upgrading Kafka. During the upgrade the metric does stop for a short while, but once the upgrade has completed it starts showing up again without having to restart Burrow.

arushi315 commented 3 months ago

For the Kafka cluster where we originally noticed the issue, we have 9 brokers and observed EOF errors with all 9 brokers:

{"level":"error","ts":1720926917.8395112,"msg":"failed to fetch offsets from broker","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"EOF","broker":1}
{"level":"error","ts":1720927037.8437417,"msg":"failed to fetch offsets from broker","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"EOF","broker":2}
{"level":"error","ts":1720927147.84239,"msg":"failed to fetch offsets from broker","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"EOF","broker":3}
{"level":"error","ts":1720927257.8391266,"msg":"failed to fetch offsets from broker","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"EOF","broker":4}
{"level":"error","ts":1720927377.8412945,"msg":"failed to fetch offsets from broker","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"EOF","broker":5}
{"level":"error","ts":1720927507.8451424,"msg":"failed to fetch offsets from broker","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"EOF","broker":6}
{"level":"error","ts":1720927697.84136,"msg":"failed to fetch offsets from broker","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"EOF","broker":7}
{"level":"error","ts":1720927887.8561947,"msg":"failed to fetch offsets from broker","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"EOF","broker":8}
{"level":"error","ts":1720928077.8389344,"msg":"failed to fetch offsets from broker","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","sarama_error":"EOF","broker":9}
....
{"level":"error","ts":1720928077.8438833,"msg":"failed to get the list of available consumer groups","type":"module","coordinator":"cluster","class":"kafka","name":"local-cluster","error":"dial tcp 10.104.7.186:9092: connect: connection refused"}