linkedin / Burrow

Kafka Consumer Lag Checking

Burrow consume error. #663

Open chandan-pathak opened 3 years ago

chandan-pathak commented 3 years ago

We are using Burrow to monitor consumer lag. We have two clusters:

  1. Cluster #1: 108 topics
  2. Cluster #2: 21 topics

Burrow works fine for cluster #2, but for cluster #1 it has started logging consume errors intermittently. Can someone help us?

{"level":"warn","ts":1602512182.2992618,"msg":"failed to decode","type":"module","coordinator":"consumer","class":"kafka","name":"local","offset_topic":"consumer_offsets","offset_partition":13,"offset_offset":437696989,"message_type":"metadata","group":"xyz","reason":"value version","version":3} {"level":"error","ts":1602512183.8760965,"msg":"consume error","type":"module","coordinator":"consumer","class":"kafka","name":"local","topic":"consumer_offsets","partition":34,"error":"kafka server: Unexpected (unknown?) server error."} {"level":"error","ts":1602512185.8804855,"msg":"consume error","type":"module","coordinator":"consumer","class":"kafka","name":"local","topic":"consumer_offsets","partition":34,"error":"kafka server: Unexpected (unknown?) server error."} {"level":"error","ts":1602512187.883883,"msg":"consume error","type":"module","coordinator":"consumer","class":"kafka","name":"local","topic":"__consumer_offsets","partition":34,"error":"kafka server: Unexpected (unknown?) server error."} {"level":"error","ts":1602512189.903179,"msg":"consume error","type":"module","coordinator":"consumer","class":"kafka","name":"local","topic":"consumer_offsets","partition":34,"error":"kafka server: Unexpected (unknown?) server error."} {"level":"error","ts":1602512191.9411907,"msg":"consume error","type":"module","coordinator":"consumer","class":"kafka","name":"local","topic":"consumer_offsets","partition":34,"error":"kafka server: Unexpected (unknown?) server error."} {"level":"warn","ts":1602512193.27345,"msg":"failed to decode","type":"module","coordinator":"consumer","class":"kafka","name":"local","offset_topic":"consumer_offsets","offset_partition":10,"offset_offset":358919584,"message_type":"metadata","group":"abc","reason":"value version","version":3} {"level":"error","ts":1602512193.944589,"msg":"consume error","type":"module","coordinator":"consumer","class":"kafka","name":"local","topic":"consumer_offsets","partition":34,"error":"kafka server: Unexpected (unknown?) server error."} {"level":"error","ts":1602512195.9564376,"msg":"consume error","type":"module","coordinator":"consumer","class":"kafka","name":"local","topic":"__consumer_offsets","partition":34,"error":"kafka server: Unexpected (unknown?) server error."} {"level":"error","ts":1602512197.9608557,"msg":"consume error","type":"module","coordinator":"consumer","class":"kafka","name":"local","topic":"consumer_offsets","partition":34,"error":"kafka server: Unexpected (unknown?) server error."} {"level":"warn","ts":1602512199.9541771,"msg":"failed to decode","type":"module","coordinator":"consumer","class":"kafka","name":"local","offset_topic":"consumer_offsets","offset_partition":15,"offset_offset":439570387,"message_type":"metadata","group":"fgsdf","reason":"value version","version":3} {"level":"error","ts":1602512199.987991,"msg":"consume error","type":"module","coordinator":"consumer","class":"kafka","name":"local","topic":"consumer_offsets","partition":34,"error":"kafka server: Unexpected (unknown?) server error."}

chandan-pathak commented 3 years ago

Restarting Burrow resolves the error for partition 34, but some time later the same error appears for another partition. So the question is: why does this happen intermittently and on seemingly random partitions?

bfncs commented 3 years ago

We see a similar problem with Burrow and Kafka. In the broker logs, it looks like an invalid fetch request is issued when fetching consumer offsets:

ERROR [ReplicaManager broker=0] Error processing fetch with max size -2147483648 from consumer on partition __consumer_offsets-31: (fetchOffset=270910250, logStartOffset=-1, maxBytes=-2147483648, currentLeaderEpoch=Optional.empty) (kafka.server.ReplicaManager)
java.lang.IllegalArgumentException: Invalid max size -2147483648 for log read from segment FileRecords(file= /var/lib/kafka/__consumer_offsets-31/00000000000000000000.log, start=0, end=2147483647)
    at kafka.log.LogSegment.read(LogSegment.scala:274)
    at kafka.log.Log$$anonfun$read$2.apply(Log.scala:1245)
    at kafka.log.Log$$anonfun$read$2.apply(Log.scala:1200)
    at kafka.log.Log.maybeHandleIOException(Log.scala:2013)
    at kafka.log.Log.read(Log.scala:1200)
    at kafka.cluster.Partition$$anonfun$readRecords$1.apply(Partition.scala:804)
    at kafka.cluster.Partition$$anonfun$readRecords$1.apply(Partition.scala:781)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
    at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:257)
    at kafka.cluster.Partition.readRecords(Partition.scala:781)
    at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$read$1(ReplicaManager.scala:920)
    at kafka.server.ReplicaManager$$anonfun$readFromLocalLog$1.apply(ReplicaManager.scala:991)
    at kafka.server.ReplicaManager$$anonfun$readFromLocalLog$1.apply(ReplicaManager.scala:990)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at kafka.server.ReplicaManager.readFromLocalLog(ReplicaManager.scala:990)
    at kafka.server.ReplicaManager.readFromLog$1(ReplicaManager.scala:833)
    at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:845)
    at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:721)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:116)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
    at java.lang.Thread.run(Thread.java:748)

The value for maxBytes should never be negative, yet the fetch request arrives with -2147483648.
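For what it's worth, -2147483648 is exactly the minimum value of a signed 32-bit integer, i.e. what 2147483647 wraps to when incremented, so this looks like an int32 overflow somewhere between the requested fetch size and what the broker ends up reading. A tiny Go sketch of that arithmetic (purely illustrative, not Burrow or sarama code):

package main

import (
	"fmt"
	"math"
)

func main() {
	// The broker log complains about "max size -2147483648". That value is
	// exactly math.MinInt32: what a signed 32-bit integer wraps to when
	// math.MaxInt32 (2147483647) is incremented by one.
	maxBytes := int32(math.MaxInt32) // the largest fetch size a client can ask for
	fmt.Println(maxBytes)            // 2147483647
	fmt.Println(maxBytes + 1)        // -2147483648, the value in the broker log
	fmt.Println(maxBytes+1 == math.MinInt32) // true
}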

There is a possibly related issue in the Kafka bug tracker, KAFKA-7656, which may also involve the sarama client. Please get in touch if we can provide more details to help fix this problem.

Update: this was resolved for us by raising the configured kafka-version.
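
For anyone else landing here: the kafka-version setting lives in the client-profile section of Burrow's TOML configuration and, as far as we understand, controls the protocol version the consumer module speaks to the brokers. A rough sketch of the relevant part of the change (profile and cluster names, the broker address, and the exact version string are placeholders for your own setup):

[client-profile.myclient]
client-id="burrow-lagchecker"
kafka-version="2.0.0"   # raise this to match (or at least approach) the broker version

[cluster.mycluster]
class-name="kafka"
client-profile="myclient"
servers=[ "kafka01.example.com:9092" ]

[consumer.mycluster]
class-name="kafka"
cluster="mycluster"
client-profile="myclient"
servers=[ "kafka01.example.com:9092" ]

The important line is kafka-version; the rest is only the surrounding context it sits in.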