seglo / kafka-lag-exporter

Monitor Kafka Consumer Group Latency with Kafka Lag Exporter
Apache License 2.0

Resetting generation due to consumer pro-actively leaving the group #165

Closed: matanbaruch closed this issue 4 years ago

matanbaruch commented 4 years ago

Kafka version: 2.4.1.1 (AWS MSK)
kafka-lag-exporter version: 0.6.4
Debug logging enabled. Logs attached: logs.txt

application.conf

> kafka-lag-exporter {
>   reporters.prometheus.port = 8000
>   poll-interval = 300
>   lookup-table-size = 60
>   clusters = [
>     {
>       name = "KafkaProdOregon"
>       bootstrap-brokers = "b-1.XXX.XXX.XXX.kafka.us-west-2.amazonaws.com:9092,b-2..XXX.XXX.XXX.kafka.us-west-2.amazonaws.com:9092,b-3.XXX.XXX.XXX.kafka.us-west-2.amazonaws.com:9092"
>       kafka-client-timeout = 300
>       labels = {
>         location = "oregon"
>         environment = "production"
>         exporter = "kafka-lag"
>       }
>     }
>   ]
> }
seglo commented 4 years ago

The AdminClient timed out when listing groups, but it looks like it was beginning to recover near the end of the log. Is this a consistent issue or do you only see it occasionally?

2020-10-01 11:49:41,946 ERROR c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-KafkaProdOregon - Supervisor RestartSupervisor saw failure: A failure occurred while retrieving offsets.  Shutting down. java.lang.Exception: A failure occurred while retrieving offsets.  Shutting down.
    at com.lightbend.kafkalagexporter.ConsumerGroupCollector$CollectorBehavior.$anonfun$collector$1(ConsumerGroupCollector.scala:214)
    at akka.actor.typed.internal.BehaviorImpl$ReceiveBehavior.receive(BehaviorImpl.scala:136)
    at akka.actor.typed.Behavior$.interpret(Behavior.scala:274)
    at akka.actor.typed.Behavior$.interpretMessage(Behavior.scala:230)
    at akka.actor.typed.internal.InterceptorImpl$$anon$2.apply(InterceptorImpl.scala:57)
    at akka.actor.typed.internal.RestartSupervisor.aroundReceive(Supervision.scala:263)
    at akka.actor.typed.internal.InterceptorImpl.receive(InterceptorImpl.scala:85)
    at akka.actor.typed.Behavior$.interpret(Behavior.scala:274)
    at akka.actor.typed.Behavior$.interpretMessage(Behavior.scala:230)
    at akka.actor.typed.internal.adapter.ActorAdapter.handleMessage(ActorAdapter.scala:129)
    at akka.actor.typed.internal.adapter.ActorAdapter.aroundReceive(ActorAdapter.scala:106)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:577)
    at akka.actor.ActorCell.invoke(ActorCell.scala:547)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
    at akka.dispatch.Mailbox.run(Mailbox.scala:231)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1601552980162) timed out at 1601552980163 after 1 attempt(s)
    at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
    at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
    at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
    at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
    at com.lightbend.kafkalagexporter.KafkaClient$.$anonfun$kafkaFuture$1(KafkaClient.scala:50)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1601552980162) timed out at 1601552980163 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call.
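
The call that timed out is bounded by the exporter's kafka-client-timeout setting. For comparison, here is a sketch of the same cluster entry with explicit HOCON duration units (the broker hostname is a placeholder). If these fields are parsed as HOCON durations, a bare value such as 300 is treated as 300 milliseconds, so spelling out the unit removes any ambiguity:

> kafka-lag-exporter {
>   reporters.prometheus.port = 8000
>   poll-interval = 300 seconds
>   lookup-table-size = 60
>   clusters = [
>     {
>       name = "KafkaProdOregon"
>       bootstrap-brokers = "b-1.example.kafka.us-west-2.amazonaws.com:9092"
>       kafka-client-timeout = 300 seconds
>     }
>   ]
> }
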
matanbaruch commented 4 years ago

This is a consistent issue. I can't even get the metrics; the Prometheus endpoint only serves JVM metrics.

seglo commented 4 years ago

What I meant was: does it ever produce Kafka Lag Exporter metrics, or is it never able to connect to the Kafka cluster? If it always times out, then I would suggest troubleshooting that connectivity issue.
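
One way to narrow that down is to run the same AdminClient calls from this machine, outside the exporter. Below is a minimal standalone Scala sketch (the bootstrap address is a placeholder and the 30 second timeouts are arbitrary); if this also times out, the problem is in the path from this host to the brokers rather than in the exporter itself:

    import java.util.Properties
    import java.util.concurrent.TimeUnit
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
    import scala.jdk.CollectionConverters._

    object ListGroupsCheck extends App {
      val props = new Properties()
      // Placeholder bootstrap address; substitute the real MSK broker list.
      props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
        "b-1.example.kafka.us-west-2.amazonaws.com:9092")
      props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000")

      val admin = AdminClient.create(props)
      try {
        // The same kind of calls the exporter makes: list groups, then describe them.
        val groups = admin.listConsumerGroups().all().get(30, TimeUnit.SECONDS).asScala
        groups.foreach(g => println(g.groupId()))

        val described = admin
          .describeConsumerGroups(groups.map(_.groupId()).toList.asJava)
          .all()
          .get(30, TimeUnit.SECONDS)
        println(s"Described ${described.size()} consumer group(s)")
      } finally admin.close()
    }
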

matanbaruch commented 4 years ago

It never produces Lag Exporter metrics. There is no connection issue; I'm running different exporters from the same machine and all of them work fine.

matanbaruch commented 4 years ago

Anyone?

seglo commented 4 years ago

Can you include more logs? Do they ever indicate that the consumer or admin client even successfully connected to your cluster?

matanbaruch commented 4 years ago

The log file is included in the main post. I raised the log level to debug.

It looks like they never successfully connected to the cluster. The cluster runs without ACLs.

seglo commented 4 years ago

Unfortunately I don't have much experience configuring ACLs with Kafka clients, but others have used Kafka Lag Exporter successfully in secured environments.

Based on some cursory Google searches for Call(callName=findCoordinator, deadlineMs=[timeout]) timed out, it seems the problem is generally due to client configuration errors, or possibly to misconfigured advertised listeners on the brokers, but that's probably not the case for you since you say other clients can connect to the cluster fine. I would carefully compare the config of those other clients with what you're providing to Kafka Lag Exporter to see where the difference might be.
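
If the difference turns out to be client security settings, the exporter can pass standard Kafka client properties per cluster. A sketch, assuming the version in use supports the per-cluster consumer-properties and admin-client-properties blocks (all values here are placeholders; mirror whatever the working clients use, e.g. security.protocol, sasl.mechanism, sasl.jaas.config, ssl.truststore.location):

>   clusters = [
>     {
>       name = "KafkaProdOregon"
>       bootstrap-brokers = "b-1.example.kafka.us-west-2.amazonaws.com:9092"
>       consumer-properties = {
>         security.protocol = PLAINTEXT
>       }
>       admin-client-properties = {
>         security.protocol = PLAINTEXT
>       }
>     }
>   ]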