seglo / kafka-lag-exporter

Monitor Kafka Consumer Group Latency with Kafka Lag Exporter
Apache License 2.0

Error - Supervisor RestartSupervisor saw failure: A failure occurred while retrieving offsets. Shutting down. java.lang.Exception: #98

Closed rahulchak closed 4 years ago

rahulchak commented 4 years ago

We are using a secured Confluent Kafka cluster, and I provided the admin-client-properties section in application.conf with the SSL key and trust store file locations and passwords. However, when the exporter starts, all the SSL parameters are null; it seems like it is not reading the admin-client-properties section.
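For reference, here is the shape the exporter expects for a secured cluster — a minimal sketch based on the project's documented per-cluster options, with placeholder paths and credentials. Note that the section key must be spelled exactly admin-client-properties, and that security.protocol usually needs to be set explicitly for a TLS/SASL listener:

    clusters = [
      {
        name = "my-cluster"
        bootstrap-brokers = "broker:9092"
        admin-client-properties = {
          # Plain Kafka AdminClient properties, passed through to the client.
          security.protocol = "SASL_SSL"
          sasl.mechanism = "PLAIN"
          sasl.jaas.config = "org.apache.kafka.common.security.plain.PlainLoginModule required username='user' password='pass';"
          ssl.truststore.location = "/path/to/truststore.pkcs12"
          ssl.truststore.password = "changeit"
          ssl.truststore.type = "PKCS12"
        }
      }
    ]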

seglo commented 4 years ago

@rahulchak Thanks for using Kafka Lag Exporter. Can you please confirm which version you're using and attach a snippet of the log with the exception stack trace?

rahulchak commented 4 years ago

Thank you for your quick response. I believe the version is 0.5.5, and I have attached the log here. I am not sure how to get the exception stack trace, though. Below are the contents of my conf file:

**Conf file**

kafka-lag-exporter {
  port = 9080
  kafka-client-timeout = 20
  clusters = [
    {
      name = "stormcloud-devtest"
      bootstrap-brokers = "test.com:9092"
      ssl.keystore.location = "/var/ssl/private/star.datafabric.gcp.west.com.pkcs12"
      ssl.key.password = "*******"
      ssl.keystore.password = "*******"
      ssl.truststore.location = "/var/ssl/private/truststore.pkcs12"
      ssl.truststore.password = "*******"
      sasl.mechanism = "PLAIN"
      sasl.jaas.config = "org.apache.kafka.common.security.plain.PlainLoginModule required username='*******' password='*******';"
      labels = {
        location = "ny"
        zone = "us-west1"
      }
      dmin-client-properties = {
        client.id = "admin-client-id"
        ssl.keystore.location = "/var/ssl/private/star.datafabric.gcp.west.com.pkcs12"
        ssl.key.password = "*******"
        ssl.keystore.password = "*******"
        ssl.truststore.location = "/var/ssl/private/truststore.pkcs12"
        ssl.truststore.password = "*******"
        sasl.mechanism = "PLAIN"
        sasl.jaas.config = "org.apache.kafka.common.security.plain.PlainLoginModule required username='*******' password='*******';"
      }
    }
  ]
}

**Log**: kafkalogexporter_log.txt

ccotar commented 4 years ago

Having a similar error, but connecting to CloudKarafka from our EKS through a VPC:

Version: 0.5.5

Stacktrace:

Dec 13 14:08:18 kafka-lag-exporter-768d88df9f-hhzf2 kafka-lag-exporter INFO c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-app - Collecting offsets 
Dec 13 14:08:28 kafka-lag-exporter-768d88df9f-hhzf2 kafka-lag-exporter INFO o.a.k.c.a.i.AdminMetadataManager  - [AdminClient clientId=adminclient-598] Metadata update failed org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call.
Dec 13 14:08:28 kafka-lag-exporter-768d88df9f-hhzf2 kafka-lag-exporter ERROR c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-app - Supervisor RestartSupervisor saw failure: A failure occurred while retrieving offsets.  Shutting down. java.lang.Exception: A failure occurred while retrieving offsets.  Shutting down.
    at com.lightbend.kafkalagexporter.ConsumerGroupCollector$CollectorBehavior.$anonfun$collector$1(ConsumerGroupCollector.scala:188)
    at akka.actor.typed.internal.BehaviorImpl$ReceiveBehavior.receive(BehaviorImpl.scala:37)
    at akka.actor.typed.Behavior$.interpret(Behavior.scala:437)
    at akka.actor.typed.Behavior$.interpretMessage(Behavior.scala:393)
    at akka.actor.typed.internal.InterceptorImpl$$anon$2.apply(InterceptorImpl.scala:52)
    at akka.actor.typed.internal.RestartSupervisor.aroundReceive(Supervision.scala:248)
    at akka.actor.typed.internal.InterceptorImpl.receive(InterceptorImpl.scala:79)
    at akka.actor.typed.Behavior$.interpret(Behavior.scala:437)
    at akka.actor.typed.Behavior$.interpretMessage(Behavior.scala:393)
    at akka.actor.typed.internal.adapter.ActorAdapter.handleMessage(ActorAdapter.scala:121)
    at akka.actor.typed.internal.adapter.ActorAdapter.aroundReceive(ActorAdapter.scala:102)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612)
    at akka.actor.ActorCell.invoke(ActorCell.scala:581)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
    at akka.dispatch.Mailbox.run(Mailbox.scala:229)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.TimeoutException: null
    at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
    at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
    at com.lightbend.kafkalagexporter.KafkaClient$.$anonfun$kafkaFuture$1(KafkaClient.scala:50)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:658)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Dec 13 14:08:29 kafka-lag-exporter-768d88df9f-hhzf2 kafka-lag-exporter INFO c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-app - Spawned ConsumerGroupCollector for cluster: app 
Dec 13 14:08:29 kafka-lag-exporter-768d88df9f-hhzf2 kafka-lag-exporter INFO o.a.k.c.consumer.ConsumerConfig  - ConsumerConfig values: 
    allow.auto.create.topics = true
    auto.commit.interval.ms = 5000
    auto.offset.reset = latest
    bootstrap.servers = [**********:9092]
    check.crcs = true
    client.dns.lookup = default
    client.id = 
    client.rack = 
    connections.max.idle.ms = 540000
    default.api.timeout.ms = 60000
    enable.auto.commit = false
    exclude.internal.topics = true
    fetch.max.bytes = 52428800
    fetch.max.wait.ms = 500
    fetch.min.bytes = 1
    group.id = kafkalagexporter
    group.instance.id = null
    heartbeat.interval.ms = 3000
    interceptor.classes = []
    internal.leave.group.on.close = true
    isolation.level = read_uncommitted
    key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
    max.partition.fetch.bytes = 1048576
    max.poll.interval.ms = 300000
    max.poll.records = 500
    metadata.max.age.ms = 300000
    metric.reporters = []
    metrics.num.samples = 2
    metrics.recording.level = INFO
    metrics.sample.window.ms = 30000
    partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
    receive.buffer.bytes = 65536
    reconnect.backoff.max.ms = 1000
    reconnect.backoff.ms = 50
    request.timeout.ms = 10000
    retry.backoff.ms = 1000
    sasl.client.callback.handler.class = null
    sasl.jaas.config = null
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    sasl.kerberos.min.time.before.relogin = 60000
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    sasl.kerberos.ticket.renew.window.factor = 0.8
    sasl.login.callback.handler.class = null
    sasl.login.class = null
    sasl.login.refresh.buffer.seconds = 300
    sasl.login.refresh.min.period.seconds = 60
    sasl.login.refresh.window.factor = 0.8
    sasl.login.refresh.window.jitter = 0.05
    sasl.mechanism = GSSAPI
    security.protocol = PLAINTEXT
    send.buffer.bytes = 131072
    session.timeout.ms = 10000
    ssl.cipher.suites = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    ssl.endpoint.identification.algorithm = https
    ssl.key.password = null
    ssl.keymanager.algorithm = SunX509
    ssl.keystore.location = null
    ssl.keystore.password = null
    ssl.keystore.type = JKS
    ssl.protocol = TLS
    ssl.provider = null
    ssl.secure.random.implementation = null
    ssl.trustmanager.algorithm = PKIX
    ssl.truststore.location = null
    ssl.truststore.password = null
    ssl.truststore.type = JKS
    value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer

This issue was PEBKAC, a.k.a. my fault. It was a Kafka server IP mismatch between environments while I was upgrading with Helm to change cluster IPs. 🤦‍♂

zhoulouzi commented 4 years ago

Same error with Kafka version 1.0.1.

seglo commented 4 years ago

@rahulchak @ccotar What version of Kafka (or Confluent Platform Kafka) clusters are you connecting to? I've only tested Kafka Lag Exporter with a limited number of broker versions, all of which are > 2.0.0.

ccotar commented 4 years ago

@seglo my issue was my own fault; I somehow had the wrong cluster IP during a helm upgrade. My bad. 👍

seglo commented 4 years ago

@ccotar I'm glad you got it resolved. Thanks for following up.

dmtrrk commented 4 years ago

@seglo I have the same issue with Kafka 2.0.0. I have a single cluster in my test with only the name + bootstrap-brokers parameters. On the same workstation I can run a simple Confluent SDK-based app against the same server name.

afiffing commented 4 years ago

Hi @seglo, is there any update on this? I'm facing the same issue.

Here is the conf file that I am using:

kafka-lag-exporter {
  reporters.prometheus.port = 8080
  poll-interval = 30 seconds
  lookup-table-size = 60
  clusters = [
    {
      name = "confluent-cluster-1"
      bootstrap-brokers = "confluent-cluster-endpoint:9092"
      consumer-properties = {
        ssl.endpoint.identification.algorithm = "https"
        sasl.mechanism = "PLAIN"
        retry.backoff.ms = "500"
        sasl.jaas.config = "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"******\" password=\"*******\";"
        security.protocol = "SASL_SSL"
        client.id = "devops--consumerlag_exporter--all"
      }
      labels = {
        location = "sgp"
        zone = "ap-southeast-1"
      }
    }
  ]
  kafka-client-timeout = 30 seconds
  metric-whitelist = [".*"]
}

**Log**

2020-08-27 10:58:06,059 INFO  akka.event.slf4j.Slf4jLogger  - Slf4jLogger started
2020-08-27 10:58:06,161 INFO  akka.actor.typed.Behavior akka://kafka-lag-exporter/user - Starting Kafka Lag Exporter with configuration:
Poll interval: 30 seconds
Lookup table size: 60
Prometheus metrics endpoint port: 8000
Admin client consumer group id: kafkalagexporter
Kafka client timeout: 30 seconds
Statically defined Clusters:
  Cluster name: confluent-cluster-1
  Cluster Kafka bootstrap brokers: confluent-cluster-endpoint:9092
Watchers:
  Strimzi: false
2020-08-27 10:58:06,242 INFO  akka.actor.typed.Behavior akka://kafka-lag-exporter/user - Cluster Added: KafkaCluster(confluent-cluster-1,confluent-cluster-endpoint:9092)
2020-08-27 10:58:06,253 INFO  akka.actor.typed.Behavior akka://kafka-lag-exporter/user/consumer-group-collector-confluent-cluster-1 - Spawned ConsumerGroupCollector for cluster: confluent-cluster-1
2020-08-27 10:58:06,273 INFO  o.a.k.c.admin.AdminClientConfig  - AdminClientConfig values:
    bootstrap.servers = [confluent-cluster-endpoint:9092]
    client.dns.lookup = default
    client.id =
    connections.max.idle.ms = 300000
    metadata.max.age.ms = 300000
    metric.reporters = []
    metrics.num.samples = 2
    metrics.recording.level = INFO
    metrics.sample.window.ms = 30000
    receive.buffer.bytes = 65536
    reconnect.backoff.max.ms = 1000
    reconnect.backoff.ms = 50
    request.timeout.ms = 30000
    retries = 0
    retry.backoff.ms = 1000
    sasl.client.callback.handler.class = null
    sasl.jaas.config = null
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    sasl.kerberos.min.time.before.relogin = 60000
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    sasl.kerberos.ticket.renew.window.factor = 0.8
    sasl.login.callback.handler.class = null
    sasl.login.class = null
    sasl.login.refresh.buffer.seconds = 300
    sasl.login.refresh.min.period.seconds = 60
    sasl.login.refresh.window.factor = 0.8
    sasl.login.refresh.window.jitter = 0.05
    sasl.mechanism = GSSAPI
    security.protocol = PLAINTEXT
    send.buffer.bytes = 131072
    ssl.cipher.suites = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    ssl.endpoint.identification.algorithm = https
    ssl.key.password = null
    ssl.keymanager.algorithm = SunX509
    ssl.keystore.location = null
    ssl.keystore.password = null
    ssl.keystore.type = JKS
    ssl.protocol = TLS
    ssl.provider = null
    ssl.secure.random.implementation = null
    ssl.trustmanager.algorithm = PKIX
    ssl.truststore.location = null
    ssl.truststore.password = null
    ssl.truststore.type = JKS
2020-08-27 10:58:06,352 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka version : 2.1.0
2020-08-27 10:58:06,352 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka commitId : eec43959745f444f
2020-08-27 10:58:36,357 INFO  o.a.k.c.a.i.AdminMetadataManager  - [AdminClient clientId=adminclient-1] Metadata update failed org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
2020-08-27 10:58:36,359 INFO  o.a.k.c.a.i.AdminMetadataManager  - [AdminClient clientId=adminclient-1] Metadata update failed org.apache.kafka.common.errors.TimeoutException: The AdminClient thread has exited.
2020-08-27 10:58:36,368 INFO  o.a.k.c.consumer.ConsumerConfig  - ConsumerConfig values:
    auto.commit.interval.ms = 5000
    auto.offset.reset = latest
    bootstrap.servers = [confluent-cluster-endpoint:9092]
    check.crcs = true
    client.dns.lookup = default
    client.id =
    connections.max.idle.ms = 540000
    default.api.timeout.ms = 60000
    enable.auto.commit = false
    exclude.internal.topics = true
    fetch.max.bytes = 52428800
    fetch.max.wait.ms = 500
    fetch.min.bytes = 1
    group.id = kafkalagexporter
    heartbeat.interval.ms = 3000
    interceptor.classes = []
    internal.leave.group.on.close = true
    isolation.level = read_uncommitted
    key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
    max.partition.fetch.bytes = 1048576
    max.poll.interval.ms = 300000
    max.poll.records = 500
    metadata.max.age.ms = 300000
    metric.reporters = []
    metrics.num.samples = 2
    metrics.recording.level = INFO
    metrics.sample.window.ms = 30000
    partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
    receive.buffer.bytes = 65536
    reconnect.backoff.max.ms = 1000
    reconnect.backoff.ms = 50
    request.timeout.ms = 30000
    retry.backoff.ms = 1000
    sasl.client.callback.handler.class = null
    sasl.jaas.config = null
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    sasl.kerberos.min.time.before.relogin = 60000
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    sasl.kerberos.ticket.renew.window.factor = 0.8
    sasl.login.callback.handler.class = null
    sasl.login.class = null
    sasl.login.refresh.buffer.seconds = 300
    sasl.login.refresh.min.period.seconds = 60
    sasl.login.refresh.window.factor = 0.8
    sasl.login.refresh.window.jitter = 0.05
    sasl.mechanism = GSSAPI
    security.protocol = PLAINTEXT
    send.buffer.bytes = 131072
    session.timeout.ms = 10000
    ssl.cipher.suites = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    ssl.endpoint.identification.algorithm = https
    ssl.key.password = null
    ssl.keymanager.algorithm = SunX509
    ssl.keystore.location = null
    ssl.keystore.password = null
    ssl.keystore.type = JKS
    ssl.protocol = TLS
    ssl.provider = null
    ssl.secure.random.implementation = null
    ssl.trustmanager.algorithm = PKIX
    ssl.truststore.location = null
    ssl.truststore.password = null
    ssl.truststore.type = JKS
    value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
2020-08-27 10:58:36,401 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka version : 2.1.0
2020-08-27 10:58:36,401 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka commitId : eec43959745f444f
2020-08-27 10:58:36,411 ERROR akka.actor.typed.Behavior akka://kafka-lag-exporter/user/consumer-group-collector-confluent-cluster-1 - Supervisor RestartSupervisor saw failure: A failure occurred while retrieving offsets.  Shutting down. java.lang.Exception: A failure occurred while retrieving offsets.  Shutting down.
    at com.lightbend.kafkalagexporter.ConsumerGroupCollector$.$anonfun$collector$1(ConsumerGroupCollector.scala:125)
    at akka.actor.typed.internal.BehaviorImpl$ReceiveBehavior.receive(BehaviorImpl.scala:34)
    at akka.actor.typed.Behavior$.interpret(Behavior.scala:421)
    at akka.actor.typed.Behavior$.interpretMessage(Behavior.scala:394)
    at akka.actor.typed.internal.InterceptorImpl$$anon$2.apply(InterceptorImpl.scala:50)
    at akka.actor.typed.internal.RestartSupervisor.aroundReceive(Supervision.scala:229)
    at akka.actor.typed.internal.InterceptorImpl.receive(InterceptorImpl.scala:74)
    at akka.actor.typed.Behavior$.interpret(Behavior.scala:421)
    at akka.actor.typed.Behavior$.interpretMessage(Behavior.scala:394)
    at akka.actor.typed.internal.adapter.ActorAdapter.akka$actor$typed$internal$adapter$ActorAdapter$$handleMessage(ActorAdapter.scala:82)
    at akka.actor.typed.internal.adapter.ActorAdapter$$anonfun$running$1.applyOrElse(ActorAdapter.scala:78)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.typed.internal.adapter.ActorAdapter.aroundReceive(ActorAdapter.scala:39)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.KafkaException: Failed to find brokers to send ListGroups
    at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
    at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
    at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
    at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:274)
    at com.lightbend.kafkalagexporter.KafkaClient$.$anonfun$kafkaFuture$1(KafkaClient.scala:44)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:658)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.KafkaException: Failed to find brokers to send ListGroups
    at org.apache.kafka.clients.admin.KafkaAdminClient$22.handleFailure(KafkaAdminClient.java:2610)
    at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:614)
    at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:730)
    at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutPendingCalls(KafkaAdminClient.java:798)
    at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1092)
    ... 1 common frames omitted
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
2020-08-27 10:58:37,498 INFO  akka.actor.typed.Behavior akka://kafka-lag-exporter/user/consumer-group-collector-confluent-cluster-1 - Spawned ConsumerGroupCollector for cluster: confluent-cluster-1
2020-08-27 10:58:37,505 INFO  o.a.k.c.admin.AdminClientConfig  - AdminClientConfig values:
    bootstrap.servers = [confluent-cluster-endpoint:9092]
    client.dns.lookup = default
    client.id =
    connections.max.idle.ms = 300000
    metadata.max.age.ms = 300000
    metric.reporters = []
    metrics.num.samples = 2
    metrics.recording.level = INFO
    metrics.sample.window.ms = 30000
    receive.buffer.bytes = 65536
    reconnect.backoff.max.ms = 1000
    reconnect.backoff.ms = 50
    request.timeout.ms = 30000
    retries = 0
    retry.backoff.ms = 1000
    sasl.client.callback.handler.class = null
    sasl.jaas.config = null
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    sasl.kerberos.min.time.before.relogin = 60000
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    sasl.kerberos.ticket.renew.window.factor = 0.8
    sasl.login.callback.handler.class = null
    sasl.login.class = null
    sasl.login.refresh.buffer.seconds = 300
    sasl.login.refresh.min.period.seconds = 60
    sasl.login.refresh.window.factor = 0.8
    sasl.login.refresh.window.jitter = 0.05
    sasl.mechanism = GSSAPI
    security.protocol = PLAINTEXT
    send.buffer.bytes = 131072
    ssl.cipher.suites = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    ssl.endpoint.identification.algorithm = https
    ssl.key.password = null
    ssl.keymanager.algorithm = SunX509
    ssl.keystore.location = null
    ssl.keystore.password = null
    ssl.keystore.type = JKS
    ssl.protocol = TLS
    ssl.provider = null
    ssl.secure.random.implementation = null
    ssl.trustmanager.algorithm = PKIX
    ssl.truststore.location = null
    ssl.truststore.password = null
    ssl.truststore.type = JKS
2020-08-27 10:58:37,509 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka version : 2.1.0
2020-08-27 10:58:37,509 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka commitId : eec43959745f444f
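
Note that in both the AdminClientConfig and ConsumerConfig dumps above, security.protocol = PLAINTEXT and sasl.jaas.config = null, i.e. the SASL_SSL settings from the conf file were applied to neither client. As a point of comparison, a cluster entry that supplies the security settings to both the consumer and the admin client might look like the sketch below (section names per the project's README; credentials are placeholders). It is also worth confirming that the deployed image is actually reading this application.conf:

    {
      name = "confluent-cluster-1"
      bootstrap-brokers = "confluent-cluster-endpoint:9092"
      consumer-properties = {
        security.protocol = "SASL_SSL"
        sasl.mechanism = "PLAIN"
        sasl.jaas.config = "org.apache.kafka.common.security.plain.PlainLoginModule required username='user' password='pass';"
      }
      admin-client-properties = {
        security.protocol = "SASL_SSL"
        sasl.mechanism = "PLAIN"
        sasl.jaas.config = "org.apache.kafka.common.security.plain.PlainLoginModule required username='user' password='pass';"
      }
    }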
brunodomenici commented 4 years ago

Same here with Confluent Cloud. We opened #142 almost a month ago. Somehow it got back to work, but now it isn't working again. In our case it happens in one Confluent cluster; we have another where kafka-lag-exporter works fine. Thank you.

seglo commented 4 years ago

@afiffing The title of this issue is the generic exception message emitted when collecting offsets fails in Kafka Lag Exporter. It can fail for many different reasons. In your case the logs indicate the underlying exception is an org.apache.kafka.common.KafkaException: Failed to find brokers to send ListGroups, which may be related to a Kafka broker listener misconfiguration.
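
On the listener point: after the initial bootstrap connection, a Kafka client talks to the addresses the brokers advertise in their metadata, so those addresses must be resolvable and reachable from wherever the exporter runs; otherwise calls like ListGroups time out even though bootstrap succeeds. A broker-side sketch (server.properties, with placeholder hostnames):

    # listeners is the socket the broker binds to;
    # advertised.listeners is the address handed back to clients after
    # bootstrap — it must be reachable from the exporter's network.
    listeners=SASL_SSL://0.0.0.0:9092
    advertised.listeners=SASL_SSL://broker-1.example.com:9092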

seglo commented 4 years ago

I'm closing this issue due to its generality. If you encounter this exception, look at the inner exception included in the stack trace to see the root cause.

sanjaygarde commented 2 years ago

When I exec into the pod, I see the following application.conf:

kafka-lag-exporter {
  port = 8000
  poll-interval = 30 seconds
  lookup-table-size = 60
  client-group-id = "kafkalagexporter"
  kafka-client-timeout = 10 seconds
  clusters = [
    {
      name = "lkc-0xr722"
      bootstrap-brokers = "xxx"
      consumer-properties = {
        sasl = "map[jaas:map[config:org.apache.kafka.common.security.plain.PlainLoginModule required username='LDMGW4BF4LOA5Z4I' password='xxx';] mechanism:PLAIN]"
        security = "map[protocol:SASL_SSL]"
        ssl = "map[endpoint:map[identification:map[algorithm:https]]]"
      }
      admin-client-properties = {
        sasl = "map[jaas:map[config:org.apache.kafka.common.security.plain.PlainLoginModule required username='LDMGW4BF4LOA5Z4I' password='xxx';] mechanism:PLAIN]"
        security = "map[protocol:SASL_SSL]"
        ssl = "map[endpoint:map[identification:map[algorithm:https]]]"
      }
      labels = { }
    }
  ]
  reporters.prometheus.port = 8000
  sinks = ["PrometheusEndpointSink"]
  watchers = { strimzi = "false" }
  metric-whitelist = [ ".*" ]
}

akka {
  loggers = ["akka.event.slf4j.Slf4jLogger"]
  loglevel = "DEBUG"
  logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
}

Does this look normal? A normal application.conf should look more like:

  name = "lkc-0xr722"
  bootstrap-brokers = "xxx"
  security.protocol = "SASL_SSL"
  sasl.mechanism = "PLAIN"
  sasl.jaas.config = "org.apache.kafka.common.security.plain.PlainLoginModule required username='LDMGW4BF4LOA5Z4I' password='xxx';" 

  ...
  ...
  ...
sanjaygarde commented 2 years ago

I was able to fix this issue by fixing the configmap.
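
For anyone hitting the same rendering: the map[...] strings are Go's default formatting of nested maps, which suggests the Helm values carried each dotted Kafka property (security.protocol, sasl.jaas.config, ...) as a nested tree instead of a single flat key. A values.yaml sketch that keeps the dotted keys flat — the key names follow the chart's values layout, but treat the exact spelling as an assumption:

    clusters:
      - name: "lkc-0xr722"
        bootstrapBrokers: "xxx"
        consumerProperties:
          security.protocol: "SASL_SSL"
          sasl.mechanism: "PLAIN"
          sasl.jaas.config: "org.apache.kafka.common.security.plain.PlainLoginModule required username='...' password='...';"
        adminClientProperties:
          security.protocol: "SASL_SSL"
          sasl.mechanism: "PLAIN"
          sasl.jaas.config: "org.apache.kafka.common.security.plain.PlainLoginModule required username='...' password='...';"

If the same properties are passed via --set instead, each dot inside a property name must be escaped (e.g. clusters[0].consumerProperties.sasl\.mechanism=PLAIN); otherwise Helm splits on the dots and builds exactly this kind of nested map.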