linkedin / cruise-control

Cruise-control is the first of its kind to fully automate dynamic workload rebalancing and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
BSD 2-Clause "Simplified" License

Unable to publish Cruise Control Metrics in Confluent Platform 5.3.1 #1296

Closed pedrojflores closed 4 years ago

pedrojflores commented 4 years ago

First time trying to get Cruise Control up and running in a three-node Confluent Platform 5.3.1 Kafka cluster using mutual TLS auth. I followed the instructions at https://github.com/linkedin/cruise-control/blob/master/README.md, but according to my Kafka logs I'm currently unable to send Cruise Control metrics. Here's a sample of the log messages I'm seeing:

[2020-08-02 20:43:02,318] WARN Failed to send Cruise Control metric [BROKER_METRIC,BROKER_FOLLOWER_FETCH_REQUEST_QUEUE_TIME_MS_MEAN,time=1596416882261,brokerId=1002,value=0.128] (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)

My server.properties looks like this (some values are redacted):

advertised.listeners=TRUSTED://mybroker:9093
authorizer.class.name=io.confluent.kafka.security.ldap.authorizer.LdapAuthorizer
auto.create.topics.enable=false
broker.id.generation.enable=true
confluent.metrics.reporter.bootstrap.servers=mybrokerlist
confluent.metrics.reporter.security.protocol=SSL
confluent.metrics.reporter.ssl.keystore.location=/etc/ssl/kafka.server.keystore.jks
confluent.metrics.reporter.ssl.keystore.password=somepassword
confluent.metrics.reporter.ssl.truststore.location=/etc/ssl/kafka.server.truststore.jks
confluent.metrics.reporter.ssl.truststore.password=somepassword
confluent.metrics.reporter.topic.replicas=3
confluent.support.metrics.enable=true
group.initial.rebalance.delay.ms=3
inter.broker.listener.name=TRUSTED
ldap.authorizer.group.member.attribute.pattern=[Cc][Nn]=([^,]*),.*
ldap.authorizer.group.object.class=group
ldap.authorizer.group.search.base=myous
ldap.authorizer.group.search.scope=2
ldap.authorizer.java.naming.provider.url=ldaps://someadserver:636
ldap.authorizer.java.naming.security.authentication=SIMPLE
ldap.authorizer.java.naming.security.credentials=somecredentials
ldap.authorizer.java.naming.security.principal=someuserinad
ldap.authorizer.java.naming.security.protocol=SSL
ldap.authorizer.license=mylicensestring
ldap.authorizer.refresh.interval.ms=60000
ldap.authorizer.ssl.truststore.location=/etc/ssl/kafka.server.truststore.jks
ldap.authorizer.ssl.truststore.password=somepassword
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,TRUSTED:SSL
listeners=TRUSTED://:9093
log.dirs=/data/kafka_data/kafka-logs
log.retention.check.interval.ms=300000
log.retention.hours=168
log.segment.bytes=1073741824
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter,com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
num.io.threads=8
num.network.threads=3
num.partitions=10
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.client.auth=requested
ssl.keystore.location=/etc/ssl/kafka.server.keystore.jks
ssl.keystore.password=somepassword
ssl.principal.mapping.rules=RULE:^.*[Cc][Nn]=(.*).q2dc.local.*$/$1/,RULE:^.*[Cc][Nn]=([a-zA-Z0-9 -]*).*$/$1/,DEFAULT
ssl.truststore.location=/etc/ssl/kafka.server.truststore.jks
ssl.truststore.password=somepassword
super.users=someusers;Group:somegroup
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
zookeeper.connect=somezookeeperhosts
zookeeper.connection.timeout.ms=6000
cruise.control.metrics.reporter.ssl.truststore.location=/etc/ssl/kafka.server.truststore.jks
cruise.control.metrics.reporter.ssl.truststore.password=somepassword
cruise.control.metrics.reporter.ssl.keystore.location=/etc/ssl/kafka.server.keystore.jks
cruise.control.metrics.reporter.ssl.keystore.password=somepassword
cruise.control.metrics.reporter.bootstrap.servers=mybrokerlist

Cruise Control jar file location

# ls -al /usr/share/java/kafka/cruise-control-metrics-reporter-2.0.122.jar
-rw-r--r-- 1 root root 47151 Jul 29 08:55 /usr/share/java/kafka/cruise-control-metrics-reporter-2.0.122.jar

Cruise Control Topic

Topic:__CruiseControlMetrics    PartitionCount:10       ReplicationFactor:1     Configs:min.insync.replicas=1,cleanup.policy=delete,segment.bytes=1073741824,retention.ms=604800000,max.message.bytes=1000012,retention.bytes=-1,delete.retention.ms=86400000
        Topic: __CruiseControlMetrics   Partition: 0    Leader: 1001    Replicas: 1001  Isr: 1001
        Topic: __CruiseControlMetrics   Partition: 1    Leader: 1003    Replicas: 1003  Isr: 1003
        Topic: __CruiseControlMetrics   Partition: 2    Leader: 1001    Replicas: 1001  Isr: 1001
        Topic: __CruiseControlMetrics   Partition: 3    Leader: 1003    Replicas: 1003  Isr: 1003
        Topic: __CruiseControlMetrics   Partition: 4    Leader: 1001    Replicas: 1001  Isr: 1001
        Topic: __CruiseControlMetrics   Partition: 5    Leader: 1003    Replicas: 1003  Isr: 1003
        Topic: __CruiseControlMetrics   Partition: 6    Leader: 1001    Replicas: 1001  Isr: 1001
        Topic: __CruiseControlMetrics   Partition: 7    Leader: 1003    Replicas: 1003  Isr: 1003
        Topic: __CruiseControlMetrics   Partition: 8    Leader: 1001    Replicas: 1001  Isr: 1001
        Topic: __CruiseControlMetrics   Partition: 9    Leader: 1003    Replicas: 1003  Isr: 1003

Any help in figuring out what's preventing me from publishing metrics will be greatly appreciated.

efeg commented 4 years ago

Hi @pedrojflores, to see the underlying producer exception received by the Cruise Control metrics reporter, would you be able to enable debug-level logs for the underlying producer and share the stack trace?
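
For example, something along these lines in the broker's log4j.properties should surface the producer-side logs after a broker restart (the exact file path may differ in a Confluent Platform installation, e.g. /etc/kafka/log4j.properties):

# Enable DEBUG logging for the Kafka clients running inside the broker,
# which includes the producer created by CruiseControlMetricsReporter.
log4j.logger.org.apache.kafka.clients=DEBUG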

pedrojflores commented 4 years ago

Will do. Let me get that set up and I'll post the stack trace as soon as I get one.

pedrojflores commented 4 years ago

Here's what I'm seeing after setting the log level to DEBUG:

[2020-08-11 14:51:20,699] DEBUG [Producer clientId=CruiseControlMetricsReporter] Node -2 disconnected. (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:20,699] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:20,749] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:20,799] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:20,839] DEBUG [ReplicaFetcher replicaId=1002, leaderId=1001, fetcherId=0] Node 1001 sent an incremental fetch response for session 454667497 with 0 response partition(s), 38 implied partition(s) (org.apache.kafka.clients.FetchSessionHandler)
[2020-08-11 14:51:20,839] DEBUG [ReplicaFetcher replicaId=1002, leaderId=1001, fetcherId=0] Built incremental fetch (sessionId=454667497, epoch=163) for node 1001. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 38 partition(s) (org.apache.kafka.clients.FetchSessionHandler)
[2020-08-11 14:51:20,850] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:20,900] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:20,950] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,000] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,051] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,101] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,139] DEBUG [ReplicaFetcher replicaId=1002, leaderId=1003, fetcherId=0] Node 1003 sent an incremental fetch response for session 2110777851 with 0 response partition(s), 39 implied partition(s) (org.apache.kafka.clients.FetchSessionHandler)
[2020-08-11 14:51:21,346] DEBUG [ReplicaFetcher replicaId=1002, leaderId=1001, fetcherId=0] Built incremental fetch (sessionId=454667497, epoch=164) for node 1001. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 38 partition(s) (org.apache.kafka.clients.FetchSessionHandler)
[2020-08-11 14:51:21,359] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,409] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,459] DEBUG [Producer clientId=CruiseControlMetricsReporter] Give up sending metadata request since no node is available (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,509] DEBUG [Producer clientId=CruiseControlMetricsReporter] Initialize connection to node <some node>:9093 (id: -1 rack: null) for sending metadata request (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,510] DEBUG [Producer clientId=CruiseControlMetricsReporter] Initiating connection to node <some node>:9093 (id: -1 rack: null) using address <someip> (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,510] DEBUG [Producer clientId=CruiseControlMetricsReporter] Created socket with SO_RCVBUF = 32768, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node -1 (org.apache.kafka.common.network.Selector)
[2020-08-11 14:51:21,510] DEBUG [Producer clientId=CruiseControlMetricsReporter] Completed connection to node -1. Fetching API versions. (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,510] DEBUG [Producer clientId=CruiseControlMetricsReporter] Initiating API versions fetch from node -1. (org.apache.kafka.clients.NetworkClient)
[2020-08-11 14:51:21,566] DEBUG [Producer clientId=CruiseControlMetricsReporter] Connection with <some ip> disconnected (org.apache.kafka.common.network.Selector)
java.io.EOFException
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119)
        at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:436)
        at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:397)
        at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:653)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:574)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:485)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:539)
        at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:335)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:244)
        at java.base/java.lang.Thread.run(Thread.java:834)

efeg commented 4 years ago

@pedrojflores This is a configuration issue. You are missing at least the following config:

cruise.control.metrics.reporter.security.protocol=SSL
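
For a mutual-TLS listener, the reporter's producer typically also needs the keystore/truststore settings with the same prefix; a sketch with placeholder paths, passwords, and hosts:

# Producer settings for the Cruise Control metrics reporter (all prefixed with cruise.control.metrics.reporter.)
cruise.control.metrics.reporter.security.protocol=SSL
cruise.control.metrics.reporter.bootstrap.servers=mybroker1:9093,mybroker2:9093,mybroker3:9093
cruise.control.metrics.reporter.ssl.truststore.location=/etc/ssl/kafka.server.truststore.jks
cruise.control.metrics.reporter.ssl.truststore.password=somepassword
cruise.control.metrics.reporter.ssl.keystore.location=/etc/ssl/kafka.server.keystore.jks
cruise.control.metrics.reporter.ssl.keystore.password=somepassword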

Hope it helps!

pedrojflores commented 4 years ago

Thanks @efeg

These are the Cruise Control options I'm using now, and I'm still having issues:

cruise.control.metrics.reporter.ssl.client.auth=requested
cruise.control.metrics.reporter.advertised.listeners=TRUSTED://myhostname:9093,PLAINTEXT://myhostname:9094
cruise.control.metrics.reporter.inter.broker.listener.name=TRUSTED
cruise.control.metrics.reporter.listeners=TRUSTED://:9093,PLAINTEXT://:9094
cruise.control.metrics.reporter.listener.security.protocol.map=PLAINTEXT:PLAINTEXT,TRUSTED:SSL
cruise.control.metrics.reporter.bootstrap.servers=list of servers listening on 9093
cruise.control.metrics.reporter.security.inter.broker.protocol=SSL
cruise.control.metrics.reporter.security.protocol=SSL
cruise.control.metrics.reporter.ssl.keystore.location=/etc/ssl/kafka.server.keystore.jks
cruise.control.metrics.reporter.ssl.keystore.password=mypassword
cruise.control.metrics.reporter.ssl.truststore.location=/etc/ssl/kafka.server.truststore.jks
cruise.control.metrics.reporter.ssl.truststore.password=mypassword

I'm getting these errors now, which seem to be SSL-related:

[2020-08-13 10:27:39,603] DEBUG [SocketServer brokerId=1002] Connection with /<some_broker> disconnected (org.apache.kafka.common.network.Selector)
java.io.EOFException
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:96)
        at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:436)
        at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:397)
        at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:653)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:574)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:485)
        at kafka.network.Processor.poll(SocketServer.scala:884)
        at kafka.network.Processor.run(SocketServer.scala:783)
        at java.base/java.lang.Thread.run(Thread.java:834)

I'm not sure what other SSL-related options I need to provide Cruise Control for the metrics reporter to work properly.

pedrojflores commented 4 years ago

Seeing this as well:

[2020-08-13 11:12:46,696] ERROR Got exception in Cruise Control metrics reporter (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.MetricsUtils.yammerMetricScopeToTags(MetricsUtils.java:208)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.MetricsUtils.isInterested(MetricsUtils.java:196)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.YammerMetricProcessor.processGauge(YammerMetricProcessor.java:139)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.YammerMetricProcessor.processGauge(YammerMetricProcessor.java:24)
        at com.yammer.metrics.core.Gauge.processWith(Gauge.java:28)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.reportYammerMetrics(CruiseControlMetricsReporter.java:336)
        at com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.run(CruiseControlMetricsReporter.java:268)
        at java.base/java.lang.Thread.run(Thread.java:834)

pedrojflores commented 4 years ago

So, to take mutual TLS out of the picture as a possible culprit, I went ahead and configured the Cruise Control metrics reporter to connect to the PLAINTEXT ports and updated the ACLs on the Cruise Control topics to allow the ANONYMOUS user to access those topics.

cruise.control.metrics.reporter.bootstrap.servers=broker1.ec2.internal:9094,broker2.ec2.internal:9094,broker3.ec2.internal:9094
cruise.control.metrics.reporter.security.protocol=PLAINTEXT
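
For illustration, an ACL grant like the following would let the ANONYMOUS principal (i.e. unauthenticated PLAINTEXT clients) produce to the metrics topic. This is only a sketch with a placeholder ZooKeeper host, and the Confluent LDAP authorizer may require different tooling:

kafka-acls --authorizer-properties zookeeper.connect=somezookeeper:2181 \
  --add --allow-principal User:ANONYMOUS \
  --operation Write --operation Describe \
  --topic __CruiseControlMetrics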

However, I'm still seeing the following error in the broker logs:

[2020-09-15 21:55:07,274] ERROR Got exception in Cruise Control metrics reporter (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
java.lang.ArrayIndexOutOfBoundsException
[2020-09-15 21:56:07,275] ERROR Got exception in Cruise Control metrics reporter (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
java.lang.ArrayIndexOutOfBoundsException
[2020-09-15 21:57:07,276] ERROR Got exception in Cruise Control metrics reporter (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
java.lang.ArrayIndexOutOfBoundsException
[2020-09-15 21:58:07,277] ERROR Got exception in Cruise Control metrics reporter (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
java.lang.ArrayIndexOutOfBoundsException
[2020-09-15 21:59:07,278] ERROR Got exception in Cruise Control metrics reporter (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
java.lang.ArrayIndexOutOfBoundsException

Any ideas? Anyone?

efeg commented 4 years ago

@pedrojflores According to the stack trace shared in https://github.com/linkedin/cruise-control/issues/1296#issuecomment-673570001, this is an issue with a failure to parse the scope of a Yammer metric: the scope is expected to have a "." in it to separate key/value pairs, but in this case it seems to have none.

This might be due to differences between the Yammer metrics in Confluent Kafka and Apache Kafka. I've created a PR to address this.
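
To illustrate the failure mode only (this is a simplified sketch, not the actual MetricsUtils code), parsing a scope as alternating dot-separated key/value tokens fails in exactly this way when the scope contains no ".":

import java.util.HashMap;
import java.util.Map;

public class ScopeParseSketch {
  // A Yammer metric scope such as "clientId.producer-1.topic.myTopic" is read as
  // alternating key/value tokens after splitting on '.'.
  static Map<String, String> scopeToTags(String scope) {
    Map<String, String> tags = new HashMap<>();
    if (scope != null) {
      String[] tokens = scope.split("\\.");
      for (int i = 0; i < tokens.length; i += 2) {
        // With an odd token count (e.g. a scope containing no '.'), tokens[i + 1]
        // overruns the array: ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1.
        tags.put(tokens[i], tokens[i + 1]);
      }
    }
    return tags;
  }

  public static void main(String[] args) {
    System.out.println(scopeToTags("clientId.producer-1.topic.myTopic")); // tags: clientId -> producer-1, topic -> myTopic
    System.out.println(scopeToTags("confluentOnlyScope"));                // throws as described above
  }
}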

efeg commented 4 years ago

@pedrojflores Can you verify if the issue has been resolved in Cruise Control version 2.0.129?