linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License
2.76k stars 592 forks source link

Misleading error message when broker data can't be fetched from Zookeeper #155

Closed chowell5 closed 6 years ago

chowell5 commented 6 years ago

Hi,

I'm trying to get Cruise Control up and running, first of all I got

Exception in thread "main" java.lang.IllegalStateException: Cruise Control cannot find sampling topic matches __CruiseControlMetrics in the target cluster.

Well that's totally expected, I didn't create the topic. So I created it.

/usr/local/kafka/bin/kafka-topics.sh --zookeeper 36zk1.in.xxx:2181/kafka/abc --create --partitions 10 --topic __CruiseControlMetrics --replication-factor 3 --config cleanup.policy=delete

Now I receive

Exception in thread "main" org.apache.kafka.common.errors.InvalidReplicationFactorException: replication factor must be larger than 0

This is the only message I receive, no backtrace.

Config looks fine to me, definitely not 0.

chris@xxxx:~$ /usr/local/kafka/bin/kafka-topics.sh --zookeeper 36zk1.in.xxx:2181/kafka/abc --describe --topic  __CruiseControlMetrics
Topic:__CruiseControlMetrics    PartitionCount:10   ReplicationFactor:3 Configs:cleanup.policy=delete
    Topic: __CruiseControlMetrics   Partition: 0    Leader: 10130   Replicas: 10130,10227,10113 Isr: 10130,10227,10113
    Topic: __CruiseControlMetrics   Partition: 1    Leader: 10227   Replicas: 10227,10113,10128 Isr: 10227,10113,10128
    Topic: __CruiseControlMetrics   Partition: 2    Leader: 10113   Replicas: 10113,10128,10100 Isr: 10113,10128,10100
    Topic: __CruiseControlMetrics   Partition: 3    Leader: 10128   Replicas: 10128,10100,10114 Isr: 10128,10100,10114
    Topic: __CruiseControlMetrics   Partition: 4    Leader: 10100   Replicas: 10100,10114,10129 Isr: 10100,10114,10129
    Topic: __CruiseControlMetrics   Partition: 5    Leader: 10114   Replicas: 10114,10129,10102 Isr: 10114,10129,10102
    Topic: __CruiseControlMetrics   Partition: 6    Leader: 10129   Replicas: 10129,10102,10115 Isr: 10129,10102,10115
    Topic: __CruiseControlMetrics   Partition: 7    Leader: 10102   Replicas: 10102,10115,10130 Isr: 10102,10115,10130
    Topic: __CruiseControlMetrics   Partition: 8    Leader: 10115   Replicas: 10115,10130,10227 Isr: 10115,10130,10227
    Topic: __CruiseControlMetrics   Partition: 9    Leader: 10130   Replicas: 10130,10114,10102 Isr: 10130,10114,10102
efeg commented 6 years ago

I assume you have auto.create.topics.enable set to false? In this case, it is the right thing to create the topic manually. The command also looks correct.

However, the exception seems to be thrown during the topic creation from Kafka (the exact line is here).

Do Kafka logs show Exception in thread "main" java.lang.IllegalStateException: Cruise Control cannot find sampling topic matches __CruiseControlMetrics in the target cluster.? Is this a transient error that happened just once?

chowell5 commented 6 years ago

Hi,

Thanks for your reply, we do indeed set auto.create.topics.enable=false in our broker config.

I cannot see anything in our broker logs related to this. The error Exception in thread "main" org.apache.kafka.common.errors.InvalidReplicationFactorException: replication factor must be larger than 0 was generated by Cruise Control on startup on stderr.

It's a persistent error and happens every time I try to start Cruise Control, on all of our clusters. I repeated the steps several times; I tried deleting the topic and the error went back to Exception in thread "main" java.lang.IllegalStateException: Cruise Control cannot find sampling topic matches __CruiseControlMetrics in the target cluster.

After I re-created it, it went back to Exception in thread "main" org.apache.kafka.common.errors.InvalidReplicationFactorException: replication factor must be larger than 0

I do agree from your link that it looks like it it should only happen during topic creation, but as you can see from the output of kafka-topics.sh, Kafka had already created the topic by this point. Complete set of logs:

chris@xxx:~$ sudo docker logs cf71348ee9c4 2>&1 | tail -30
[2018-03-08 13:28:19,828] WARN The configuration 'num.metric.fetchers' was supplied but isn't a known config. (org.apache.kafka.clients.producer.ProducerConfig)
[2018-03-08 13:28:19,828] WARN The configuration 'sample.store.class' was supplied but isn't a known config. (org.apache.kafka.clients.producer.ProducerConfig)
[2018-03-08 13:28:19,828] WARN The configuration 'default.goals' was supplied but isn't a known config. (org.apache.kafka.clients.producer.ProducerConfig)
[2018-03-08 13:28:19,828] WARN The configuration 'num.load.snapshots' was supplied but isn't a known config. (org.apache.kafka.clients.producer.ProducerConfig)
[2018-03-08 13:28:19,828] WARN The configuration 'min.samples.per.load.snapshot' was supplied but isn't a known config. (org.apache.kafka.clients.producer.ProducerConfig)
[2018-03-08 13:28:19,828] INFO Kafka version : 0.10.2.1 (org.apache.kafka.common.utils.AppInfoParser)
[2018-03-08 13:28:19,828] INFO Kafka commitId : e89bffd6b2eff799 (org.apache.kafka.common.utils.AppInfoParser)
[2018-03-08 13:28:20,043] INFO Starting ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
[2018-03-08 13:28:20,049] INFO Client environment:zookeeper.version=3.4.9-1757313, built on 08/23/2016 06:50 GMT (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,049] INFO Client environment:host.name=xxx (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,049] INFO Client environment:java.version=1.8.0_151 (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:java.vendor=Oracle Corporation (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:java.home=/usr/lib/jvm/oracle-java8-jdk-amd64/jre (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:java.class.path=/cruise-control/cruise-control/build/dependant-libs/commons-math3-3.6.1.jar:/cruise-control/cruise-control/build/dependant-libs/cruise-control-core.jar:/cruise-control/cruise-control/build/dependant-libs/cruise-control-metrics-reporter.jar:/cruise-control/cruise-control/build/dependant-libs/gson-2.7.jar:/cruise-control/cruise-control/build/dependant-libs/hamcrest-core-1.3.jar:/cruise-control/cruise-control/build/dependant-libs/javax.servlet-api-3.1.0.jar:/cruise-control/cruise-control/build/dependant-libs/jetty-http-9.4.6.v20170531.jar:/cruise-control/cruise-control/build/dependant-libs/jetty-io-9.4.6.v20170531.jar:/cruise-control/cruise-control/build/dependant-libs/jetty-security-9.4.6.v20170531.jar:/cruise-control/cruise-control/build/dependant-libs/jetty-server-9.4.6.v20170531.jar:/cruise-control/cruise-control/build/dependant-libs/jetty-servlet-9.4.6.v20170531.jar:/cruise-control/cruise-control/build/dependant-libs/jetty-util-9.4.6.v20170531.jar:/cruise-control/cruise-control/build/dependant-libs/jline-0.9.94.jar:/cruise-control/cruise-control/build/dependant-libs/jopt-simple-5.0.3.jar:/cruise-control/cruise-control/build/dependant-libs/junit-4.12.jar:/cruise-control/cruise-control/build/dependant-libs/kafka-clients-0.10.2.1.jar:/cruise-control/cruise-control/build/dependant-libs/kafka_2.11-0.10.2.1.jar:/cruise-control/cruise-control/build/dependant-libs/log4j-1.2.17.jar:/cruise-control/cruise-control/build/dependant-libs/lz4-1.3.0.jar:/cruise-control/cruise-control/build/dependant-libs/metrics-core-2.2.0.jar:/cruise-control/cruise-control/build/dependant-libs/metrics-core-3.2.2.jar:/cruise-control/cruise-control/build/dependant-libs/netty-3.10.5.Final.jar:/cruise-control/cruise-control/build/dependant-libs/scala-library-2.11.11.jar:/cruise-control/cruise-control/build/dependant-libs/scala-parser-combinators_2.11-1.0.4.jar:/cruise-control/cruise-control/build/dependant-libs/slf4j-api-1.7.25.jar:/cruise-control/cruise-control/build/dependant-libs/slf4j-log4j12-1.7.21.jar:/cruise-control/cruise-control/build/dependant-libs/snappy-java-1.1.2.6.jar:/cruise-control/cruise-control/build/dependant-libs/zkclient-0.10.jar:/cruise-control/cruise-control/build/dependant-libs/zookeeper-3.4.9.jar:/cruise-control/cruise-control/build/libs/cruise-control.jar:/cruise-control/cruise-control-metrics-reporter/build/libs/cruise-control-metrics-reporter.jar (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:java.io.tmpdir=/tmp (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:java.compiler=<NA> (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:os.name=Linux (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:os.arch=amd64 (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:os.version=4.9.xxx (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:user.name=nobody (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:user.home=/nonexistent (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,050] INFO Client environment:user.dir=/cruise-control (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,051] INFO Initiating client connection, connectString=<xxx> sessionTimeout=30000 watcher=org.I0Itec.zkclient.ZkClient@609db43b (org.apache.zookeeper.ZooKeeper)
[2018-03-08 13:28:20,070] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2018-03-08 13:28:20,084] INFO Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2018-03-08 13:28:20,085] INFO Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2018-03-08 13:28:20,093] INFO Session establishment complete on server x.x.x.x/x.x.x.x:2181, sessionid = 0x46201fe7b840f3e, negotiated timeout = 30000 (org.apache.zookeeper.ClientCnxn)
[2018-03-08 13:28:20,095] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
Exception in thread "main" org.apache.kafka.common.errors.InvalidReplicationFactorException: replication factor must be larger than 0

Cheers

efeg commented 6 years ago

Thanks for the clarification! -- so this error is generated by Cruise Control (CC) on startup, after the manual creation of __CruiseControlMetrics topic.

This indicates that the error indeed happens when CC attempts to create Sample Store topics (default: __KafkaCruiseControlPartitionMetricSamples and __KafkaCruiseControlModelTrainingSamples defined in cruisecontrol.properties). Note that in this case having auto.create.topics.enable=false does not really have any adverse effects on explicit topic creation.

The issue here seems to be the non-positive replication factor interpretation by CC (the relevant line) due to getting a potentially empty sequence of brokers from the Kafka cluster (the relevant line).

chowell5 commented 6 years ago

Hi, thanks for your reply - that's correct. Thanks for the links to the code, that'll help me, I was struggling to figure out where the error was coming from.

WATCHER::

WatchedEvent state:SyncConnected type:None path:null [10129, 10227, 10128, 10130, 10100, 10114, 10113, 10102, 10115]


Thank you
chowell5 commented 6 years ago

After speaking to my colleagues (thanks!) we traced this down to a misconfiguration; we're using the feature of kafka that allows to specify a "choot" in the ZK tree: https://kafka.apache.org/08/documentation.html

- zookeeper.connect:
Zookeeper also allows you to add a "chroot" path which will make all
kafka data for this cluster appear under a particular path. This is a
way to setup multiple Kafka clusters or other applications on the same
zookeeper cluster. To do this give a connection string in the form
hostname1:port1,hostname2:port2,hostname3:port3/chroot/path which
would put all this cluster's data under the path /chroot/path. Note
that you must create this path yourself prior to starting the broker
and consumers must use the same connection string.

After appending /kafka/logs to the zookeeper.connect string CC correctly starts up. I believe this should remain a bug though as the error message is very misleading, it should report that the broker data couldn't be fetched from Zookeepr, rather than something about replication factors.

Thanks for your help in debugging this.

efeg commented 6 years ago

The error message is more clear after PR https://github.com/linkedin/cruise-control/pull/156 and https://github.com/linkedin/cruise-control/pull/169.