linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
BSD 2-Clause "Simplified" License

Getting "java.lang.IllegalArgumentException: The partition bytes out rate is greater than the broker bytes out rate" exception #147

Closed jmarkan closed 5 years ago

jmarkan commented 6 years ago

Hello, I started Cruise Control for the first time about 9 hours ago. When I checked GET /kafkacruisecontrol/state, I saw that "trainingPct" was stuck at 20%. On further investigation of /logs/kafkacruisecontrol.log, I saw the following exception repeating:

[2018-02-28 12:08:15,614] ERROR Error building partition metric sample for __CruiseControlMetrics-5 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
java.lang.IllegalArgumentException: The partition bytes out rate 103.663761 is greater than the broker bytes out rate 95.375264
    at com.linkedin.kafka.cruisecontrol.model.ModelUtils.estimateLeaderCpuUtil(ModelUtils.java:81)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.buildPartitionMetricSample(CruiseControlMetricsProcessor.java:224)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.process(CruiseControlMetricsProcessor.java:73)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler.getSamples(CruiseControlMetricsReporterSampler.java:118)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchSamples(SamplingFetcher.java:89)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchMetricsForAssignedPartitions(SamplingFetcher.java:72)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:24)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:16)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Could someone please suggest whether a config needs to be tuned to get past this error? It has been stuck at 20% for many hours now.

Any help on this will be much appreciated.

fmunteanu commented 6 years ago

@becketqin @efeg, I'm experiencing the same issue on an AWS CentOS 7 x64 instance:

# uname -a
Linux ip-10-38-55-39.aws.internal 3.10.0-693.11.6.el7.x86_64 #1 SMP Thu Jan 4 01:06:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

[2018-03-05 20:04:22,862] ERROR Error building partition metric sample for __CruiseControlMetrics-7 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
java.lang.IllegalArgumentException: The partition bytes out rate 49.992053 is greater than the broker bytes out rate 45.994949
    at com.linkedin.kafka.cruisecontrol.model.ModelUtils.estimateLeaderCpuUtil(ModelUtils.java:81)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.buildPartitionMetricSample(CruiseControlMetricsProcessor.java:224)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.process(CruiseControlMetricsProcessor.java:73)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler.getSamples(CruiseControlMetricsReporterSampler.java:118)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchSamples(SamplingFetcher.java:89)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchMetricsForAssignedPartitions(SamplingFetcher.java:72)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:24)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:16)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[2018-03-05 20:04:22,862] ERROR Error building partition metric sample for __CruiseControlMetrics-5 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
java.lang.IllegalArgumentException: The partition bytes out rate 23.872961 is greater than the broker bytes out rate 21.964185
    at com.linkedin.kafka.cruisecontrol.model.ModelUtils.estimateLeaderCpuUtil(ModelUtils.java:81)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.buildPartitionMetricSample(CruiseControlMetricsProcessor.java:224)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor.process(CruiseControlMetricsProcessor.java:73)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler.getSamples(CruiseControlMetricsReporterSampler.java:118)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchSamples(SamplingFetcher.java:89)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher.fetchMetricsForAssignedPartitions(SamplingFetcher.java:72)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:24)
    at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcher.call(MetricFetcher.java:16)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

becketqin commented 6 years ago

@jmarkan @fmunteanu Sorry for the late response. Cruise Control runs a sanity check on the metrics reported by the broker. Basically, the metrics reported at the partition level should match the metrics reported at the broker level. Sometimes this check fails because the throughput is too low and the Yammer metrics used by the broker produce inconsistent results.

We are aware of this issue and are exploring whether using a five-minute rate rather than a one-minute rate would help solve the problem.
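
To illustrate the kind of check described above, here is a minimal sketch in Java. It is not Cruise Control's actual code: the class `RateSanityCheck` and the method `checkPartitionWithinBroker` are hypothetical names, and the only assumption is that a single partition's bytes-out rate should never exceed the bytes-out rate of the broker hosting it. The sample values are taken from the first log entry in this thread.

```java
import java.util.Locale;

public class RateSanityCheck {

    // Hypothetical helper: rejects a sample in which one partition
    // reports more outbound traffic than the whole broker hosting it.
    static void checkPartitionWithinBroker(double partitionBytesOut, double brokerBytesOut) {
        if (partitionBytesOut > brokerBytesOut) {
            throw new IllegalArgumentException(String.format(Locale.ROOT,
                "The partition bytes out rate %f is greater than the broker bytes out rate %f",
                partitionBytesOut, brokerBytesOut));
        }
    }

    public static void main(String[] args) {
        // A consistent sample passes silently.
        checkPartitionWithinBroker(10.0, 95.375264);

        // The inconsistent pair from the log above is rejected.
        try {
            checkPartitionWithinBroker(103.663761, 95.375264);
        } catch (IllegalArgumentException e) {
            System.out.println("Sample rejected: " + e.getMessage());
        }
    }
}
```

At low throughput the two rates are close together, so even a small disagreement between the two independently measured values can trip a strict comparison like this one.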

fmunteanu commented 6 years ago

@becketqin could this cause Cruise Control to get stuck at a percentage in /state? I had a cluster spawned with Cruise Control at 100%, killed one broker, and expected the data to be recovered. Cruise Control was stuck at 4% progress in /state. I don't have any real numbers, as I'm home, but I can post details tomorrow morning. I simply want to make sure this is not a related issue.

becketqin commented 6 years ago

@fmunteanu by stuck at 4%, which exact percentage are you referring to? Can you paste the /state output?

jmarkan commented 6 years ago

@becketqin here is the screenshot of /state showing the monitored window percentage dropping to 4% as soon as a broker was killed.

image

becketqin commented 6 years ago

@jmarkan What was your Kafka version? Did you see anything in the Cruise Control log?

jmarkan commented 6 years ago

@becketqin The Kafka version we have is 0.11. As for the Cruise Control log, here is what we saw:

image

To me it looks like killing a broker broke Cruise Control's own topics as well. The RF for all 3 CC topics was 3, and from the CC logs it looked as if it was stuck fixing its own topics.

Based on this, we decided to re-engineer the solution by having a separate Kafka cluster for Cruise Control, which would contain only its topics, and another Kafka cluster hosting the actual topics. In that second cluster, we pointed the config to send the metrics to the CC Kafka cluster. However, this didn't work, as Cruise Control apparently expects to see those topics in the cluster it is configured to monitor. So we tore down the Kafka cluster we had created for Cruise Control and created its topics in the target cluster. We also increased the RF for the CC topics from 3 to 8 (equal to the number of Kafka brokers we have).

I'll now kill 1 broker and see how cruise control reacts and gather logs.

efeg commented 6 years ago

@jmarkan in the cluster for which you sent the screenshot, can you check whether default.replication.factor in your broker configs equals 4 (i.e. the number of brokers in your cluster)? It looks like the RF of __CruiseControlMetrics is 4. Hence, when a broker is dead, e.g. broker-6030 in your case, the replica on the dead broker cannot be moved to another broker, because every other broker already holds a replica of the same partition.

In general, if the RF is set to the number of brokers in the cluster, Cruise Control is unable to move replicas between brokers. Note that there is a requirement that no two replicas of the same partition can reside on the same broker.
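
The constraint can be sketched in a few lines of Java. This is only an illustration, not Cruise Control's implementation: the class `ReplicaPlacementCheck`, the method `eligibleDestinations`, and the broker IDs 1 through 4 are all hypothetical. It assumes that a valid destination for a replica is any alive broker that does not already host a replica of the same partition.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ReplicaPlacementCheck {

    // Hypothetical helper: returns the alive brokers that could receive a
    // replica of this partition, i.e. those not already hosting one.
    static Set<Integer> eligibleDestinations(Set<Integer> aliveBrokers, List<Integer> replicas) {
        Set<Integer> eligible = new TreeSet<>(aliveBrokers);
        eligible.removeAll(replicas);
        return eligible;
    }

    public static void main(String[] args) {
        // Hypothetical 4-broker cluster in which broker 4 has died.
        Set<Integer> alive = new HashSet<>(Arrays.asList(1, 2, 3));

        // RF = 4 = number of brokers: every alive broker already has a replica.
        List<Integer> rfFour = Arrays.asList(1, 2, 3, 4);
        System.out.println("RF=4 destinations: " + eligibleDestinations(alive, rfFour));

        // RF = 3: one alive broker is still free to take the replica.
        List<Integer> rfThree = Arrays.asList(1, 2, 4);
        System.out.println("RF=3 destinations: " + eligibleDestinations(alive, rfThree));
    }
}
```

With RF equal to the broker count the eligible set is empty, so the replica on the dead broker has nowhere to go; with a lower RF at least one alive broker remains available.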

jmarkan commented 6 years ago

@efeg Unfortunately, the cluster that I tried to get this to monitor doesn't exist anymore. However, I didn't change the default RF setting that we enforce, which is 3. So RF was 3 in that cluster, and in the current cluster RF is still 3 for all topics except the CC topics, for which it is 8. Per your suggestion above, I'm ready to lower the RF for the CC topics from 8, but should I set it to 3 like the other topics, or is there another optimal value?

efeg commented 6 years ago

@jmarkan sure, please set it to 3. For Cruise Control to work as expected, no topic should have a replication factor >= the number of brokers.

jmarkan commented 6 years ago

@efeg ok, I just set it back to 3 for all 3 CC topics. Currently, no topic has an RF greater than 3. CC looks good so far, with all goals ready:

image

I'll stop 1 broker in a few minutes to see whether CC works as expected. I'll post logs here after this test.

jmarkan commented 6 years ago

@efeg Before I stopped the broker, I checked the CC logs and I'm seeing these lines constantly, although when I check /state I don't see anything abnormal:

[2018-03-16 19:08:38,891] ERROR Proposal precomputation encountered error (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
java.lang.NullPointerException
[2018-03-16 19:08:38,895] ERROR Proposal precomputation encountered error (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
java.lang.NullPointerException
[2018-03-16 19:08:38,895] INFO Finished precomputation 10 proposal candidates in 42 ms (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
[2018-03-16 19:08:41,621] ERROR Unexpected exception (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
java.lang.NullPointerException
[2018-03-16 19:08:41,621] ERROR Unexpected exception (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
java.lang.NullPointerException

Should I be worried?

Here is the screenshot of /state?verbose=true:

image

efeg commented 5 years ago

The exception reported in the title of this issue was fixed a while ago (https://github.com/linkedin/cruise-control/commit/0f773ca5ea403644b869f437f8188d685683b1f9), and the NPE issue should also be resolved in the current version. Closing the issue.