linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Issue with v0.1.4 #325

Closed jmarkan closed 5 years ago

jmarkan commented 6 years ago

Hello, we're on v0.1.4 and just spun up a Cruise Control machine, and we see this warning all over the logs:

[2018-09-13 18:31:36,539] WARN Goal violation detector received exception (com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector)
com.linkedin.kafka.cruisecontrol.exception.OptimizationFailureException: Insufficient healthy cluster capacity for resource:disk existing cluster utilization 2215449.25 allowed capacity 720000.0
    at com.linkedin.kafka.cruisecontrol.analyzer.goals.CapacityGoal.initGoalState(CapacityGoal.java:173)
    at com.linkedin.kafka.cruisecontrol.analyzer.goals.AbstractGoal.optimize(AbstractGoal.java:81)
    at com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector.optimizeForGoal(GoalViolationDetector.java:172)
    at com.linkedin.kafka.cruisecontrol.detector.GoalViolationDetector.run(GoalViolationDetector.java:125)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

In my capacity.json, the disk capacity is set as "DISK": "500000", so I'm not sure where it is getting the value 720000.0 from.

Could you please advise, @becketqin or @efeg?

efeg commented 6 years ago

Hi @jmarkan -- Sorry for the late response!

Explanation of the Exception

720000.0 in the exception message corresponds to [sum of all broker DISK capacities] * capacityThreshold for DISK. For example, if (1) the cluster has 5 brokers, (2) one broker (e.g. broker-0) has a disk capacity of 500000.0 and the remaining 4 brokers use the default disk capacity of 100000.0, and (3) the disk capacity threshold, which is specified by the config disk.capacity.threshold, is 0.8 (the default value specified in config/cruisecontrol.properties), then the allowed cluster disk capacity would be `[(4 * 100000.0) + (1 * 500000.0)] * 0.8 = 720000.0`.

In the above exception message, the Goal violation detector, which is one of the three anomaly detectors of Cruise Control, reports that the total disk space usage over all brokers is 2215449.25, but the allowed disk capacity is 720000.0. Hence, under the given constraints, it cannot satisfy the requirements of the Disk Capacity Goal.
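The arithmetic can be reproduced with a minimal Python sketch. This is not Cruise Control code: `allowed_capacity` is a hypothetical helper, the broker ids are made up for the example, and the capacities are the ones from the 5-broker example above.

```python
# Hypothetical helper (not part of Cruise Control): it only reproduces the
# arithmetic behind the "allowed capacity" figure in the exception message.
def allowed_capacity(broker_capacities, threshold):
    """Sum of per-broker capacities, scaled by the capacity threshold."""
    return sum(broker_capacities.values()) * threshold


# Example cluster: broker 0 is overridden to 500000.0, brokers 1-4 use the
# default of 100000.0. The broker ids are illustrative.
disk_capacities = {0: 500000.0, 1: 100000.0, 2: 100000.0, 3: 100000.0, 4: 100000.0}

allowed = allowed_capacity(disk_capacities, threshold=0.8)
utilization = 2215449.25  # "existing cluster utilization" from the log above

print(f"allowed disk capacity: {allowed:.1f}")
print(f"capacity goal violated: {utilization > allowed}")
```

With these inputs the utilization is roughly three times the allowed capacity, which is the condition the exception reports.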

Things to Check

  1. Ensure that the capacity.json path is correctly specified in config/cruisecontrol.properties, and you actually modify the capacity.json specified by this config.
  2. If you are providing a default broker capacity (i.e. represented with a broker having -1 as its id), either (1) make sure that you really intend the default to apply to every broker whose capacity is not provided in this file, or (2) explicitly provide the capacity of each broker in the cluster.
  3. Make sure that you bounce your Cruise Control instance after any changes to capacity.json.
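To see which brokers silently fall back to the default entry, a small check along these lines may help. It is only a sketch: the JSON layout (`brokerCapacities`, `brokerId`, `capacity`) is assumed from the sample capacity.json shipped with Cruise Control and should be verified against your version, and `resolve_disk_capacities` is a hypothetical function, not the project's capacity resolver.

```python
import json

# Assumed file layout, modeled on the sample capacity.json; check the keys
# against the file in your own Cruise Control version.
CAPACITY_JSON = """
{
  "brokerCapacities": [
    {"brokerId": "-1", "capacity": {"DISK": "100000", "CPU": "100", "NW_IN": "10000", "NW_OUT": "10000"}},
    {"brokerId": "0", "capacity": {"DISK": "500000", "CPU": "100", "NW_IN": "50000", "NW_OUT": "50000"}}
  ]
}
"""


def resolve_disk_capacities(capacity_doc, broker_ids):
    """Return {broker_id: (disk_capacity, used_default)} for each broker id."""
    entries = {
        int(entry["brokerId"]): float(entry["capacity"]["DISK"])
        for entry in capacity_doc["brokerCapacities"]
    }
    default = entries.get(-1)  # broker id -1 marks the default entry
    resolved = {}
    for broker_id in broker_ids:
        if broker_id in entries:
            resolved[broker_id] = (entries[broker_id], False)
        elif default is not None:
            resolved[broker_id] = (default, True)
        else:
            raise KeyError(f"no capacity entry and no default for broker {broker_id}")
    return resolved


# Broker ids 0-4 are illustrative; use the ids of your own cluster.
resolved = resolve_disk_capacities(json.loads(CAPACITY_JSON), broker_ids=range(5))
on_default = [b for b, (_, used_default) in resolved.items() if used_default]
print("brokers using the default DISK capacity:", on_default)
```

Any broker listed in `on_default` is one whose capacity you did not provide explicitly, so it is worth confirming that the default value is really what you want for it.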

Hope it helps!

efeg commented 6 years ago

@jmarkan Has this issue been resolved?

jmarkan commented 6 years ago

Hi @efeg, thanks a lot for the detailed response. Here are my inputs on the things you asked me to check:

  1. Yes, 'capacity.json' is correctly specified in 'config/cruisecontrol.properties', and I put the actual broker values in it.
  2. We explicitly provide the capacity of each broker in the cluster.
  3. Yes, CC is bounced every time we change any values in capacity.json.

We noticed that the values of the disk capacities in 'capacity.json' were incorrect. We're in the process of correcting those values in order to avoid this warning.

Once we do that, I'll confirm here if the issue is fixed or not.

jmarkan commented 6 years ago

Hi @efeg Looks like the original issue got fixed after I applied your suggestions. Thanks a lot for those. However, we're seeing these in the CC logs:

[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6012 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6008 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6004 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6013 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6007 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6014 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6002 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6001 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6015 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6011 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6009 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6005 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6003 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6006 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] WARN Skip generating broker metric sample for broker 6010 because the following metrics are missing [] (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,430] INFO Generated 4471 partition metric samples and 0(15 skipped) broker metric samples for timestamp 1538060251819 (com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsProcessor)
[2018-09-27 14:57:40,446] INFO Collected 4471 partition metric samples for 4471 partitions. Total partition assigned: 4471. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)
[2018-09-27 14:57:40,446] INFO Collected 0 broker metric samples for 0 brokers. (com.linkedin.kafka.cruisecontrol.monitor.sampling.SamplingFetcher)

I took a look at the brokers' logs and didn't see any indication that they were unable to send metrics to the CC topics. I also validated that the CC jar file is at the proper location on the brokers.

Is this something you could take a look at and suggest a fix for?

jmarkan commented 6 years ago

@efeg Following up: do you have any suggestions for the WARN logs I posted above?

efeg commented 5 years ago

@jmarkan this issue has recently been fixed in https://github.com/linkedin/cruise-control/pull/443 -- thanks for reporting, sorry for the delayed fix.

qz-fordham commented 3 years ago

@efeg Thanks for the discussion above. There is one thing I don't understand, and this may be a naive question. For the original error message

disk existing cluster utilization 2215449.25 allowed capacity 720000.0

If 80% of all disk space is 720,000 MB, then 100% of all disk space is 900,000 MB, which is still less than 2,215,449.25 MB. My question is: if the total disk space is only 900,000 MB, then how and where is the extra data (2215449 - 900000 = 1315449 MB) being stored?

efeg commented 3 years ago

@qz-fordham Unless users implement their own pluggable capacity resolver, the default capacity resolver retrieves the broker capacity information from a file. This file is expected to be populated by users, and it should reflect the real capacity of the brokers. If users forget to populate this file, or use incorrect capacity information while doing so, CC gets unrealistic capacity information. In contrast, the bytes stored in Kafka logs come from Kafka metrics; hence, they represent the actual, current data.
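A rough sanity check for this mismatch, as a sketch only: `implied_total_capacity` is a hypothetical helper, and the threshold of 0.8 is assumed to be the default discussed earlier in this thread. If the measured usage exceeds even 100% of the configured capacity, the capacity file, not the cluster, is the likely culprit.

```python
# Hypothetical helper (not a Cruise Control API): back out the configured total
# from the "allowed capacity" in the log and the capacity threshold.
def implied_total_capacity(allowed_capacity, threshold):
    return allowed_capacity / threshold


configured_total = implied_total_capacity(720000.0, threshold=0.8)  # 0.8 is assumed
measured_usage = 2215449.25  # from Kafka metrics, per the exception message

# Usage above 100% of the configured capacity is physically impossible, so it
# points at unrealistic values in the capacity file.
capacity_file_suspect = measured_usage > configured_total
print(f"configured total: {configured_total:.2f}")
print(f"usage beyond configured total: {measured_usage - configured_total:.2f}")
print(f"capacity file looks wrong: {capacity_file_suspect}")
```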

In this example, it looks like the user indicated the following:

We noticed that the values of the disk capacities in 'capacity.json' were incorrect. We're in the process of correcting those values in order to avoid this warning.

Hope it clarifies the root cause.

qz-fordham commented 3 years ago

@efeg Thank you so much! That definitely cleared things up.

A small suggestion, then: since the capacity file needs to be populated by the user (otherwise the capacities are not accurate), it may be worth mentioning this in the quick-start tutorial. Thanks again.

efeg commented 3 years ago

@qz-fordham That is a good suggestion -- would you be interested in creating a PR to update the relevant README.md?

qz-fordham commented 3 years ago

@efeg Yes. In fact, I have already created a PR. Let me know if there are any guidelines I need to be aware of when making PRs for this repo.