linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Unhandled concurrency bug when there is an exception in best proposal precomputation. #460

Closed: efeg closed this issue 5 years ago

efeg commented 5 years ago
  1. When a rebalance request does not explicitly specify goals in its parameters, the CC analyzer attempts to cache the proposal computation for potential future use (e.g. for dryrun=true followed by dryrun=false).
  2. For this caching, it uses a background thread.
  3. Until this background thread computes the proposals to be cached, the main thread waits on a condition variable (CV).
  4. When the ProposalPrecomputingExecutor (i.e. the background thread) is done with the computation, it notifies the CV. Due to a bug, however, when the computation hits an optimization failure (e.g. an insufficient number of racks), this thread never calls notifyAll() on the CV, leaving the main thread waiting indefinitely.
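
The failure mode in step 4 can be reproduced in isolation. The following is a minimal sketch, not the actual GoalOptimizer code: the names `PrecomputeSketch`, `computeProposals`, `precomputeTask`, and `awaitResult` are hypothetical, and a plain `RuntimeException` stands in for `OptimizationFailureException`. It shows one general remedy, which is to record the failure and call `notifyAll()` in a `finally` block so that the waiting thread is woken up on both success and failure.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PrecomputeSketch {
    private final Object lock = new Object();
    private String cachedProposal; // guarded by lock
    private Exception failure;     // guarded by lock
    private boolean done;          // guarded by lock

    // Hypothetical stand-in for the proposal computation.
    private String computeProposals(boolean fail) {
        if (fail) {
            throw new RuntimeException("Insufficient number of racks");
        }
        return "proposal";
    }

    // Background task: always publishes an outcome and notifies waiters.
    public Runnable precomputeTask(boolean fail) {
        return () -> {
            String result = null;
            Exception error = null;
            try {
                result = computeProposals(fail);
            } catch (Exception e) {
                error = e; // remember the failure instead of losing it
            } finally {
                synchronized (lock) {
                    cachedProposal = result;
                    failure = error;
                    done = true;
                    lock.notifyAll(); // runs on success and on failure
                }
            }
        };
    }

    // Main thread: waits in a loop and rethrows the recorded failure, if any.
    public String awaitResult() throws Exception {
        synchronized (lock) {
            while (!done) {
                lock.wait();
            }
            if (failure != null) {
                throw failure;
            }
            return cachedProposal;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        PrecomputeSketch sketch = new PrecomputeSketch();
        executor.submit(sketch.precomputeTask(true));
        try {
            sketch.awaitResult();
        } catch (Exception e) {
            System.out.println("Main thread woke up with: " + e.getMessage());
        } finally {
            executor.shutdown();
        }
    }
}
```

If the `notifyAll()` call were placed only on the success path, the `awaitResult()` call in this sketch would block forever whenever `computeProposals` throws, which matches the hang described above.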

The exception in the logs:

ERROR [GoalOptimizer] [ProposalPrecomputingExecutor-1] [kafka-cruise-control] [] Proposal precomputation encountered error
com.linkedin.kafka.cruisecontrol.exception.OptimizationFailureException: Insufficient number of racks to distribute included replicas.
        at com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal.initGoalState(RackAwareGoal.java:177)
        at com.linkedin.kafka.cruisecontrol.analyzer.goals.AbstractGoal.optimize(AbstractGoal.java:74)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer.optimizations(GoalOptimizer.java:435)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer.optimizations(GoalOptimizer.java:381)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer$ProposalCandidateComputer.run(GoalOptimizer.java:689)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

craynic commented 5 years ago

I'm getting this error:

ERROR Proposal precomputation encountered error (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
java.lang.IllegalArgumentException: Inconsistent load distribution. Broker utilization for disk is different from the total replica utilization in the broker with id: 1. Sum of the replica utilization: 20869.701074123383, broker utilization: 12601.337890625
        at com.linkedin.kafka.cruisecontrol.model.ClusterModel.sanityCheck(ClusterModel.java:905)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer.optimizations(GoalOptimizer.java:453)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer.optimizations(GoalOptimizer.java:381)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer$ProposalCandidateComputer.run(GoalOptimizer.java:689)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Is it related?

Actual disk usage (the whole partition is used for Kafka): /dev/vdb1 515930548 162880772 326818996 34% /data1
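
For context, the message above says that the disk utilization recorded for broker 1 does not match the sum over that broker's replicas. A minimal sketch of this kind of consistency check might look like the following. The `checkBrokerLoad` helper, the `EPSILON` tolerance, and the single-replica input are illustrative assumptions and are not taken from `ClusterModel.sanityCheck`; only the two utilization figures come from the log.

```java
import java.util.Arrays;
import java.util.List;

public class LoadConsistencySketch {
    // Hypothetical tolerance for floating-point comparison (an assumption).
    private static final double EPSILON = 1e-3;

    // Throws if the broker-level utilization differs from the sum of its replicas.
    public static void checkBrokerLoad(int brokerId,
                                       double brokerUtilization,
                                       List<Double> replicaUtilizations) {
        double sum = 0.0;
        for (double utilization : replicaUtilizations) {
            sum += utilization;
        }
        if (Math.abs(sum - brokerUtilization) > EPSILON) {
            throw new IllegalArgumentException(
                "Inconsistent load distribution in broker " + brokerId
                + ". Sum of the replica utilization: " + sum
                + ", broker utilization: " + brokerUtilization);
        }
    }

    public static void main(String[] args) {
        // Consistent input: 10.0 + 20.0 equals the broker total, so no exception.
        checkBrokerLoad(1, 30.0, Arrays.asList(10.0, 20.0));

        // Figures from the log, with all replica load lumped into one entry
        // purely for illustration.
        try {
            checkBrokerLoad(1, 12601.337890625, Arrays.asList(20869.701074123383));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```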

efeg commented 5 years ago

@craynic I suspect that this is an unrelated issue.

  1. Is this a transient error, or are these errors logged continuously?
  2. Does this issue happen in the background, or is it triggered by a request sent to the CC API?
  3. What is the config value of num.proposal.precompute.threads?
  4. Do you run a single broker or multiple brokers per host?

craynic commented 5 years ago

I've created a new issue for this: #466.