Unhandled concurrency bug when there is an exception in best proposal precomputation.

efeg commented 5 years ago

When a rebalance request does not explicitly specify goals in its parameters, the CC analyzer attempts to cache the computed proposals for potential future use (e.g. a dryrun=true request followed by a dryrun=false request).
For this caching, it uses a background thread.
Until this background thread has computed the proposals to be cached, the main thread waits on a condition variable (CV).
When the ProposalPrecomputingExecutor (i.e. the background thread) is done with the computation, it notifies the CV. Due to a bug, however, when the optimization fails (e.g. insufficient number of racks), this thread never calls notifyAll() on the CV, leaving the main thread waiting indefinitely.

The exception in the logs:

ERROR [GoalOptimizer] [ProposalPrecomputingExecutor-1] [kafka-cruise-control] [] Proposal precomputation encountered error
com.linkedin.kafka.cruisecontrol.exception.OptimizationFailureException: Insufficient number of racks to distribute included replicas.
        at com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal.initGoalState(RackAwareGoal.java:177)
        at com.linkedin.kafka.cruisecontrol.analyzer.goals.AbstractGoal.optimize(AbstractGoal.java:74)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer.optimizations(GoalOptimizer.java:435)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer.optimizations(GoalOptimizer.java:381)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer$ProposalCandidateComputer.run(GoalOptimizer.java:689)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
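
To illustrate the failure mode described above, here is a minimal sketch of the wait/notify pattern with the notification moved into a finally block, so the waiting thread is released whether the computation succeeds or throws. This is not Cruise Control's actual code: the class ProposalCacheSketch, its fields and methods, and the timeout are all hypothetical names made up for the example, and the simulated failure only mimics the message of the OptimizationFailureException above.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the proposal cache; NOT Cruise Control's actual class. */
final class ProposalCacheSketch {
    private final Object lock = new Object();
    private boolean done = false;          // guarded by lock
    private String cachedProposal = null;  // guarded by lock
    private Exception failure = null;      // guarded by lock

    /** Runs on the background (precomputing) thread. */
    void precompute(boolean simulateFailure) {
        String result = null;
        Exception error = null;
        try {
            if (simulateFailure) {
                // Assumption: stands in for an optimization failure.
                throw new IllegalStateException("Insufficient number of racks");
            }
            result = "proposal";
        } catch (Exception e) {
            error = e;
        } finally {
            // Always publish the outcome and wake waiters, even on failure.
            synchronized (lock) {
                cachedProposal = result;
                failure = error;
                done = true;
                lock.notifyAll();
            }
        }
    }

    /** Runs on the request thread; returns the proposal or rethrows the failure. */
    String awaitProposal(long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        synchronized (lock) {
            while (!done) { // loop guards against spurious wakeups
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    throw new IllegalStateException("Timed out waiting for precomputation");
                }
                try {
                    lock.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting", e);
                }
            }
            if (failure != null) {
                throw new IllegalStateException("Precomputation failed", failure);
            }
            return cachedProposal;
        }
    }

    /** Starts a worker thread and reports what the waiting thread observes. */
    static String run(boolean simulateFailure) {
        ProposalCacheSketch cache = new ProposalCacheSketch();
        Thread worker = new Thread(() -> cache.precompute(simulateFailure), "precompute-sketch");
        worker.setDaemon(true);
        worker.start();
        try {
            return cache.awaitProposal(5_000);
        } catch (IllegalStateException e) {
            Throwable cause = e.getCause();
            return "failed: " + (cause != null ? cause.getMessage() : e.getMessage());
        }
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

If the notifyAll() call sat only on the success path, the second call in main would block until the timeout (or forever with an untimed wait), which matches the hang reported here. The bounded wait is an extra safeguard in this sketch, not something the report says the analyzer does.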

craynic commented 5 years ago

I'm getting this error:

ERROR Proposal precomputation encountered error (com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer)
java.lang.IllegalArgumentException: Inconsistent load distribution. Broker utilization for disk is different from the total replica utilization in the broker with id: 1. Sum of the replica utilization: 20869.701074123383, broker utilization: 12601.337890625
        at com.linkedin.kafka.cruisecontrol.model.ClusterModel.sanityCheck(ClusterModel.java:905)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer.optimizations(GoalOptimizer.java:453)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer.optimizations(GoalOptimizer.java:381)
        at com.linkedin.kafka.cruisecontrol.analyzer.GoalOptimizer$ProposalCandidateComputer.run(GoalOptimizer.java:689)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Is it related?

Actual disk usage (the whole partition is used for Kafka): /dev/vdb1 515930548 162880772 326818996 34% /data1

efeg commented 5 years ago

@craynic I suspect that this is an unrelated issue.

Is this a transient error, or are these errors logged continuously?
Does this issue happen in the background, or is it triggered by a request sent to the CC API?
What is the config value of num.proposal.precompute.threads?
Do you run single or multiple brokers per host?

craynic commented 5 years ago

I've created a new issue for this: #466

linkedin / cruise-control

Unhandled concurrency bug when there is an exception in best proposal precomputation. #460