linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License
2.74k stars, 587 forks

OptimizationFailureException for ReplicaCapacityGoal on disk failure #1822

Closed: RagingPuppies closed this issue 2 years ago

RagingPuppies commented 2 years ago

Hi, first of all, thanks! I'm really enjoying CC. I've been running a POC of CC in my organization. We are working with Kafka 2.5.0 and, of course, CC 2.5 as well (Ubuntu 18). The POC runs on 3 brokers, each with 3 disks. I'm now testing the self-healing part: I deleted the data directory of one of the disks on the kafka-poc-40003 machine, and CC analyzes the cluster and shows that I have a bad disk. My cruisecontrol.properties has this property: max.replicas.per.broker=10000, and the current partition count is about 1500 per broker, 500 per disk. I get these exceptions when requesting proposals:

ERROR: Error processing GET request '/proposals' due to: 'com.linkedin.kafka.cruisecontrol.exception.OptimizationFailureException: [ReplicaCapacityGoal] Failed to move offline replica Replica[isLeader=false,rack=kafka-poc-40003,broker=40003,TopicPartition=__consumer_offsets-2,origBroker=40003,isOriginalOffline=true,isCurrentOffline=true] of partition <Partition> <Leader>Replica[isLeader=true,rack=kafka-poc-40002,broker=40002,TopicPartition=__consumer_offsets-2,origBroker=40002,isOriginalOffline=false,isCurrentOffline=false]</Leader> <Follower>Replica[isLeader=false,rack=kafka-poc-40003,broker=40003,TopicPartition=__consumer_offsets-2,origBroker=40003,isOriginalOffline=true,isCurrentOffline=true]</Follower> <Follower>Replica[isLeader=false,rack=kafka-poc-40001,broker=40001,TopicPartition=__consumer_offsets-2,origBroker=40001,isOriginalOffline=false,isCurrentOffline=false]</Follower> </Partition>%n to a broker in [Broker[id=40002,rack=kafka-poc-40002,state=ALIVE,replicaCount=1720,logdirs=[]], Broker[id=40001,rack=kafka-poc-40001,state=ALIVE,replicaCount=1744,logdirs=[]]]. Per broker limit: 10000 for brokers: [Broker[id=40001,rack=kafka-poc-40001,state=ALIVE,replicaCount=1744,logdirs=[]], Broker[id=40002,rack=kafka-poc-40002,state=ALIVE,replicaCount=1720,logdirs=[]], Broker[id=40003,rack=kafka-poc-40003,state=BAD_DISKS,replicaCount=1628,logdirs=[]]] Add at least 1 broker. Add at least 1 broker.'.
ERROR: Error processing GET request '/proposals' due to: 'com.linkedin.kafka.cruisecontrol.exception.OptimizationFailureException: [DiskCapacityGoal] Cannot remove offline replicas from broker 40003. Add at least 1 broker for disk. Add at least 1 broker for disk.'.

But as far as I understand, ReplicaCapacityGoal should only make sure I don't pass the 10000 partitions threshold, and the brokers are nowhere near it. This actually happens with many goals, such as DiskCapacityGoal and RackAwareGoal. So, any suggestions?

Thanks

RagingPuppies commented 2 years ago

It seems Cruise Control won't move data to the broker with the bad disk (even though it has other disks). Adding another broker solved that. I wonder if it's worth treating such a broker as healthy, but with a bad disk.
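
The behavior described in this thread can be illustrated with a small sketch. It is a simplified model, not Cruise Control's implementation: it assumes a destination broker must be ALIVE, must not already host a replica of the partition, and must stay within max.replicas.per.broker. The `Broker` class, the `eligible_destinations` function, and broker 40004 are hypothetical names introduced for illustration; the broker states and replica counts are taken from the error message above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Broker:
    broker_id: int
    state: str          # "ALIVE" or "BAD_DISKS", as printed in the error message
    replica_count: int


def eligible_destinations(brokers, partition_replica_brokers, max_replicas_per_broker):
    """Return IDs of brokers that could take one more replica of a partition.

    Simplified model (an assumption, not Cruise Control's code): a destination
    must be ALIVE, must not already host a replica of the partition, and must
    stay within the per-broker replica limit after the move.
    """
    return [
        b.broker_id
        for b in brokers
        if b.state == "ALIVE"
        and b.broker_id not in partition_replica_brokers
        and b.replica_count + 1 <= max_replicas_per_broker
    ]


# Cluster state as reported in the ReplicaCapacityGoal error.
brokers = [
    Broker(40001, "ALIVE", 1744),
    Broker(40002, "ALIVE", 1720),
    Broker(40003, "BAD_DISKS", 1628),
]

# __consumer_offsets-2 has one replica on each of the three brokers.
replicas_of_partition = {40001, 40002, 40003}

# Three brokers: every healthy broker already holds a replica.
three_broker_result = eligible_destinations(brokers, replicas_of_partition, 10000)

# Hypothetical fourth broker (40004), as in "Add at least 1 broker".
four_broker_result = eligible_destinations(
    brokers + [Broker(40004, "ALIVE", 0)], replicas_of_partition, 10000
)

print(three_broker_result)  # []
print(four_broker_result)   # [40004]
```

Under these assumptions, the candidate list is empty with three brokers regardless of the 10000 limit: each healthy broker already holds a copy of __consumer_offsets-2, and the broker in BAD_DISKS state is not considered. A fourth broker gives the offline replica somewhere to go, which matches what was observed here.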