linkedin / cruise-control

Cruise-control is the first of its kind to fully automate the dynamic workload rebalance and self-healing of a Kafka cluster. It provides great value to Kafka users by simplifying the operation of Kafka clusters.
https://github.com/linkedin/cruise-control/tags
BSD 2-Clause "Simplified" License

Cruise Control sometimes is not able to detect offline partitions #1828

Closed by CCisGG 1 year ago

CCisGG commented 2 years ago

Today, Cruise Control determines whether a partition is offline by checking whether its ISR set is empty:

https://github.com/linkedin/cruise-control/blob/a8a190f7c4662e1fc742994792c75b108fd9064d/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/servlet/response/ClusterPartitionState.java#L105

While this works correctly most of the time, it is not consistent with Kafka's own definition. In Kafka, an offline partition is a partition whose leader is not alive.

Recently we detected an issue with a partition that was in the following bad state:

{"controller_epoch":10000,"leader":-1,"version":1,"leader_epoch":50,"isr":[12345]}

Cruise Control does not detect it as an offline partition, even though the Kafka metric reports it as offline. Cruise Control only treats it as an under-replicated partition (URP).

I think Cruise Control should use the same criterion, "whether the leader is alive", to determine whether a partition is offline.
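To make the gap concrete, here is a minimal sketch that applies both criteria to the state shown above. The class and method names (`OfflineCriteria`, `isOfflineByIsr`, `isOfflineByLeader`) are hypothetical and are not part of Cruise Control; the partition state is modeled with plain values rather than Kafka's own classes.

```java
import java.util.List;

public class OfflineCriteria {
    // Leader id reported when a partition has no live leader, as in the state above.
    static final int NO_LEADER = -1;

    // Criterion described in this issue as the current one: offline only when the ISR set is empty.
    static boolean isOfflineByIsr(List<Integer> isr) {
        return isr.isEmpty();
    }

    // Proposed criterion: offline when the partition has no live leader.
    static boolean isOfflineByLeader(int leader) {
        return leader == NO_LEADER;
    }

    public static void main(String[] args) {
        // Values taken from the reported state: "leader":-1, "isr":[12345].
        int leader = -1;
        List<Integer> isr = List.of(12345);

        System.out.println("ISR-based check:    " + isOfflineByIsr(isr));
        System.out.println("leader-based check: " + isOfflineByLeader(leader));
    }
}
```

For this state the ISR-based check returns false because the ISR still lists broker 12345, while the leader-based check returns true, which matches what the Kafka metric reports.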

ayana-s commented 1 year ago

👋 We've also experienced this behaviour, where Cruise Control doesn't accurately detect and display offline partitions.

Before creating offline partitions:

[Image: Screenshot 2023-03-22 at 1 20 01 PM]

After creating offline partitions, when describing our topic's partitions with kcat:

topic "test-ayana-3" with 1 partitions:
   partition 0, leader -1, replicas: 1, isrs: 1, Broker: Leader not available

However, Cruise Control still displays "No offline partitions":

[Image: Screenshot 2023-03-22 at 1 12 37 PM]

Proposed solution: as @CCisGG mentioned, we can check whether the leader is alive. Currently, isOffline is set to true when inSyncReplicas is an empty list (src). Instead, we could use the leader value on the PartitionInfo object (we'd need to check its possible values, but I think it would just be a check for whether it's empty, null, or -1).
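A rough sketch of what such a check could look like, covering the three cases guessed at above (null, empty, or -1). `BrokerNode` and `PartitionState` are hypothetical stand-ins written for this example, not Kafka's `Node` and `PartitionInfo`, and the real set of possible leader values would still need to be confirmed against Kafka.

```java
public class LeaderCheck {
    // Hypothetical stand-in for a broker node; not the real Kafka Node class.
    static final class BrokerNode {
        final int id;
        final String host;

        BrokerNode(int id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    // Hypothetical stand-in for partition metadata; the leader may be null.
    static final class PartitionState {
        final BrokerNode leader;

        PartitionState(BrokerNode leader) {
            this.leader = leader;
        }
    }

    // Treats a partition as offline when its leader is null, has id -1,
    // or is an "empty" node with no host (assumed cases, see the lead-in).
    static boolean isOffline(PartitionState partition) {
        BrokerNode leader = partition.leader;
        if (leader == null) {
            return true;
        }
        if (leader.id == -1) {
            return true;
        }
        return leader.host == null || leader.host.isEmpty();
    }

    public static void main(String[] args) {
        PartitionState healthy = new PartitionState(new BrokerNode(1, "broker-1"));
        PartitionState leaderless = new PartitionState(new BrokerNode(-1, ""));
        System.out.println("healthy offline?    " + isOffline(healthy));
        System.out.println("leaderless offline? " + isOffline(leaderless));
    }
}
```

Note that this check ignores the ISR entirely, so a partition like the kcat example above (leader -1, isrs: 1) would be reported as offline.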

@CCisGG Do you know whether this issue is already being tackled, or is it something I can pick up?

CCisGG commented 1 year ago

@ayana-s As far as I know, there is no ongoing work on this one. It would be great if you could help with it! Thank you!