Closed: CCisGG closed this 1 year ago
👋 We've also experienced this behaviour, where Cruise Control doesn't accurately detect and display offline partitions.
Before creating offline partitions:
After creating offline partitions, when describing our topic's partitions with kcat:
topic "test-ayana-3" with 1 partitions:
partition 0, leader -1, replicas: 1, isrs: 1, Broker: Leader not available
However, Cruise Control still displays "No offline partitions":
Proposed solution: as @CCisGG mentioned, we can check whether the leader is alive. Currently, isOffline is set to true when inSyncReplicas is an empty list (src). Instead, we could use the leader value on the PartitionInfo object (we'd need to check the possible values for this, but I think it'd just be a check for whether it's empty, null, or -1).
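A minimal sketch of that check, assuming a simplified stand-in for PartitionInfo. The PartitionView and OfflineCheck classes and their fields are hypothetical and exist only for illustration. In the real Kafka client, PartitionInfo.leader() returns a Node rather than a broker id, so the exact condition would need to be confirmed against the Kafka version in use.

```java
import java.util.List;

// Hypothetical stand-in for the PartitionInfo fields used in this discussion.
final class PartitionView {
    final Integer leaderId;              // null or -1 when no leader is available
    final List<Integer> inSyncReplicas;  // broker ids currently in the ISR

    PartitionView(Integer leaderId, List<Integer> inSyncReplicas) {
        this.leaderId = leaderId;
        this.inSyncReplicas = inSyncReplicas;
    }
}

final class OfflineCheck {
    private OfflineCheck() {}

    // Behaviour described above: offline only when the ISR is empty.
    static boolean isOfflineByIsr(PartitionView p) {
        return p.inSyncReplicas.isEmpty();
    }

    // Proposed behaviour: offline whenever there is no usable leader id.
    static boolean isOfflineByLeader(PartitionView p) {
        return p.leaderId == null || p.leaderId < 0;
    }
}
```

For the kcat output above (leader -1, isrs: 1), isOfflineByIsr returns false while isOfflineByLeader returns true, which matches the mismatch being reported.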
@CCisGG Would you know if this issue is being tackled, or is this something I can pick up?
@ayana-s as far as I know there is no ongoing work for this one. It would be great if you can help with it! Thank you!
Today, Cruise Control determines whether a partition is offline by checking whether its ISR set is empty:
https://github.com/linkedin/cruise-control/blob/a8a190f7c4662e1fc742994792c75b108fd9064d/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/servlet/response/ClusterPartitionState.java#L105
While this works correctly most of the time, it is not consistent with the definition used inside Kafka. In Kafka, an offline partition is a partition whose leader is not alive.
Recently we detected an issue with a partition in a bad state:
{"controller_epoch":10000,"leader":-1,"version":1,"leader_epoch":50,"isr":[12345]}
Cruise Control does not detect this as an offline partition, while the Kafka metric does show it as offline. Cruise Control only treats it as a URP.
I think Cruise Control should use the same criterion, "whether the leader is alive", to determine whether a partition is offline.
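To make the difference between the two criteria concrete, here is a rough sketch that classifies the partition state shown above under each rule. The PartitionStateSketch class, its method names, the replica list, and the set of alive brokers are all assumptions for illustration; the state JSON above does not include the replica assignment, and this is not Cruise Control's actual code.

```java
import java.util.List;
import java.util.Set;

final class PartitionStateSketch {
    enum Status { OFFLINE, UNDER_REPLICATED, HEALTHY }

    private PartitionStateSketch() {}

    // ISR-based rule: offline only when no replica is in sync.
    static Status classifyByIsr(List<Integer> isr, List<Integer> replicas) {
        if (isr.isEmpty()) {
            return Status.OFFLINE;
        }
        return isr.size() < replicas.size() ? Status.UNDER_REPLICATED : Status.HEALTHY;
    }

    // Leader-based rule: offline when the leader is -1 or not an alive broker.
    static Status classifyByLeader(int leader, List<Integer> isr,
                                   List<Integer> replicas, Set<Integer> aliveBrokers) {
        boolean leaderAlive = leader >= 0 && aliveBrokers.contains(leader);
        if (!leaderAlive) {
            return Status.OFFLINE;
        }
        return isr.size() < replicas.size() ? Status.UNDER_REPLICATED : Status.HEALTHY;
    }
}
```

With leader -1, isr [12345], and an assumed replica list of [12345, 12346], classifyByIsr yields UNDER_REPLICATED and classifyByLeader yields OFFLINE, which is the discrepancy described in this issue.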