Closed: Aaron-ML closed this issue 1 year ago.
The Kafka broker does not seem to be remediated; if I recreate the broker, it comes back online without issue. However, Cruise Control then refuses to do any further remediation because of logs on the broker showing isFuture: true.
I'm not sure this sounds like a Strimzi bug. Seems more like a bug in Cruise Control / Kafka.
Triaged on the community call on 24th August: looks like a Cruise Control bug, but we should wait for the next call to see what @ppatierno and @kyguy think before closing this.
Taking a look at the Kafka client code, currentReplicaLogDir can actually be null:
```java
static public class ReplicaLogDirInfo {
    // The current log directory of the replica of this partition on the given broker.
    // Null if no replica is found for this partition on the given broker.
    private final String currentReplicaLogDir;
    // ...
```
It sounds to me like Cruise Control is not handling this situation correctly: the NPE makes the operation fail, so the error is returned to the operator.
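For illustration, here is a minimal sketch of how a caller of the Kafka Admin API could guard against this. This is my own example, not Cruise Control's actual code; the class and helper names are hypothetical:

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.DescribeReplicaLogDirsResult.ReplicaLogDirInfo;
import org.apache.kafka.common.TopicPartitionReplica;

public class ReplicaLogDirCheck {

    // Hypothetical helper: report the current log dir of each replica,
    // skipping replicas for which the broker reports no log dir
    // (currentReplicaLogDir == null) instead of dereferencing the value
    // and triggering an NPE.
    static void printCurrentLogDirs(Admin admin, Collection<TopicPartitionReplica> replicas)
            throws InterruptedException, ExecutionException {
        Map<TopicPartitionReplica, ReplicaLogDirInfo> infos =
                admin.describeReplicaLogDirs(replicas).all().get();

        for (Map.Entry<TopicPartitionReplica, ReplicaLogDirInfo> entry : infos.entrySet()) {
            ReplicaLogDirInfo info = entry.getValue();
            if (info.getCurrentReplicaLogDir() == null) {
                // No replica found for this partition on this broker
                // (e.g. its log dir went offline mid-rebalance): skip it.
                System.out.println(entry.getKey() + ": no current log dir reported");
                continue;
            }
            String future = info.getFutureReplicaLogDir() != null
                    ? " (future: " + info.getFutureReplicaLogDir() + ")"
                    : "";
            System.out.println(entry.getKey() + ": " + info.getCurrentReplicaLogDir() + future);
        }
    }
}
```

The point is simply that currentReplicaLogDir is nullable by contract, so anything consuming describeReplicaLogDirs results needs an explicit null check before using it.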
Triaged on 7.9.2023: let's keep it open until the next call or until Kyle confirms the analysis described above.
Triaged on the community call on 5.10.2023: this does not seem to be a Strimzi issue and should be closed. If new reasons to investigate it in Strimzi show up, it can be reopened.
Bug Description
When a KafkaRebalance with rebalanceDisks is created and approved, Cruise Control attempts to execute the rebalance and errors out at random intervals. This also somehow knocks a Kafka broker log directory offline, and Strimzi does not remediate it.
Steps to reproduce
Create and approve a KafkaRebalance with rebalanceDisks: true (a minimal manifest is sketched below).
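For reference, a minimal manifest for this scenario might look like the following; the resource and cluster names are placeholders, and the actual file used is the attached kafkaRebalance.yaml:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: disk-rebalance              # placeholder name
  labels:
    strimzi.io/cluster: my-cluster  # placeholder: must match the Kafka CR name
spec:
  rebalanceDisks: true              # intra-broker (JBOD) disk rebalance
```

Once Cruise Control generates the proposal, it is approved by annotating the resource, e.g. kubectl annotate kafkarebalance disk-rebalance strimzi.io/rebalance=approve.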
Expected behavior
The rebalance should complete without killing brokers, and Strimzi should be able to detect the issue and remediate it afterwards.
Strimzi version
0.35.1
Kubernetes version
Kubernetes 1.26.6
Installation method
Helm chart
Infrastructure
Azure AKS
Configuration files and logs
Cruise Control log:
Kafka broker log:
Kafka CRD:
kafkaRebalance.yaml
Additional context
This doesn't always happen; I've hit it in 2 out of 5 test runs. Initially I thought I was tapping out the Azure disks, so I added a replication throttle (one way to set this is sketched below), but it does not seem to change the error behavior.
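One way to apply such a throttle in Strimzi is the KafkaRebalance spec.replicationThrottle field (an upper bound in bytes per second on replica-movement bandwidth); a sketch with an illustrative value, assuming the manifest from the steps above:

```yaml
spec:
  rebalanceDisks: true
  replicationThrottle: 10485760   # illustrative: cap replica movement at 10 MiB/s
```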
I've exec'd into the brokers and confirmed the disks do not lose connectivity inside the pods, so I'm unsure why the Kafka brokers think they are offline or why the Cruise Control executor stops processing.
Seems related to the behavior seen here: https://cloud-native.slack.com/archives/CMH3Q3SNP/p1683333344439769