strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: KafkaRebalance diskbalance request fails and bricks broker #9015

Closed: Aaron-ML closed this issue 1 year ago

Aaron-ML commented 1 year ago

Bug Description

When creating a rebalanceDisk KafkaRebalance request for Cruise Control and approving it, Cruise Control attempts to execute it and errors out at random intervals. This also somehow knocks a Kafka broker log dir offline, and Strimzi does not remediate it.

Steps to reproduce

  1. Create a KafkaRebalance request with rebalanceDisk: true and approve it (see the sketch after this list).
  2. Watch the brokers start to move data between JBOD disks.
  3. Cruise Control errors out (log below) and subsequently a broker disk disappears from the broker service, even though the log dir is online in the pod without any issues.
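For reference, a minimal KafkaRebalance manifest for an intra-broker (disk) rebalance looks roughly like this; the resource name is a placeholder, the cluster label matches the Kafka CR below. Once the proposal is ready, it is approved by annotating the resource with strimzi.io/rebalance=approve.

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: plat-rebalance-test
  namespace: kafka
  labels:
    strimzi.io/cluster: plat
spec:
  # Move partitions between the JBOD disks of each broker instead of between brokers
  rebalanceDisk: true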

Expected behavior

The rebalance should complete without killing brokers, and Strimzi should be able to detect the issue and remediate it afterwards.

Strimzi version

0.35.1

Kubernetes version

Kubernetes 1.26.6

Installation method

Helm chart

Infrastructure

Azure AKS

Configuration files and logs

Cruise Control log:

2023-08-22 00:37:09 ERROR Executor:1224 - Executor got exception during execution
java.lang.NullPointerException: Cannot invoke "String.equals(Object)" because the return value of "org.apache.kafka.clients.admin.DescribeReplicaLogDirsResult$ReplicaLogDirInfo.getCurrentReplicaLogDir()" is null
    at com.linkedin.kafka.cruisecontrol.executor.ExecutionUtils.isIntraBrokerReplicaActionDone(ExecutionUtils.java:696) ~[cruise-control-2.5.112.jar:?]
    at com.linkedin.kafka.cruisecontrol.executor.Executor$ProposalExecutionRunnable.waitForIntraBrokerReplicaTasksToFinish(Executor.java:1693) ~[cruise-control-2.5.112.jar:?]
    at com.linkedin.kafka.cruisecontrol.executor.Executor$ProposalExecutionRunnable.intraBrokerMoveReplicas(Executor.java:1419) ~[cruise-control-2.5.112.jar:?]
    at com.linkedin.kafka.cruisecontrol.executor.Executor$ProposalExecutionRunnable.execute(Executor.java:1203) ~[cruise-control-2.5.112.jar:?]
    at com.linkedin.kafka.cruisecontrol.executor.Executor$ProposalExecutionRunnable.run(Executor.java:1112) ~[cruise-control-2.5.112.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
2023-08-22 00:37:09 WARN  operationLogger:1259 - Task [0cb25ee8-cc22-43e8-97a8-5ad2d8e94079] userPOST /kafkacruisecontrol/rebalance execution is interrupted with exception Cannot invoke "String.equals(Object)" because the return value of "org.apache.kafka.clients.admin.DescribeReplicaLogDirsResult$ReplicaLogDirInfo.getCurrentReplicaLogDir()" is null.

Kafka broker log:

2023-08-22 00:36:59,909 ERROR Error while flushing log for connect-cluster-status-1 in dir /var/lib/kafka/data-0/kafka-log1 with offset 39 (exclusive) and recovery point 39 (kafka.server.LogDirFailureChannel) [kafka-scheduler-9]
java.nio.file.NoSuchFileException: /var/lib/kafka/data-0/kafka-log1/connect-cluster-status-1.48edde1460ca4c3aae142d4673f25c62-future
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357)
    at org.apache.kafka.common.utils.Utils.flushDir(Utils.java:972)
    at kafka.log.LocalLog.flush(LocalLog.scala:177)
    at kafka.log.UnifiedLog.$anonfun$flush$2(UnifiedLog.scala:1585)
    at kafka.log.UnifiedLog.flush(UnifiedLog.scala:1772)
    at kafka.log.UnifiedLog.flushUptoOffsetExclusive(UnifiedLog.scala:1566)
    at kafka.log.UnifiedLog.$anonfun$roll$1(UnifiedLog.scala:1547)
    at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:116)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
2023-08-22 00:36:59,909 ERROR Uncaught exception in scheduled task 'flush-log' (kafka.utils.KafkaScheduler) [kafka-scheduler-9]
org.apache.kafka.common.errors.KafkaStorageException: Error while flushing log for connect-cluster-status-1 in dir /var/lib/kafka/data-0/kafka-log1 with offset 39 (exclusive) and recovery point 39
Caused by: java.nio.file.NoSuchFileException: /var/lib/kafka/data-0/kafka-log1/connect-cluster-status-1.48edde1460ca4c3aae142d4673f25c62-future
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
    at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357)
    at org.apache.kafka.common.utils.Utils.flushDir(Utils.java:972)
    at kafka.log.LocalLog.flush(LocalLog.scala:177)
    at kafka.log.UnifiedLog.$anonfun$flush$2(UnifiedLog.scala:1585)
    at kafka.log.UnifiedLog.flush(UnifiedLog.scala:1772)
    at kafka.log.UnifiedLog.flushUptoOffsetExclusive(UnifiedLog.scala:1566)
    at kafka.log.UnifiedLog.$anonfun$roll$1(UnifiedLog.scala:1547)
    at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:116)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
2023-08-22 00:36:59,951 WARN Stopping serving logs in dir /var/lib/kafka/data-0/kafka-log1 (kafka.log.LogManager) [LogDirFailureHandler]
2023-08-22 00:36:59,951 WARN Stopping cleaning logs in dir /var/lib/kafka/data-0/kafka-log1 (kafka.log.LogCleaner) [LogDirFailureHandler]

2023-08-22 00:54:28,678 WARN [ReplicaManager broker=1] Unable to describe replica dirs for /var/lib/kafka/data-0/kafka-log1 (kafka.server.ReplicaManager) [data-plane-kafka-request-handler-0]
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/lib/kafka/data-0/kafka-log1 is offline
2023-08-22 00:54:42,270 WARN [ReplicaManager broker=1] Unable to describe replica dirs for /var/lib/kafka/data-0/kafka-log1 (kafka.server.ReplicaManager) [data-plane-kafka-request-handler-1]
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/lib/kafka/data-0/kafka-log1 is offline
2023-08-22 00:54:58,679 WARN [ReplicaManager broker=1] Unable to describe replica dirs for /var/lib/kafka/data-0/kafka-log1 (kafka.server.ReplicaManager) [data-plane-kafka-request-handler-0]
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/lib/kafka/data-0/kafka-log1 is offline
2023-08-22 00:55:28,680 WARN [ReplicaManager broker=1] Unable to describe replica dirs for /var/lib/kafka/data-0/kafka-log1 (kafka.server.ReplicaManager) [data-plane-kafka-request-handler-1]
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/lib/kafka/data-0/kafka-log1 is offline
2023-08-22 00:55:58,681 WARN [ReplicaManager broker=1] Unable to describe replica dirs for /var/lib/kafka/data-0/kafka-log1 (kafka.server.ReplicaManager) [data-plane-kafka-request-handler-2]
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/lib/kafka/data-0/kafka-log1 is offline

Kafka CR:

apiVersion: v1
items:
- apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  metadata:
    annotations:
      meta.helm.sh/release-name: kafka
      meta.helm.sh/release-namespace: kafka
    creationTimestamp: "2023-06-22T17:10:33Z"
    generation: 20
    labels:
      app.kubernetes.io/instance: kafka
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: kafka
      app.kubernetes.io/version: "1.0"
      helm.sh/chart: kafka-1.0
    name: plat
    namespace: kafka
    resourceVersion: "439730229"
    uid: b9a59a65-2261-4297-9114-98008ce3142c
  spec:
    cruiseControl:
      config:
        default.replication.throttle: 26214400
        webserver.security.enable: false
      resources:
        limits:
          cpu: "4"
          memory: 8Gi
    entityOperator:
      topicOperator: {}
      userOperator: {}
    kafka:
      config:
        compression.type: zstd
        default.replication.factor: 3
        inter.broker.protocol.version: 3.4
        message.max.bytes: 4194304
        min.insync.replicas: 2
        num.replica.fetchers: 4
        offsets.topic.replication.factor: 3
        replica.fetch.max.bytes: 4194304
        transaction.state.log.min.isr: 2
        transaction.state.log.replication.factor: 3
      listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
      - name: external
        port: 9094
        tls: false
        type: loadbalancer
      metricsConfig:
        type: jmxPrometheusExporter
        valueFrom:
          configMapKeyRef:
            key: kafka-metrics-config.yml
            name: kafka-metrics
      replicas: 3
      resources:
        limits:
          cpu: 8
          memory: 64Gi
      storage:
        type: jbod
        volumes:
        - deleteClaim: false
          id: 0
          size: 1000Gi
          type: persistent-claim
        - deleteClaim: false
          id: 1
          size: 1000Gi
          type: persistent-claim
      template:
        externalBootstrapService:
          metadata:
            annotations:
              service.beta.kubernetes.io/azure-load-balancer-internal: "true"
        perPodService:
          metadata:
            annotations:
              service.beta.kubernetes.io/azure-load-balancer-internal: "true"
      version: 3.4.0
    zookeeper:
      metricsConfig:
        type: jmxPrometheusExporter
        valueFrom:
          configMapKeyRef:
            key: zookeeper-metrics-config.yml
            name: kafka-metrics
      replicas: 3
      resources:
        limits:
          cpu: 4
          memory: 8Gi
        requests:
          cpu: 2
          memory: 8Gi
      storage:
        deleteClaim: false
        size: 150Gi
        type: persistent-claim
  status:
    clusterId: -nlIiyTjRFqUXeaIfBOb4A
    conditions:
    - lastTransitionTime: "2023-08-21T19:57:21.498936320Z"
      status: "True"
      type: Ready
    listeners:
    - addresses:
      - host: plat-kafka-bootstrap.kafka.svc
        port: 9092
      bootstrapServers: plat-kafka-bootstrap.kafka.svc:9092
      name: plain
      type: plain
    - addresses:
      - host: plat-kafka-bootstrap.kafka.svc
        port: 9093
      bootstrapServers: plat-kafka-bootstrap.kafka.svc:9093
      certificates:
      - |
       redacted
      name: tls
      type: tls
    - addresses:
      - host: 10.177.1.4
        port: 9094
      bootstrapServers: 10.177.1.4:9094
      name: external
      type: external
    observedGeneration: 20
kind: List
metadata:
  resourceVersion: ""

kafkaRebalance.yaml

apiVersion: v1
items:
- apiVersion: kafka.strimzi.io/v1beta2
  kind: KafkaRebalance
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"kafka.strimzi.io/v1beta2","kind":"KafkaRebalance","metadata":{"annotations":{},"labels":{"strimzi.io/cluster":"plat"},"name":"plat-rebalance-test-12","namespace":"kafka"},"spec":{"rebalanceDisk":true}}
    creationTimestamp: "2023-08-22T00:34:09Z"
    generation: 1
    labels:
      strimzi.io/cluster: plat
    name: plat-rebalance-test-12
    namespace: kafka
    resourceVersion: "439893430"
    uid: 23e12c9d-cbc0-4336-adab-a7049aa2acec
  spec:
    rebalanceDisk: true
  status:
    conditions:
    - lastTransitionTime: "2023-08-22T00:39:01.433694040Z"
      status: "True"
      type: NotReady
    observedGeneration: 1
    sessionId: 0cb25ee8-cc22-43e8-97a8-5ad2d8e94079
kind: List
metadata:
  resourceVersion: ""

Additional context

This doesn't always seem to happen; I've had it happen 2 out of 5 times during tests. Initially I thought I was tapping out the Azure disks, so I added a replication throttle, but it does not seem to change the error behavior.

I've exec'd into the brokers and confirmed the disks do not lose connectivity in the pods, so I'm unsure why the Kafka brokers think they are offline, or why the Cruise Control executor stops processing.
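For what it's worth, the broker's own view of its log directories can also be cross-checked from outside the pod through the Kafka Admin API. A rough sketch (the class name is made up; the bootstrap address and broker id are taken from the resources above):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.LogDirDescription;

public class LogDirCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "plat-kafka-bootstrap.kafka.svc:9092");
        try (Admin admin = Admin.create(props)) {
            // Ask broker 1 (the one logging "Log directory ... is offline") for its log dir state
            Map<Integer, Map<String, LogDirDescription>> dirs =
                    admin.describeLogDirs(Collections.singleton(1)).allDescriptions().get();
            // error() is null for a healthy dir and a KafkaStorageException for an offline one
            dirs.forEach((broker, byDir) -> byDir.forEach((dir, desc) ->
                    System.out.println("broker " + broker + " " + dir + " error=" + desc.error())));
        }
    }
}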

Seems related to the behavior seen here: https://cloud-native.slack.com/archives/CMH3Q3SNP/p1683333344439769

Aaron-ML commented 1 year ago

The Kafka broker does not seem to be remediated; however, if I recreate the broker it comes back online without issue. Cruise Control will then refuse to do any further rebalancing because of logs that have isFuture: true on the broker.

scholzj commented 1 year ago

I'm not sure this sounds like a Strimzi bug. Seems more like a bug in Cruise Control / Kafka.

scholzj commented 1 year ago

Triaged on the community call on 24th August: Looks like a Cruise Control bug, but we should wait for the next call to see what @ppatierno and @kyguy think before closing this.

ppatierno commented 1 year ago

Taking a look at the Kafka client code, currentReplicaLogDir can actually be null ...

static public class ReplicaLogDirInfo {
        // The current log directory of the replica of this partition on the given broker.
        // Null if no replica is not found for this partition on the given broker.
        private final String currentReplicaLogDir;

It sounds to me like Cruise Control is not handling this situation correctly. Due to the NPE it fails the operation, so the error is returned to the operator.
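For illustration, the kind of null guard that would avoid the NPE would look roughly like this; it is a simplified stand-in for Cruise Control's ExecutionUtils.isIntraBrokerReplicaActionDone, and the class, method, and parameter names here are made up:

import java.util.Map;
import org.apache.kafka.clients.admin.DescribeReplicaLogDirsResult.ReplicaLogDirInfo;
import org.apache.kafka.common.TopicPartitionReplica;

class IntraBrokerMoveCheck {
    // Treat the move as not finished when the broker reports no current log dir for the
    // replica, instead of calling equals() on the possibly-null value as the stack trace shows.
    static boolean isReplicaOnTargetLogDir(Map<TopicPartitionReplica, ReplicaLogDirInfo> infoByReplica,
                                           TopicPartitionReplica replica,
                                           String targetLogDir) {
        ReplicaLogDirInfo info = infoByReplica.get(replica);
        if (info == null || info.getCurrentReplicaLogDir() == null) {
            return false;
        }
        return targetLogDir.equals(info.getCurrentReplicaLogDir());
    }
}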

scholzj commented 1 year ago

Triaged on 7.9.2023:

Let's keep it open until the next call or until Kyle confirms the analysis described above.

scholzj commented 1 year ago

Triaged on community call on 5.10.2023: Does not seem to be a Strimzi issue and should be closed. If new reasons show up to investigate it in Strimzi, it can be reopened.