Closed: rabii17 closed this issue 2 months ago
Please format the log to make it readable. You should probably also share your custom resources.
@ppatierno: I guess this is for you.
Thank you for your reply, I just added the custom resources and modified the logs format
Paolo is the expert on migration. But I wonder if ephemeral storage is the problem here. It means you start with an empty disk every time you roll the brokers, and that can cause all kinds of issues.
I'm not sure if this is an expected error (and if it is, it should be properly documented). But in general:
@scholzj It is a dev environment on which we want to start testing the migration and that's why we are using ephemeral storage.
The controller pod started as expected without any issues when we switched the annotation to migration. The provided logs are from the first rebooted broker.
Well, yeah -> the controller will work on the first restart as it is expected to have a brand new volume there. But it will likely have a problem on the next restart anyway, and it will lose all the data that is supposed to be migrated from the ZooKeeper cluster. The broker is not expected to be empty at this point, so that is likely why you have this issue. As I said, there might be things to improve, but migrating a cluster like this will probably never work.
I can confirm what Jakub already said. The migration isn't meant to be used for clusters based on ephemeral storage. It cannot work because of the need for a cluster ID and the related node formatting, which can't work when the storage is empty on each restart. This is true for both brokers and controllers. As Jakub said, controllers will start the first time, but they will then hit the same problem on the next restart, which happens later in the migration process. I guess it's not documented while it should be. I will find a good place to highlight it in the documentation.
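For reference, persistent storage in a `KafkaNodePool` avoids the empty-disk-on-restart problem described above. A minimal sketch (the pool name `broker`, cluster name `my-cluster`, and sizes are assumptions, not taken from this issue; this is not a drop-in conversion of an existing ephemeral pool):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: broker                      # assumed pool name
  labels:
    strimzi.io/cluster: my-cluster  # assumed cluster name
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim      # data survives pod restarts, unlike type: ephemeral
        size: 100Gi
        deleteClaim: false
```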
@rabii17 it seems this problem was fixed in 0.41.0 (while you are using 0.40.0). I tried with 1 controller using ephemeral (or persistent) storage and 3 brokers using ephemeral storage as well, and everything seems to work fine. Could you give it a try using the 0.41.0 release, please?
@rabii17 just to be precise: when using 1 controller with ephemeral storage, the rolling works without errors, but you are going to lose the metadata synced from ZooKeeper during migration whenever a controller rolling happens. So one controller with ephemeral storage can't work. The linked PR adds a note to the documentation making this clearer.
Triaged on 13/6/2024: discussed whether it would be useful to have at least a warning, or maybe to block a user who tries to migrate with just 1 controller using ephemeral storage. Keeping this open for the next community call.
Discussed on the community call on 10.7.2024: We document in multiple places that ephemeral storage is supposed to be used only for development and short-lived clusters in CIs etc. The migration actually works with multiple nodes using ephemeral storage; only the single ephemeral node is an issue. We should not increase the complexity for this.
Bug Description
During the migration, the KRaft controller is created; then, when the cluster starts to roll out, an error message appears:
```
The kafka configuration file appears to be for a legacy cluster. Formatting is only supported for clusters in KRaft mode.
```
Steps to reproduce
Change the `strimzi.io/kraft` annotation on the Kafka cluster from `disabled` to `migration`.
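The annotation switch that triggers the migration can be sketched as a single `kubectl` command (the cluster name `my-cluster` and namespace are assumptions; adjust to your environment). This is a config/ops fragment that requires a live cluster:

```shell
# Start the ZooKeeper-to-KRaft migration by moving the annotation
# from "disabled" to "migration" on the Kafka custom resource.
kubectl annotate kafka my-cluster -n kafka \
  strimzi.io/kraft="migration" --overwrite
```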
Expected behavior
Migration succeeded
Strimzi version
0.40.0
Kubernetes version
Kubernetes 1.27
Installation method
Helm Chart
Infrastructure
AKS
Kafka CR
Brokers KafkaNodePool CR
Controllers KafkaNodePool CR
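The original custom resources are not reproduced in this extract. A minimal controller `KafkaNodePool` using ephemeral storage, matching the single-controller setup discussed in this thread (names are assumptions), might look like:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controller                  # assumed pool name
  labels:
    strimzi.io/cluster: my-cluster  # assumed cluster name
spec:
  replicas: 1
  roles:
    - controller
  storage:
    type: ephemeral                 # wiped on every restart, which breaks migration
```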
Configuration files and logs