strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

NPE during installation #5325

Closed: shawkins closed this issue 2 years ago

shawkins commented 3 years ago

Describe the bug

During an install an NPE is seen, but eventually the install is successful.

2021-07-13 02:33:49 INFO  CrdOperator:108 - Status of Kafka cicdcluster in namespace kafka has been updated
2021-07-13 02:33:49 INFO  OperatorWatcher:40 - Reconciliation #10(watch) Kafka(kafka/cicdcluster): Kafka cicdcluster in namespace kafka was MODIFIED
2021-07-13 02:33:49 WARN  AbstractOperator:516 - Reconciliation #7(watch) Kafka(kafka/cicdcluster): Failed to reconcile
java.lang.NullPointerException: null
        at io.strimzi.operator.cluster.operator.resource.StatefulSetOperator.getSecrets(StatefulSetOperator.java:102) ~[io.strimzi.cluster-operator-0.23.0.redhat-00001.jar:0.23.0.redhat-00001]
        at io.strimzi.operator.cluster.operator.resource.StatefulSetOperator.maybeRollingUpdate(StatefulSetOperator.java:96) ~[io.strimzi.cluster-operator-0.23.0.redhat-00001.jar:0.23.0.redhat-00001]
        at io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperator$ReconciliationState.lambda$zkRollingUpdate$39(KafkaAssemblyOperator.java:1231) ~[io.strimzi.cluster-operator-0.23.0.redhat-00001.jar:0.23.0.redhat-00001]
        at io.vertx.core.impl.future.Composition.onSuccess(Composition.java:38) ~[io.vertx.vertx-core-4.0.3.redhat-00002.jar:4.0.3.redhat-00002]
        at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:62) ~[io.vertx.vertx-core-4.0.3.redhat-00002.jar:4.0.3.redhat-00002]
        at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:179) ~[io.vertx.vertx-core-4.0.3.redhat-00002.jar:4.0.3.redhat-00002]
        at io.vertx.core.impl.future.PromiseImpl.tryComplete(PromiseImpl.java:23) ~[io.vertx.vertx-core-4.0.3.redhat-00002.jar:4.0.3.redhat-00002]
        at io.vertx.core.impl.future.PromiseImpl.onSuccess(PromiseImpl.java:49) ~[io.vertx.vertx-core-4.0.3.redhat-00002.jar:4.0.3.redhat-00002]
        at io.vertx.core.impl.future.FutureBase.lambda$emitSuccess$0(FutureBase.java:54) ~[io.vertx.vertx-core-4.0.3.redhat-00002.jar:4.0.3.redhat-00002]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [io.netty.netty-common-4.1.60.Final-redhat-00001.jar:4.1.60.Final-redhat-00001]

To Reproduce

Steps to reproduce the behavior (see the sketch after this list for one way to script the loop):

  1. Create Kafka x in some namespace.
  2. Wait for x to be ready.
  3. Perform a foreground deletion of x and wait for the Kafka resource to be gone.
  4. Create another Kafka x in the same namespace.
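
For reference, a minimal sketch of scripting this loop with the fabric8 Kubernetes client. The class and file names are illustrative, the timeouts are arbitrary, and it assumes a recent fabric8 client plus the Strimzi api module's Kafka model class; it is not the actual test code that hit the NPE.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.DeletionPropagation;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.Resource;
import io.fabric8.kubernetes.client.utils.Serialization;
import io.strimzi.api.kafka.model.Kafka;

public class RecreateKafkaSketch {
    public static void main(String[] args) throws Exception {
        // kafka-cicdcluster.yaml is assumed to hold just the Kafka spec (no status/resourceVersion)
        Kafka kafka;
        try (InputStream yaml = Files.newInputStream(Path.of("kafka-cicdcluster.yaml"))) {
            kafka = Serialization.unmarshal(yaml, Kafka.class);
        }
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            Resource<Kafka> op = client.resources(Kafka.class)
                    .inNamespace("kafka").withName("cicdcluster");

            // 1. + 2. create Kafka x and wait until Strimzi reports a Ready condition
            client.resources(Kafka.class).inNamespace("kafka").resource(kafka).create();
            op.waitUntilCondition(k -> k != null && k.getStatus() != null
                    && k.getStatus().getConditions() != null
                    && k.getStatus().getConditions().stream().anyMatch(c ->
                            "Ready".equals(c.getType()) && "True".equals(c.getStatus())),
                    15, TimeUnit.MINUTES);

            // 3. foreground deletion, then wait until the Kafka resource is really gone
            op.withPropagationPolicy(DeletionPropagation.FOREGROUND).delete();
            op.waitUntilCondition(Objects::isNull, 15, TimeUnit.MINUTES);

            // 4. create another Kafka with the same name in the same namespace
            client.resources(Kafka.class).inNamespace("kafka").resource(kafka).create();
        }
    }
}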

Expected behavior

Ideally the NPE would not occur.

YAML files and logs

Can be provided if needed.

scholzj commented 3 years ago

Can you please share the full logs and your configuration, e.g. what the Kafka CRs look like?

scholzj commented 3 years ago

PS: You should use different clusters (with different names) for things such as CI/CD, not create a cluster with the same name again and again.

shawkins commented 3 years ago

Not create a cluster with the same name again and again.

Yes, we'll certainly do that. Just wanted to capture the issue.

shawkins commented 3 years ago

Example Kafka CR:

apiVersion: "kafka.strimzi.io/v1beta2"
kind: "Kafka"
metadata:
  creationTimestamp: "2021-07-20T20:18:35Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: "kas-fleetshard-operator"
    ingressType: "sharded"
    managedkafka.bf2.org/strimziVersion: "strimzi-cluster-operator.v0.23.0-0"
  name: "cicdcluster"
  namespace: "kafka"
  ownerReferences:
  - apiVersion: "managedkafka.bf2.org/v1alpha1"
    kind: "ManagedKafka"
    name: "cicdcluster"
    uid: "e10ea122-baa5-41c9-a378-c191ad46ac29"
  resourceVersion: "475225"
  selfLink: "/apis/kafka.strimzi.io/v1beta2/namespaces/kafka/kafkas/cicdcluster"
  uid: "16a29bd4-345b-4261-84c0-40258e8707b9"
spec:
  kafka:
    version: "2.7.0"
    replicas: 3
    listeners:
    - name: "tls"
      port: 9093
      type: "internal"
      tls: true
    - name: "external"
      port: 9094
      type: "route"
      tls: true
      configuration:
        brokerCertChainAndKey:
          secretName: "cicdcluster-tls-secret"
          certificate: "tls.crt"
          key: "tls.key"
        bootstrap:
          host: "cicdcluster-kafka-bootstrap-kafka.apps.shawkins-kafka.johv.s1.devshift.org"
        brokers:
        - broker: 0
          host: "broker-0-cicdcluster-kafka-bootstrap-kafka.apps.shawkins-kafka.johv.s1.devshift.org"
        - broker: 1
          host: "broker-1-cicdcluster-kafka-bootstrap-kafka.apps.shawkins-kafka.johv.s1.devshift.org"
        - broker: 2
          host: "broker-2-cicdcluster-kafka-bootstrap-kafka.apps.shawkins-kafka.johv.s1.devshift.org"
        maxConnections: 166
        maxConnectionCreationRate: 33
    - name: "oauth"
      port: 9095
      type: "internal"
      tls: false
    - name: "sre"
      port: 9096
      type: "internal"
      tls: false
    config:
      auto.create.topics.enable: "false"
      default.replication.factor: 3
      inter.broker.protocol.version: "2.7.0"
      leader.imbalance.per.broker.percentage: 0
      log.message.format.version: "2.7.0"
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      ssl.enabled.protocols: "TLSv1.3,TLSv1.2"
      ssl.protocol: "TLS"
      strimzi.authorization.global-authorizer.acl.1: "permission=allow;topic=*;operations=all"
      strimzi.authorization.global-authorizer.acl.2: "permission=allow;group=*;operations=all"
      strimzi.authorization.global-authorizer.acl.3: "permission=allow;transactional_id=*;operations=all"
      strimzi.authorization.global-authorizer.allowed-listeners: "TLS-9093,SRE-9096"
      transaction.state.log.min.isr: 2
      transaction.state.log.replication.factor: 3
    storage:
      volumes:
      - type: "persistent-claim"
        size: "238609294222"
        class: "gp2"
        deleteClaim: true
        id: 0
      type: "jbod"
    authorization:
      type: "custom"
      authorizerClass: "io.bf2.kafka.authorizer.GlobalAclAuthorizer"
    rack:
      topologyKey: "topology.kubernetes.io/zone"
    jvmOptions:
      "-Xmx": "3G"
      "-Xms": "3G"
      "-XX":
        ExitOnOutOfMemoryError: "true"
    resources:
      limits:
        cpu: "2500m"
        memory: "11Gi"
      requests:
        cpu: "2500m"
        memory: "11Gi"
    metricsConfig:
      type: "jmxPrometheusExporter"
      valueFrom:
        configMapKeyRef:
          key: "jmx-exporter-config"
          name: "cicdcluster-kafka-metrics"
    logging:
      type: "external"
      valueFrom:
        configMapKeyRef:
          key: "log4j.properties"
          name: "cicdcluster-kafka-logging"
          optional: false
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: "kafka"
              topologyKey: "kubernetes.io/hostname"
      podDisruptionBudget:
        maxUnavailable: 0
  zookeeper:
    replicas: 3
    storage:
      type: "persistent-claim"
      size: "10Gi"
      class: "gp2"
      deleteClaim: true
    jvmOptions:
      "-Xmx": "1G"
      "-Xms": "1G"
      "-XX":
        ExitOnOutOfMemoryError: "true"
    resources:
      limits:
        cpu: "1000m"
        memory: "4Gi"
      requests:
        cpu: "1000m"
        memory: "4Gi"
    metricsConfig:
      type: "jmxPrometheusExporter"
      valueFrom:
        configMapKeyRef:
          key: "jmx-exporter-config"
          name: "cicdcluster-zookeeper-metrics"
    logging:
      type: "external"
      valueFrom:
        configMapKeyRef:
          key: "log4j.properties"
          name: "cicdcluster-zookeeper-logging"
          optional: false
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: "kubernetes.io/hostname"
            - topologyKey: "topology.kubernetes.io/zone"
      podDisruptionBudget:
        maxUnavailable: 0
  kafkaExporter:
    resources:
      limits:
        cpu: "1000m"
        memory: "256Mi"
      requests:
        cpu: "500m"
        memory: "128Mi"
    template:
      pod:
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  strimzi.io/name: "cicdcluster-zookeeper"
              topologyKey: "kubernetes.io/hostname"
status:
  conditions:
  - type: "NotReady"
    status: "True"
    lastTransitionTime: "2021-07-20T20:18:36.114Z"
    reason: "Creating"
    message: "Kafka cluster is being deployed"
  observedGeneration: 0
kind: "Kafka"

Based on the stack trace, at that point either the StatefulSet or its labels are null. I'll be able to get a Strimzi log if needed - but this is definitely low priority as it's only occurring during test runs.
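
For illustration only, here is a minimal sketch of the kind of guard that would avoid the NPE, assuming getSecrets reads the StatefulSet's labels to look up secrets. The SecretLister interface and all method names below are hypothetical, not Strimzi's actual code.

import java.util.List;
import java.util.Map;

import io.fabric8.kubernetes.api.model.Secret;
import io.fabric8.kubernetes.api.model.apps.StatefulSet;

public class SecretsLookupSketch {

    // If the StatefulSet has not been (re)created yet after the foreground deletion,
    // sts itself or its labels map can be null; guarding both avoids the NPE above.
    static List<Secret> getSecretsGuarded(StatefulSet sts, SecretLister lister) {
        if (sts == null || sts.getMetadata() == null || sts.getMetadata().getLabels() == null) {
            // Return an empty list (or fail the reconciliation cleanly) so the next
            // reconciliation can retry once the StatefulSet exists again.
            return List.of();
        }
        Map<String, String> labels = sts.getMetadata().getLabels();
        return lister.listByLabels(labels);
    }

    // Hypothetical stand-in for whatever secret lookup the operator performs.
    interface SecretLister {
        List<Secret> listByLabels(Map<String, String> labelSelector);
    }
}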

scholzj commented 3 years ago

I think a log would be needed to understand what exactly was happening ... ideally at DEBUG level.

shawkins commented 2 years ago

I think this can be closed, as we are using different instance names now and I was unsuccessful in my last attempt to recreate the issue.