strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0
4.8k stars 1.28k forks source link

[Bug]: HttpTimeoutException on initial StrimziPodSetController reconciliation #8531

Closed brsolomon-deloitte closed 1 year ago

brsolomon-deloitte commented 1 year ago

Bug Description

On the initial installation of Kafka using strimzi-kafka-operator, we are seeing an HttpTimeoutException: request timed out in the deployment/strimzi-kafka-operator logs. No StrimziPodSet are created (either for Kafka or Zookeeper).

Steps to reproduce

  1. Install strimzi-kafka-operator to the kafka-system namespace with following values.yaml:
resources:
  limits:
    cpu: 1000m
    memory: 512Mi
  requests:
    cpu: 200m
    memory: 512Mi
watchNamespaces:
  - "elastic"
$ kubectl get all -n kafka-system
NAME                                            READY   STATUS    RESTARTS   AGE
pod/strimzi-cluster-operator-6ffb44b5c8-qlgw8   1/1     Running   0          14h

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/strimzi-cluster-operator   1/1     1            1           6d20h

NAME                                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/strimzi-cluster-operator-6ffb44b5c8   1         1         1       23h
replicaset.apps/strimzi-cluster-operator-7b94df9bb6   0         0         0       6d20h

Install Kafka to elastic namespace:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  labels:
    app.kubernetes.io/instance: kafka
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kafka
    argocd.argoproj.io/instance: kafka
    helm.sh/chart: kafka-3.0.0
  name: REDACTED
  namespace: elastic
spec:
  cruiseControl: {}
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      auto.create.topics.enable: 'false'
      default.replication.factor: 2
      inter.broker.protocol.version: '3.4'
      log.retention.hours: 24
      num.partitions: 3
    jvmOptions:
      '-Xms': 1g
      '-Xmx': 1g
    listeners:
      - authentication:
          type: scram-sha-512
        name: internal
        port: 9092
        tls: false
        type: internal
      - authentication:
          type: scram-sha-512
        configuration:
          bootstrap:
            nodePort: 31094
          preferredNodePortAddressType: InternalIP
        name: external
        port: 9094
        tls: false
        type: nodeport
    livenessProbe:
      initialDelaySeconds: 60
    readinessProbe:
      initialDelaySeconds: 30
    replicas: 2
    resources:
      limits:
        cpu: 4000m
        memory: 32Gi
      requests:
        cpu: 100m
        memory: 4Gi
    storage:
      deleteClaim: true
      size: 32Gi
      type: persistent-claim
    version: 3.4.0
  zookeeper:
    jvmOptions:
      '-Xms': 1g
      '-Xmx': 1g
    replicas: 2
    storage:
      type: ephemeral

Expected behavior

Kafka cluster comes up cleanly, or strimzi logs give more of an indication of what HTTP request is failing.

Strimzi version

0.35.0

Kubernetes version

1.26

Installation method

Helm chart

Infrastructure

Amazon EKS

Configuration files and logs

Logs from kubectl logs --follow --tail=100 --timestamps -n kafka-system deployment.apps/strimzi-cluster-operator are below:

2023-05-19T12:18:34.399145504Z 2023-05-19 12:18:34 INFO  Ca:377 - Reconciliation #1(watch) Kafka(elastic/REDACTED): Generating certificate Subject(organizationName='io.strimzi', commonName='REDACTED-zookeeper', dnsNames=[REDACTED-zookeeper-client, REDACTED-zookeeper-client.elastic.svc.cluster.local, *.REDACTED-zookeeper-nodes.elastic.svc, *.REDACTED-zookeeper-nodes.elastic.svc.cluster.local, *.REDACTED-zookeeper-client.elastic.svc, *.REDACTED-zookeeper-client.elastic.svc.cluster.local, REDACTED-zookeeper-client.elastic, REDACTED-zookeeper-0.REDACTED-zookeeper-nodes.elastic.svc.cluster.local, REDACTED-zookeeper-client.elastic.svc, REDACTED-zookeeper-0.REDACTED-zookeeper-nodes.elastic.svc], ipAddresses=[]), signed by CA cluster-ca
2023-05-19T12:18:34.693563481Z 2023-05-19 12:18:34 INFO  Ca:377 - Reconciliation #1(watch) Kafka(elastic/REDACTED): Generating certificate Subject(organizationName='io.strimzi', commonName='REDACTED-zookeeper', dnsNames=[REDACTED-zookeeper-client, REDACTED-zookeeper-client.elastic.svc.cluster.local, *.REDACTED-zookeeper-nodes.elastic.svc, *.REDACTED-zookeeper-nodes.elastic.svc.cluster.local, REDACTED-zookeeper-1.REDACTED-zookeeper-nodes.elastic.svc.cluster.local, REDACTED-zookeeper-1.REDACTED-zookeeper-nodes.elastic.svc, *.REDACTED-zookeeper-client.elastic.svc, *.REDACTED-zookeeper-client.elastic.svc.cluster.local, REDACTED-zookeeper-client.elastic, REDACTED-zookeeper-client.elastic.svc], ipAddresses=[]), signed by CA cluster-ca
2023-05-19T12:18:35.188271941Z 2023-05-19 12:18:35 INFO  StrimziPodSetController:313 - Reconciliation #3(watch) StrimziPodSet(elastic/REDACTED-zookeeper): StrimziPodSet will be reconciled
2023-05-19T12:19:07.198550138Z 2023-05-19 12:19:07 INFO  ClusterOperator:139 - Triggering periodic reconciliation for namespace elastic
2023-05-19T12:19:07.199573486Z 2023-05-19 12:19:07 INFO  ClusterOperator:139 - Triggering periodic reconciliation for namespace kafka-system
2023-05-19T12:19:31.583741502Z 2023-05-19 12:19:31 INFO  AbstractOperator:380 - Reconciliation #1(watch) Kafka(elastic/REDACTED): Reconciliation is in progress
2023-05-19T12:20:31.582950196Z 2023-05-19 12:20:31 INFO  AbstractOperator:380 - Reconciliation #1(watch) Kafka(elastic/REDACTED): Reconciliation is in progress
2023-05-19T12:20:44.597284032Z 2023-05-19 12:20:44 ERROR StrimziPodSetController:344 - Reconciliation #3(watch) StrimziPodSet(elastic/REDACTED-zookeeper): StrimziPodSet REDACTED-zookeeper in namespace elastic reconciliation failed
2023-05-19T12:20:44.598332633Z io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
2023-05-19T12:20:44.599029363Z  at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129) ~[io.fabric8.kubernetes-client-api-6.6.1.jar:?]
2023-05-19T12:20:44.599639413Z  at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122) ~[io.fabric8.kubernetes-client-api-6.6.1.jar:?]
2023-05-19T12:20:44.600219539Z  at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:44) ~[io.fabric8.kubernetes-client-6.6.1.jar:?]
2023-05-19T12:20:44.600823594Z  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1112) ~[io.fabric8.kubernetes-client-6.6.1.jar:?]
2023-05-19T12:20:44.603888329Z  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:93) ~[io.fabric8.kubernetes-client-6.6.1.jar:?]
2023-05-19T12:20:44.603910406Z  at io.strimzi.operator.cluster.operator.assembly.StrimziPodSetController.maybeCreateOrPatchPod(StrimziPodSetController.java:405) ~[io.strimzi.cluster-operator-0.35.0.jar:0.35.0]
2023-05-19T12:20:44.603916630Z  at io.strimzi.operator.cluster.operator.assembly.StrimziPodSetController.reconcile(StrimziPodSetController.java:333) ~[io.strimzi.cluster-operator-0.35.0.jar:0.35.0]
2023-05-19T12:20:44.603921337Z  at io.strimzi.operator.cluster.operator.assembly.StrimziPodSetController.run(StrimziPodSetController.java:501) ~[io.strimzi.cluster-operator-0.35.0.jar:0.35.0]
2023-05-19T12:20:44.603925532Z  at java.lang.Thread.run(Thread.java:833) ~[?:?]
2023-05-19T12:20:44.603929922Z Caused by: java.io.IOException: request timed out
2023-05-19T12:20:44.603933548Z  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:514) ~[io.fabric8.kubernetes-client-6.6.1.jar:?]
2023-05-19T12:20:44.603948326Z  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:534) ~[io.fabric8.kubernetes-client-6.6.1.jar:?]
2023-05-19T12:20:44.603952734Z  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:342) ~[io.fabric8.kubernetes-client-6.6.1.jar:?]
2023-05-19T12:20:44.603968066Z  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:707) ~[io.fabric8.kubernetes-client-6.6.1.jar:?]
2023-05-19T12:20:44.603972435Z  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:93) ~[io.fabric8.kubernetes-client-6.6.1.jar:?]
2023-05-19T12:20:44.603976524Z  at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42) ~[io.fabric8.kubernetes-client-6.6.1.jar:?]
2023-05-19T12:20:44.603980075Z  ... 6 more
2023-05-19T12:20:44.603983764Z Caused by: java.net.http.HttpTimeoutException: request timed out
2023-05-19T12:20:44.603987646Z  at jdk.internal.net.http.ResponseTimerEvent.handle(ResponseTimerEvent.java:63) ~[java.net.http:?]
2023-05-19T12:20:44.603991598Z  at jdk.internal.net.http.HttpClientImpl.purgeTimeoutsAndReturnNextDeadline(HttpClientImpl.java:1270) ~[java.net.http:?]
2023-05-19T12:20:44.603995183Z  at jdk.internal.net.http.HttpClientImpl$SelectorManager.run(HttpClientImpl.java:899) ~[java.net.http:?]
2023-05-19T12:20:44.620736156Z 2023-05-19 12:20:44 INFO  StrimziPodSetController:349 - Reconciliation #3(watch) StrimziPodSet(elastic/REDACTED-zookeeper): reconciled
2023-05-19T12:20:44.639368768Z 2023-05-19 12:20:44 INFO  StrimziPodSetController:313 - Reconciliation #5(watch) StrimziPodSet(elastic/REDACTED-zookeeper): StrimziPodSet will be reconciled
2023-05-19T12:21:07.198391680Z 2023-05-19 12:21:07 INFO  ClusterOperator:139 - Triggering periodic reconciliation for namespace elastic
2023-05-19T12:21:07.272661146Z 2023-05-19 12:21:07 INFO  ClusterOperator:139 - Triggering periodic reconciliation for namespace kafka-system
2023-05-19T12:21:31.583891180Z 2023-05-19 12:21:31 INFO  AbstractOperator:380 - Reconciliation #1(watch) Kafka(elastic/REDACTED): Reconciliation is in progress
2023-05-19T12:22:31.583698382Z 2023-05-19 12:22:31 INFO  AbstractOperator:380 - Reconciliation #1(watch) Kafka(elastic/REDACTED): Reconciliation is in progress

...

2023-05-19T12:23:35.213244386Z 2023-05-19 12:23:35 INFO  CrdOperator:133 - Reconciliation #1(watch) Kafka(elastic/REDACTED): Status of Kafka REDACTED in namespace elastic has been updated
2023-05-19T12:23:35.217226932Z 2023-05-19 12:23:35 INFO  OperatorWatcher:38 - Reconciliation #8(watch) Kafka(elastic/REDACTED): Kafka REDACTED in namespace elastic was MODIFIED
2023-05-19T12:23:35.227045312Z 2023-05-19 12:23:35 WARN  AbstractOperator:525 - Reconciliation #1(watch) Kafka(elastic/REDACTED): Failed to reconcile
2023-05-19T12:23:35.227326962Z io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for StrimziPodSet resource REDACTED-zookeeper in namespace elastic to be ready
2023-05-19T12:23:35.227420693Z  at io.strimzi.operator.common.Util$1.lambda$handle$1(Util.java:167) ~[io.strimzi.operator-common-0.35.0.jar:0.35.0]
2023-05-19T12:23:35.228420384Z  at io.vertx.core.impl.future.FutureImpl$3.onFailure(FutureImpl.java:153) ~[io.vertx.vertx-core-4.3.8.jar:4.3.8]
2023-05-19T12:23:35.228459768Z  at io.vertx.core.impl.future.FutureBase.lambda$emitFailure$1(FutureBase.java:69) ~[io.vertx.vertx-core-4.3.8.jar:4.3.8]
2023-05-19T12:23:35.228464866Z  at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
2023-05-19T12:23:35.228467406Z  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
2023-05-19T12:23:35.228469782Z  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
2023-05-19T12:23:35.228472655Z  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[io.netty.netty-transport-4.1.87.Final.jar:4.1.87.Final]
2023-05-19T12:23:35.228474898Z  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
2023-05-19T12:23:35.228477309Z  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
2023-05-19T12:23:35.228479575Z  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.87.Final.jar:4.1.87.Final]
2023-05-19T12:23:35.228482052Z  at java.lang.Thread.run(Thread.java:833) ~[?:?]
2023-05-19T12:23:35.239583770Z 2023-05-19 12:23:35 INFO  AbstractOperator:239 - Reconciliation #8(watch) Kafka(elastic/REDACTED): Kafka REDACTED will be checked for creation or modification
2023-05-19T12:24:06.297926902Z 2023-05-19 12:24:06 INFO  StrimziPodSetController:313 - Reconciliation #9(watch) StrimziPodSet(elastic/REDACTED-zookeeper): StrimziPodSet will be reconciled
2023-05-19T12:24:35.228775986Z 2023-05-19 12:24:35 INFO  AbstractOperator:380 - Reconciliation #8(watch) Kafka(elastic/REDACTED): Reconciliation is in progress

Additional context

We have all TCP traffic allowed between EKS worker nodes.

brsolomon-deloitte commented 1 year ago

A kubectl describe -n elastic kafka showed:

Running ZooKeeper with two nodes is not advisable as both replicas will be needed to avoid downtime. It is recommended that a minimum of three replicas are used.

So, I have updated ZooKeeper replicas to 3. Closing.