strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug] Kafka cluster deployment is failing with error Failed to connect to Zookeeper #3692

Closed: nuthanbn closed this issue 4 years ago

nuthanbn commented 4 years ago

Describe the bug

I am trying to deploy the Strimzi Kafka solution on Kubernetes using helm3. A simple Kafka cluster deployment with a replica count of 1 has no issues, but when the replica count is increased to 3 the Kafka cluster deployment fails with the following error. Below is the output of "kubectl -n kafka-poc describe kafka nuthan-kafka".

I have tried Strimzi Cluster Operator 0.19.0 and 0.18.0 and see the same issue with both.

    Status:
      Conditions:
        Last Transition Time:  2020-09-22T18:50:41+0000
        Message:               Failed to connect to Zookeeper nuthan-kafka-zookeeper-0.nuthan-kafka-zookeeper-nodes.kafka-poc.svc:2181,nuthan-kafka-zookeeper-1.nuthan-kafka-zookeeper-nodes.kafka-poc.svc:2181,nuthan-kafka-zookeeper-2.nuthan-kafka-zookeeper-nodes.kafka-poc.svc:2181. Connection was not ready in 300000 ms.
        Reason:                ZookeeperScalingException
        Status:                True
        Type:                  NotReady
      Observed Generation:     1
    Events:

To Reproduce Steps to reproduce the behavior:

  1. Create a new namespace called strimzi-poc for installing the Strimzi operator.

  2. Create a new namespace called kafka-poc for deploying the Kafka cluster.

  3. Install the Strimzi Cluster Operator using helm3:

         helm repo add strimzi https://strimzi.io/charts/
         helm inspect values strimzi/strimzi-kafka-operator > values.yaml
         helm -n strimzi-poc install strimzi strimzi/strimzi-kafka-operator --values values.yaml

  4. Deploy the Kafka cluster in the kafka-poc namespace:

         apiVersion: kafka.strimzi.io/v1beta1
         kind: Kafka
         metadata:
           name: nuthan-kafka
           namespace: kafka-poc
         spec:
           kafka:
             replicas: 3
             version: 2.4.1
             listeners:
               plain: {}
               tls: {}
             config:
               offsets.topic.replication.factor: 3
               transaction.state.log.replication.factor: 3
               transaction.state.log.min.isr: 2
               log.message.format.version: "2.4"
             storage:
               type: ephemeral
           zookeeper:
             livenessProbe:
               initialDelaySeconds: 60
               timeoutSeconds: 5
             readinessProbe:
               initialDelaySeconds: 60
               timeoutSeconds: 5
             replicas: 3
             storage:
               type: ephemeral
           entityOperator:
             topicOperator: {}
             userOperator: {}

  5. ZooKeeper deploys successfully, but the Kafka cluster deployment fails after 5 minutes with the error shown above.

    Every 2.0s: kubectl -n kafka-poc get all                            Tue Sep 22 13:20:04 2020

    NAME                           READY   STATUS    RESTARTS   AGE
    pod/nuthan-kafka-zookeeper-0   1/1     Running   0          35m
    pod/nuthan-kafka-zookeeper-1   1/1     Running   0          35m
    pod/nuthan-kafka-zookeeper-2   1/1     Running   0          35m

    NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
    service/nuthan-kafka-zookeeper-client   ClusterIP   10.233.30.236   <none>        2181/TCP                     35m
    service/nuthan-kafka-zookeeper-nodes    ClusterIP   None            <none>        2181/TCP,2888/TCP,3888/TCP   35m

    NAME                                      READY   AGE
    statefulset.apps/nuthan-kafka-zookeeper   3/3     35m

Expected behavior

After ZooKeeper deploys successfully, the Kafka cluster should be deployed according to the YAML definition and the configured replica count.

Environment:

nuthanbn commented 4 years ago

I found a similar issue logged for an OpenShift environment: https://github.com/strimzi/strimzi-kafka-operator/issues/3616

scholzj commented 4 years ago

Can you share the logs from the Cluster Operator and Zookeeper pods?

nuthanbn commented 4 years ago

Please find attached the Kafka cluster YAML, Helm values, and the operator and ZooKeeper logs:

strimzi-operator.log
zookeeper.log
kafka-cluster.yaml.log
values.yaml.log

nuthanbn commented 4 years ago

@scholzj Thank you for the response. I was able to resolve this issue by specifying the properties below in the ZooKeeper YAML configuration. I also ran a quick test to produce and consume a message from a Kafka topic (a sketch of such a test follows the snippet below); it works as expected.

    jvmOptions:
      javaSystemProperties:
        - name: zookeeper.ssl.hostnameVerification
          value: "false"
        - name: zookeeper.ssl.quorum.hostnameVerification
          value: "false"
scholzj commented 4 years ago

Looks like you solved it before I managed to get to it. Great :-D.

lanzhiwang commented 3 years ago

I want to ask why Kafka can connect to ZooKeeper through localhost:2181. Kafka and ZooKeeper are in different pods, and my understanding is that they cannot reach each other directly through localhost.

lanzhiwang commented 3 years ago

Is it related to Stunnel?

scholzj commented 3 years ago

In the old versions, when ZooKeeper did not support TLS natively, Strimzi used TLS sidecars based on Stunnel. Kafka talked to the sidecar on localhost; the Stunnel sidecar took the connection, encrypted it, and passed it to the Stunnel sidecar in the ZooKeeper pods.
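To illustrate the pattern (a minimal sketch, not Strimzi's actual generated configuration; the service name and certificate paths are assumptions), a client-side Stunnel sidecar config would look roughly like this:

    ; Illustrative client-side Stunnel sidecar config (names and paths are hypothetical).
    ; Kafka talks plaintext to localhost:2181; Stunnel encrypts the traffic and
    ; forwards it to the Stunnel sidecar in the ZooKeeper pods.
    [zookeeper]
    client  = yes
    accept  = 127.0.0.1:2181
    connect = nuthan-kafka-zookeeper-client:2181
    cert    = /etc/tls-sidecar/certs/tls.crt
    key     = /etc/tls-sidecar/certs/tls.key
    CAfile  = /etc/tls-sidecar/certs/ca.crt
    verify  = 2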

lanzhiwang commented 3 years ago

Why is this done? Are there any considerations? Why not connect directly through the services associated with zk?

scholzj commented 3 years ago

> Why is this done?

To secure Kafka and ZooKeeper.

> Are there any considerations?

Considerations about what?

> Why not connect directly through the services associated with zk?

It connects using the ZooKeeper services.
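For reference, the per-pod addresses exposed through the headless nuthan-kafka-zookeeper-nodes service are visible in the status condition from the original report:

    # ZooKeeper connect string from the status condition above: each ensemble
    # member is addressed through the headless "-nodes" service
    nuthan-kafka-zookeeper-0.nuthan-kafka-zookeeper-nodes.kafka-poc.svc:2181,
    nuthan-kafka-zookeeper-1.nuthan-kafka-zookeeper-nodes.kafka-poc.svc:2181,
    nuthan-kafka-zookeeper-2.nuthan-kafka-zookeeper-nodes.kafka-poc.svc:2181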

martinep1 commented 2 years ago

@scholzj @nuthanbn Which ZooKeeper YAML did you update to resolve the issue, and where is it located? https://github.com/strimzi/strimzi-kafka-operator/issues/3692#issuecomment-696952238

scholzj commented 2 years ago

It is in the Kafka CR, under .spec.zookeeper, as sketched below.
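A sketch of the placement, reusing the cluster from this issue (only the jvmOptions block is the actual fix; the other fields are taken from the original CR):

    apiVersion: kafka.strimzi.io/v1beta1
    kind: Kafka
    metadata:
      name: nuthan-kafka
      namespace: kafka-poc
    spec:
      kafka:
        replicas: 3
        version: 2.4.1
        # ... listeners, config, storage as before ...
      zookeeper:
        replicas: 3
        storage:
          type: ephemeral
        # The fix from the comment above: extra ZooKeeper JVM system properties
        jvmOptions:
          javaSystemProperties:
            - name: zookeeper.ssl.hostnameVerification
              value: "false"
            - name: zookeeper.ssl.quorum.hostnameVerification
              value: "false"
        # ... liveness/readiness probes as before ...
      entityOperator:
        topicOperator: {}
        userOperator: {}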