strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug] Zookeeper getting into CrashLoopBackOff with error - java.io.IOException: keystore password was incorrect #3657

Closed sreenureddyy closed 2 years ago

sreenureddyy commented 4 years ago

Describe the bug: Trying to bring up the Kafka operator, and I am running into a ZooKeeper CrashLoopBackOff.

To Reproduce

Followed the Kubernetes Kind quickstart guide from https://strimzi.io/quickstarts/

  1. Installed the Strimzi custom resource definitions and the cluster operator
     kubectl apply -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
  2. Applied the Kafka cluster definition (kafka-operator.yaml, see below)
     kubectl apply -f kafka-operator.yaml -n kafka
  3. Get status of Kafka operator
     # kubectl get all -n kafka
     NAME                                            READY   STATUS             RESTARTS   AGE
     pod/my-cluster-zookeeper-0                      0/1     CrashLoopBackOff   4          6m1s
     pod/my-cluster-zookeeper-1                      0/1     CrashLoopBackOff   4          6m1s
     pod/my-cluster-zookeeper-2                      0/1     CrashLoopBackOff   4          6m1s
      pod/strimzi-cluster-operator-7d6cd6bdf7-zkh6z   1/1     Running            0          21m
  4. ZooKeeper logs showing the error below (a few extra diagnostic commands are sketched after this list)
     # kubectl logs -f pod/my-cluster-zookeeper-0 -n kafka
     Detected Zookeeper ID 1
     Preparing truststore
     Adding /opt/kafka/cluster-ca-certs/ca.crt to truststore /tmp/zookeeper/cluster.truststore.p12 with alias ca
     keytool error: java.io.IOException: keystore password was incorrect

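A few extra diagnostics that may help narrow this down (a sketch; the pod and namespace names are taken from the output above):

     # Events often reveal volume or mount problems that the container log does not
     kubectl describe pod my-cluster-zookeeper-0 -n kafka
     # Logs of the previous (crashed) container attempt
     kubectl logs my-cluster-zookeeper-0 -n kafka --previous
     # Status of the Kafka custom resource as seen by the operator
     kubectl get kafka my-cluster -n kafka -o yaml
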
Environment:

YAML files

Custom resources YAML file: https://strimzi.io/install/latest?namespace=kafka

YAML file used to deploy Kafka (kafka-operator.yaml):

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: my-cluster
spec:
  cruiseControl: {}
  kafka:
    version: 2.5.0
    replicas: 3
    listeners:
      plain: {}
      tls: {}
      external:
        type: loadbalancer
        tls: false
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      log.message.format.version: "2.5"
    storage:
      type: ephemeral
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}

scholzj commented 4 years ago

I cannot reproduce this. Have you tried to just delete the cluster and create a new one?
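
For reference, a minimal sketch of a full delete and recreate, assuming the resource names from the YAML above and default settings (the operator should then also clean up the secrets it generated for the cluster):

     # Delete the Kafka custom resource; the operator removes the pods and generated secrets
     kubectl delete kafka my-cluster -n kafka
     # Recreate it from the same file
     kubectl apply -f kafka-operator.yaml -n kafka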

sreenureddyy commented 4 years ago

I tried deleting the cluster and recreating a new one, but I am still getting the same error. I assume the secrets may have been created with corrupted encoding; correct me if I am wrong.
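
One way to check whether the generated secrets decode cleanly (a sketch, assuming Strimzi's usual <cluster>-cluster-ca-cert secret naming):

     # Decode the cluster CA certificate that the ZooKeeper pods mount and inspect it
     kubectl get secret my-cluster-cluster-ca-cert -n kafka \
       -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -subject -dates

If that prints a valid subject and validity dates, the secret content itself is not corrupted.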

scholzj commented 4 years ago

I don't think there is any encoding issue that can cause this. The password would be an autogenerated string inside the pod. This looks more like the file already existing inside the pod from some previous run or from another pod. You say that your infra is a VMware supervisor cluster, but what does that mean? Is that some sort of VMware Kubernetes distribution? It looks to me like the /tmp folders on your Kube cluster are not isolated but shared.
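
To illustrate that failure mode: if /tmp/zookeeper/cluster.truststore.p12 already exists and was created with a different (previous) random password, keytool fails exactly like in the log above. A local sketch, not the actual Strimzi startup script:

     # Throwaway CA certificate for the demo
     openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo-ca" -days 1 \
       -keyout /tmp/demo-ca.key -out /tmp/demo-ca.crt
     # "First run": the truststore is created with one generated password
     keytool -importcert -noprompt -alias ca -file /tmp/demo-ca.crt \
       -keystore /tmp/cluster.truststore.p12 -storetype PKCS12 -storepass firstpass
     # "Second run" against the leftover file with a newly generated password fails with
     # keytool error: java.io.IOException: keystore password was incorrect
     keytool -importcert -noprompt -alias ca -file /tmp/demo-ca.crt \
       -keystore /tmp/cluster.truststore.p12 -storetype PKCS12 -storepass secondpass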

sreenureddyy commented 4 years ago

Thank you @scholzj. Yes, the infra I am using is a VMware Kubernetes distribution. I will check the /tmp folder.

erickvica commented 3 years ago

@sreenureddyy How did you solve this problem in the end? I'm getting the same issue with similar infra.

sreenureddyy commented 3 years ago

@erickvica my issue is resolved. I followed the Kubernetes Kind quickstart from https://strimzi.io/quickstarts/.

erickvica commented 3 years ago

Mmm OK, something strange is happening for me. I'm using Minikube from the quickstart guide instead; the exercise works fine on my local PC, but not in a VMware VM. Thanks.

sreenureddyy commented 3 years ago

@erickvica - I remember I made a few changes to the Kafka CRD file to make it work on the VMware Kubernetes distribution (WCP, Workload Control Plane). Follow the procedure in the create-kafka-operator.sh file from my repo https://github.com/ysree/strimzi-kafka-operator. Don't run the last three steps if you are not using Fluent Bit.

erickvica commented 3 years ago

Thanks for the help

jrivers96 commented 3 years ago

I'm seeing this on a Kafka Connect pod on EKS 1.18.

(Edited to remove the mention of Strimzi 0.25 - we are on an earlier version of Strimzi just for the problematic connector.)

scholzj commented 3 years ago

@jrivers96 This was a problem in the code of much older Strimzi versions. Are you sure you are using Strimzi 0.25? With Kafka Connect, you have to make sure that it is not just the operator that is from Strimzi 0.25, but that the Connect image with the connectors is from Strimzi 0.25 as well. The other option is, as seemed to be the case in this issue, that it is caused by the infrastructure.
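
A quick way to see which image the Connect pods actually run (a sketch; the label assumes Strimzi's usual strimzi.io/kind labelling, and the namespace needs to match your deployment):

     kubectl get pods -n kafka -l strimzi.io/kind=KafkaConnect \
       -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'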

jrivers96 commented 3 years ago

Very rarely, I've seen us get misconfigured EKS boxes in our dev environment that have hypervisor isolation problems. Some of what you wrote above about the /tmp folder being shared seems to line up with that. This was the first occurrence in our production system, though.

scholzj commented 3 years ago

When the problem happened due to the Strimzi code, the logic was the following:

Assuming it is none of the above, it has to be something with the infra / storage. But I'm afraid I have no idea what.

jrivers96 commented 3 years ago

Yeah, apologies. Our connector is still on an old version of Strimzi. It's not the infra problem I've seen before, but it looked like it.

scholzj commented 2 years ago

Triaged on 7th July 2022: Should be fixed in the latest versions. The bug was present after the move to using an emptyDir volume for /tmp, which does not get deleted when the container restarts. That has been fixed for a long time now. This can be closed.
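
For context, a minimal illustration of why a file under /tmp can survive a container restart when /tmp is backed by an emptyDir volume (a demo pod only, not the Strimzi-generated manifest):

     apiVersion: v1
     kind: Pod
     metadata:
       name: emptydir-tmp-demo
     spec:
       restartPolicy: Always
       volumes:
         - name: tmp
           emptyDir: {}
       containers:
         - name: demo
           image: busybox
           # List /tmp, leave a marker file, then exit; after the container restarts,
           # the marker from the previous run is still there because the emptyDir
           # volume lives as long as the pod, not the container
           command: ["sh", "-c", "ls -l /tmp; touch /tmp/left-over-from-previous-run; sleep 5"]
           volumeMounts:
             - name: tmp
               mountPath: /tmp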