strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug] Zookeeper getting into CrashLoopBackOff with error - java.io.IOException: keystore password was incorrect #3657

Closed sreenureddyy closed 2 years ago

sreenureddyy commented 4 years ago

Describe the bug: Trying to bring up the Kafka operator, and I am running into a ZooKeeper CrashLoopBackOff.

To Reproduce

Followed the Kubernetes Kind quickstart guide from https://strimzi.io/quickstarts/

  1. Installed the Strimzi custom resource definitions and the cluster operator
     kubectl apply -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
  2. Applied the Kafka cluster definition (kafka-operator.yaml, see below)
     kubectl apply -f kafka-operator.yaml -n kafka
  3. Get status of Kafka operator
     # kubectl get all -n kafka
     NAME                                            READY   STATUS             RESTARTS   AGE
     pod/my-cluster-zookeeper-0                      0/1     CrashLoopBackOff   4          6m1s
     pod/my-cluster-zookeeper-1                      0/1     CrashLoopBackOff   4          6m1s
     pod/my-cluster-zookeeper-2                      0/1     CrashLoopBackOff   4          6m1s
      pod/strimzi-cluster-operator-7d6cd6bdf7-zkh6z   1/1     Running            0          21m
  4. ZooKeeper logs showing the error below (a few extra diagnostic commands are sketched after this list)
     # kubectl logs -f pod/my-cluster-zookeeper-0 -n kafka
     Detected Zookeeper ID 1
     Preparing truststore
     Adding /opt/kafka/cluster-ca-certs/ca.crt to truststore /tmp/zookeeper/cluster.truststore.p12 with alias ca
     keytool error: java.io.IOException: keystore password was incorrect

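A few extra diagnostics that may help narrow this down (a sketch; the pod and namespace names are taken from the output above):

     # Events often reveal volume or mount problems that the container log does not
     kubectl describe pod my-cluster-zookeeper-0 -n kafka
     # Logs of the previous (crashed) container attempt
     kubectl logs my-cluster-zookeeper-0 -n kafka --previous
     # Status of the Kafka custom resource as seen by the operator
     kubectl get kafka my-cluster -n kafka -o yaml
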
Environment:

YAML files

Custom resources YAML file: https://strimzi.io/install/latest?namespace=kafka

YAML file used to deploy Kafka (kafka-operator.yaml):

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: my-cluster
spec:
  cruiseControl: {}
  kafka:
    version: 2.5.0
    replicas: 3
    listeners:
      plain: {}
      tls: {}
      external:
        type: loadbalancer
        tls: false
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      log.message.format.version: "2.5"
    storage:
      type: ephemeral
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}

scholzj commented 4 years ago

I cannot reproduce this. Have you tried to just delete the cluster and create a new one?
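
For reference, a minimal sketch of a full delete and recreate, assuming the resource names from the YAML above and default settings (the operator should then also clean up the secrets it generated for the cluster):

     # Delete the Kafka custom resource; the operator removes the pods and generated secrets
     kubectl delete kafka my-cluster -n kafka
     # Recreate it from the same file
     kubectl apply -f kafka-operator.yaml -n kafka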

sreenureddyy commented 4 years ago

I tried deleting the cluster and recreating a new one, but I am still getting the same error. I assume the secrets may have been created with corrupted encoding; correct me if I am wrong.
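
One way to check whether the generated secrets decode cleanly (a sketch, assuming Strimzi's usual <cluster>-cluster-ca-cert secret naming):

     # Decode the cluster CA certificate that the ZooKeeper pods mount and inspect it
     kubectl get secret my-cluster-cluster-ca-cert -n kafka \
       -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -subject -dates

If that prints a valid subject and validity dates, the secret content itself is not corrupted.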

scholzj commented 4 years ago

I don't think there is any encoding issue that can cause this. The password would be an autogenerated string inside the pod. This looks more like the file already existing inside the pod from some previous run or from another pod. You say that your infra is a VMware supervisor cluster, but what does that mean? Is that some sort of VMware Kubernetes distribution? It looks to me like the /tmp folders on your Kube cluster are not isolated but shared.
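
To illustrate that failure mode: if /tmp/zookeeper/cluster.truststore.p12 already exists and was created with a different (previous) random password, keytool fails exactly like in the log above. A local sketch, not the actual Strimzi startup script:

     # Throwaway CA certificate for the demo
     openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo-ca" -days 1 \
       -keyout /tmp/demo-ca.key -out /tmp/demo-ca.crt
     # "First run": the truststore is created with one generated password
     keytool -importcert -noprompt -alias ca -file /tmp/demo-ca.crt \
       -keystore /tmp/cluster.truststore.p12 -storetype PKCS12 -storepass firstpass
     # "Second run" against the leftover file with a newly generated password fails with
     # keytool error: java.io.IOException: keystore password was incorrect
     keytool -importcert -noprompt -alias ca -file /tmp/demo-ca.crt \
       -keystore /tmp/cluster.truststore.p12 -storetype PKCS12 -storepass secondpass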

sreenureddyy commented 4 years ago

Thank you @scholzj. Yes, the infra I am using is a VMware Kubernetes distribution. I will check the /tmp folder.

erickvica commented 3 years ago

@sreenureddyy How did you solve this problem in the end? I'm getting the same issue with similar infra.

sreenureddyy commented 3 years ago

@erickvica my issue is resolved. I followed the Kubernetes Kind quickstart from https://strimzi.io/quickstarts/.

erickvica commented 3 years ago

Mmm OK, something strange is happening for me. I'm using Minikube from the quickstart guide instead; the exercise works fine on my local PC, but not in a VMware VM. Thanks.

sreenureddyy commented 3 years ago

@erickvica - I remember I made a few changes to the Kafka CRD file to make it work on the VMware Kubernetes distribution (WCP, Workload Control Plane). Follow the procedure in the create-kafka-operator.sh file from my repo https://github.com/ysree/strimzi-kafka-operator. Don't run the last three steps if you are not using Fluent Bit.

erickvica commented 3 years ago

Thanks for the help

jrivers96 commented 3 years ago

I'm seeing this on a Kafka Connect pod on EKS 1.18.

(Edited to remove the mention of Strimzi 0.25 - we are on an earlier version of Strimzi just for the problematic connector.)

scholzj commented 3 years ago

@jrivers96 This was a problem in the code of much older Strimzi versions. Are you sure you are using Strimzi 0.25? With Kafka Connect, you have to make sure that it is not just the operator that is from Strimzi 0.25, but that the Connect image with the connectors is from Strimzi 0.25 as well. The other option is, as seemed to be the case in this issue, that it is caused by the infrastructure.
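
A quick way to see which image the Connect pods actually run (a sketch; the label assumes Strimzi's usual strimzi.io/kind labelling, and the namespace needs to match your deployment):

     kubectl get pods -n kafka -l strimzi.io/kind=KafkaConnect \
       -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'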

jrivers96 commented 3 years ago

Very rarely, I've seen us get misconfigured EKS boxes in our dev environment that have hypervisor isolation problems. Some of what you wrote above about the /tmp folder being shared seems to line up with that. This was the first occurrence in our production system, though.

scholzj commented 3 years ago

When the problem happened due to the Strimzi code, the logic was the following:

Assuming it is none of the above, it has to be something with the infra / storage. But I'm afraid I have no idea what.

jrivers96 commented 3 years ago

Yeah, apologies. Our connector is still on an old version of Strimzi. It's not the infra problem I've seen before, but it looked like it.

scholzj commented 2 years ago

Triaged on 7th July 2022: Should be fixed in the latest versions. The bug was present after the move to using an emptyDir volume for /tmp, which does not get deleted when the container restarts. That has been fixed for a long time now. This can be closed.
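
For context, a minimal illustration of why a file under /tmp can survive a container restart when /tmp is backed by an emptyDir volume (a demo pod only, not the Strimzi-generated manifest):

     apiVersion: v1
     kind: Pod
     metadata:
       name: emptydir-tmp-demo
     spec:
       restartPolicy: Always
       volumes:
         - name: tmp
           emptyDir: {}
       containers:
         - name: demo
           image: busybox
           # List /tmp, leave a marker file, then exit; after the container restarts,
           # the marker from the previous run is still there because the emptyDir
           # volume lives as long as the pod, not the container
           command: ["sh", "-c", "ls -l /tmp; touch /tmp/left-over-from-previous-run; sleep 5"]
           volumeMounts:
             - name: tmp
               mountPath: /tmp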