stackabletech / kafka-operator

Stackable Operator for Apache Kafka

Re-Installing Kafka fails - Cluster id doesn't match #609

Closed: paulocabrita-ionos closed this issue 1 year ago

paulocabrita-ionos commented 1 year ago

Affected version

23.4.1

Current and expected behavior

Current: re-installing Kafka fails with the error "The Cluster ID 4a...... doesn't match stored clusterId Some(Qgt.....) in meta.properties."

Expected: the Kafka cluster can be re-installed without issue.

Steps done:

Possible solution

No response

Additional context

(With @adwk67)

Some measures applied:

Related:

Environment

Client Version: v1.25.3
Kustomize Version: v4.5.7
Server Version: v1.25.6

Would you like to work on fixing this bug?

None

adwk67 commented 1 year ago

Reproducible in a local kind cluster (using zookeeper 23.4.0 and bumping kafka from 23.4.0 to 23.4.1); a log check to confirm the exception is sketched after the script:

kind delete clusters --all
kind create cluster
stackablectl op in zookeeper=23.4.0 secret=23.4.0 commons=23.4.0 kafka=23.4.0

# 23.4.0 versions
kubectl apply -f - <<EOF
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: simple-zk
spec:
  image:
    productVersion: 3.8.0
    stackableVersion: 23.4.0
  servers:
    roleGroups:
      primary:
        replicas: 1
        config:
          myidOffset: 10
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-zk-znode
spec:
  clusterRef:
    name: simple-zk
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-kafka-znode
spec:
  clusterRef:
    name: simple-zk
---
apiVersion: kafka.stackable.tech/v1alpha1
kind: KafkaCluster
metadata:
  name: simple-kafka
spec:
  image:
    productVersion: 3.3.1
    stackableVersion: 23.4.0
  clusterConfig:
    zookeeperConfigMapName: simple-kafka-znode
  brokers:
    roleGroups:
      default:
        replicas: 1
EOF

# wait until all pods are ready...

kubectl delete -f - <<EOF
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-kafka-znode
---
apiVersion: kafka.stackable.tech/v1alpha1
kind: KafkaCluster
metadata:
  name: simple-kafka
EOF

# wait until all pods are terminated...

stackablectl op un kafka
# update kafka CRD
kubectl replace -f https://raw.githubusercontent.com/stackabletech/kafka-operator/23.4.1/deploy/helm/kafka-operator/crds/crds.yaml

stackablectl op in kafka=23.4.1

# 23.4.1 version
kubectl apply -f - <<EOF
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-kafka-znode
spec:
  clusterRef:
    name: simple-zk
---
apiVersion: kafka.stackable.tech/v1alpha1
kind: KafkaCluster
metadata:
  name: simple-kafka
spec:
  image:
    productVersion: 3.3.1
    stackableVersion: 23.4.1
  clusterConfig:
    zookeeperConfigMapName: simple-kafka-znode
  brokers:
    roleGroups:
      default:
        replicas: 1
EOF

# end of reproducible example
# kafka.common.InconsistentClusterIdException: The Cluster ID LrWO5WmGSLKWuKoIyJhvew doesn't match stored clusterId Some(KuPjNmJjQXqWyfA20Ovj2A) in meta.properties.

kubectl delete -f - <<EOF
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-kafka-znode
---
apiVersion: kafka.stackable.tech/v1alpha1
kind: KafkaCluster
metadata:
  name: simple-kafka
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: simple-zk
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-zk-znode
EOF

# now delete the pvcs too:
# data-simple-zk-server-primary-0
# log-dirs-simple-kafka-broker-default-0

kubectl delete pvc data-simple-zk-server-primary-0
kubectl delete pvc log-dirs-simple-kafka-broker-default-0

kubectl apply -f - <<EOF
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: simple-zk
spec:
  image:
    productVersion: 3.8.0
    stackableVersion: 23.4.0
  servers:
    roleGroups:
      primary:
        replicas: 1
        config:
          myidOffset: 10
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-zk-znode
spec:
  clusterRef:
    name: simple-zk
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-kafka-znode
spec:
  clusterRef:
    name: simple-zk
---
apiVersion: kafka.stackable.tech/v1alpha1
kind: KafkaCluster
metadata:
  name: simple-kafka
spec:
  image:
    productVersion: 3.3.1
    stackableVersion: 23.4.1
  clusterConfig:
    zookeeperConfigMapName: simple-kafka-znode
  brokers:
    roleGroups:
      default:
        replicas: 1
EOF
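
To confirm the failure at this point, the exception can be looked for in the broker log. This is only a sketch: the pod name simple-kafka-broker-default-0 is inferred from the PVC name used above, and the container name kafka is an assumption.

kubectl logs simple-kafka-broker-default-0 -c kafka | grep InconsistentClusterIdException
# add --previous if the broker container has already restarted after the crash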
adwk67 commented 1 year ago

The fix above (i.e. everything after # end of reproducible example) can be simplified to just deleting the kafka cluster and znode, deleting the logs pvc, and re-deploying the kafka cluster and znode:

kubectl delete -f - <<EOF
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-kafka-znode
---
apiVersion: kafka.stackable.tech/v1alpha1
kind: KafkaCluster
metadata:
  name: simple-kafka
EOF

# now delete the pvc too: log-dirs-simple-kafka-broker-default-0
kubectl delete pvc log-dirs-simple-kafka-broker-default-0

kubectl apply -f - <<EOF
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-kafka-znode
spec:
  clusterRef:
    name: simple-zk
---
apiVersion: kafka.stackable.tech/v1alpha1
kind: KafkaCluster
metadata:
  name: simple-kafka
spec:
  image:
    productVersion: 3.3.1
    stackableVersion: 23.4.1
  clusterConfig:
    zookeeperConfigMapName: simple-kafka-znode
  brokers:
    roleGroups:
      default:
        replicas: 1
EOF
paulocabrita-ionos commented 1 year ago

I think the PVC is the issue. I deleted the PVC as you said, installed again and... it worked like a charm.

adwk67 commented 1 year ago

/stackable/config/server.properties sets log.dirs=/stackable/data/topicdata, and /stackable/data/topicdata/meta.properties contains an entry such as cluster.id=DN1ekhLQQm68zv7-MsMriw, so it makes sense that deleting the logs pvc clears the stored cluster.id.
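
For reference, this can be checked in the running broker pod (a sketch; the pod name simple-kafka-broker-default-0 is inferred from the PVC name and the container name kafka is an assumption):

kubectl exec simple-kafka-broker-default-0 -c kafka -- cat /stackable/data/topicdata/meta.properties
# expected output contains a line like: cluster.id=DN1ekhLQQm68zv7-MsMriw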

adwk67 commented 1 year ago

To summarise:

Still to do:

adwk67 commented 1 year ago

This issue can be closed: deleting the ZNode (simple-kafka-znode) removes the cluster co-ordination state. If the ZNode is not deleted, the other steps (removing the kafka cluster, upgrading the operator and then deploying the kafka cluster again) do not require the logs pvcs to be removed.
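
Put as commands, a sketch of that ZNode-preserving upgrade path (reusing the resource names and manifests from above; simple-kafka-znode and the logs pvc are left untouched):

kubectl delete -f - <<EOF
---
apiVersion: kafka.stackable.tech/v1alpha1
kind: KafkaCluster
metadata:
  name: simple-kafka
EOF

# wait until the broker pods are terminated, then upgrade the operator
stackablectl op un kafka
kubectl replace -f https://raw.githubusercontent.com/stackabletech/kafka-operator/23.4.1/deploy/helm/kafka-operator/crds/crds.yaml
stackablectl op in kafka=23.4.1

# re-deploy the same KafkaCluster against the retained znode; the stored cluster ID still matches meta.properties
kubectl apply -f - <<EOF
---
apiVersion: kafka.stackable.tech/v1alpha1
kind: KafkaCluster
metadata:
  name: simple-kafka
spec:
  image:
    productVersion: 3.3.1
    stackableVersion: 23.4.1
  clusterConfig:
    zookeeperConfigMapName: simple-kafka-znode
  brokers:
    roleGroups:
      default:
        replicas: 1
EOF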