pravega / zookeeper-operator

Kubernetes Operator for Zookeeper
Apache License 2.0

PersistentVolumeClaimSpec.AccessModes in Zookeeper-operator is ineffective #517

Open hoyhbx opened 1 year ago

hoyhbx commented 1 year ago

Description

We found that a Zookeeper pod keeps crashing after a scale-down followed by a scale-up. The crash happens because the new pods reuse the PVCs that were left undeleted during the scale-down. The problem persists even when we specify reclaimPolicy: Delete.
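To make the reuse concrete: a StatefulSet names each PVC <claim template>-<StatefulSet name>-<ordinal>, and a pod with a given ordinal binds to whichever PVC already carries that name instead of getting a fresh one. The following Go sketch lists the leftover PVCs that a scale-up would attach again. The helper names, the "data" claim template and the "zk" StatefulSet name are assumptions for illustration, not taken from the operator.

```go
package main

import "fmt"

// pvcName follows the StatefulSet convention <claimTemplate>-<statefulSet>-<ordinal>.
func pvcName(claimTemplate, statefulSet string, ordinal int) string {
	return fmt.Sprintf("%s-%s-%d", claimTemplate, statefulSet, ordinal)
}

// reusedPVCs (hypothetical helper) returns the already-existing PVCs that pods
// with ordinals current..desired-1 would bind to when scaling up.
func reusedPVCs(existing []string, claimTemplate, statefulSet string, current, desired int) []string {
	present := make(map[string]bool, len(existing))
	for _, name := range existing {
		present[name] = true
	}
	var reused []string
	for i := current; i < desired; i++ {
		if name := pvcName(claimTemplate, statefulSet, i); present[name] {
			reused = append(reused, name)
		}
	}
	return reused
}

func main() {
	// A 3 -> 1 scale-down left data-zk-1 and data-zk-2 behind;
	// scaling back to 3 reattaches both of them.
	existing := []string{"data-zk-0", "data-zk-1", "data-zk-2"}
	fmt.Println(reusedPVCs(existing, "data", "zk", 1, 3)) // [data-zk-1 data-zk-2]
}
```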

The root cause is that nothing prevents a scale-up before the orphaned PVCs have been deleted. The zookeeper-operator checks whether an upgrade has failed before updating the StatefulSet, but that check does not account for undeleted orphan PVCs.

The orphaned PVCs then never get deleted, because the operator waits for the number of ready replicas to equal the desired replicas, which never happens while one pod keeps crashing.
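One way to picture the missing guard is as a small decision function for the reconcile loop. In the following Go sketch every type and name is hypothetical, and it assumes orphan cleanup is gated on the StatefulSet's current replica count rather than the newly requested count, so that blocking the scale-up cannot itself stall the cleanup. It illustrates the ordering only; it is not the operator's actual reconcile code.

```go
package main

import "fmt"

// ClusterState is a hypothetical snapshot of what a reconcile pass observes.
type ClusterState struct {
	DesiredReplicas int32    // replicas requested in the ZookeeperCluster spec
	StsReplicas     int32    // replicas currently set on the StatefulSet
	ReadyReplicas   int32    // ready pods reported in the StatefulSet status
	OrphanPVCs      []string // PVCs whose ordinal is >= StsReplicas
}

// Action is the hypothetical next step the reconciler takes.
type Action string

const (
	DeleteOrphans Action = "delete-orphan-pvcs"
	Requeue       Action = "requeue"
	ScaleUp       Action = "scale-up"
	Noop          Action = "noop" // scale-down handling is omitted in this sketch
)

// nextAction deletes orphans once the *current* StatefulSet is fully ready
// and refuses to scale up while any orphan PVC still exists.
func nextAction(s ClusterState) Action {
	if len(s.OrphanPVCs) > 0 {
		if s.ReadyReplicas == s.StsReplicas {
			return DeleteOrphans
		}
		// Still settling after the scale-down: leave the replica count alone.
		return Requeue
	}
	if s.DesiredReplicas > s.StsReplicas {
		return ScaleUp
	}
	return Noop
}

func main() {
	// Scaled 3 -> 1, pod 0 is ready, and the user immediately asks for 3 again.
	s := ClusterState{DesiredReplicas: 3, StsReplicas: 1, ReadyReplicas: 1,
		OrphanPVCs: []string{"data-zk-1", "data-zk-2"}}
	fmt.Println(nextAction(s)) // delete-orphan-pvcs
	s.OrphanPVCs = nil
	fmt.Println(nextAction(s)) // scale-up
}
```

With this ordering, a quick resize from 3 to 1 and back to 3 first removes data-zk-1 and data-zk-2, and only a later pass raises the StatefulSet to three replicas, so the recreated pods start on fresh volumes.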

Importance

Importance: should-have

Location

The PVC's AccessModes are always overridden with the predefined ReadWriteOnce: https://github.com/pravega/zookeeper-operator/blob/c29d475794226ce38de021d72430ccde75a38d2a/api/v1beta1/zookeepercluster_types.go#L758

Suggestions for an improvement

Since PersistentVolumeClaimSpec.AccessModes is an exposed field, users should be able to customize the PVC's access modes through it.
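A minimal sketch of that first option: apply ReadWriteOnce only as a default when the user left AccessModes empty, instead of overwriting it unconditionally. To keep the example self-contained it declares local stand-ins for the k8s.io/api/core/v1 types, and the trimmed Persistence struct and withDefaults method are assumptions modeled loosely on the operator's defaulting, not its real code.

```go
package main

import "fmt"

// Local stand-ins for k8s.io/api/core/v1 so the sketch compiles on its own.
type PersistentVolumeAccessMode string

const (
	ReadWriteOnce PersistentVolumeAccessMode = "ReadWriteOnce"
	ReadWriteMany PersistentVolumeAccessMode = "ReadWriteMany"
)

type PersistentVolumeClaimSpec struct {
	AccessModes []PersistentVolumeAccessMode
}

// Persistence is a trimmed, hypothetical version of the operator's persistence settings.
type Persistence struct {
	PersistentVolumeClaimSpec PersistentVolumeClaimSpec
}

// withDefaults sets ReadWriteOnce only when no access mode was given and
// reports whether it changed anything; a user-supplied value is left untouched.
func (p *Persistence) withDefaults() bool {
	if len(p.PersistentVolumeClaimSpec.AccessModes) == 0 {
		p.PersistentVolumeClaimSpec.AccessModes = []PersistentVolumeAccessMode{ReadWriteOnce}
		return true
	}
	return false
}

func main() {
	unset := Persistence{}
	changed := unset.withDefaults()
	fmt.Println(changed, unset.PersistentVolumeClaimSpec.AccessModes) // true [ReadWriteOnce]

	custom := Persistence{PersistentVolumeClaimSpec: PersistentVolumeClaimSpec{
		AccessModes: []PersistentVolumeAccessMode{ReadWriteMany}}}
	changed = custom.withDefaults()
	fmt.Println(changed, custom.PersistentVolumeClaimSpec.AccessModes) // false [ReadWriteMany]
}
```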

If, for whatever reason, Zookeeper can only use the ReadWriteOnce mode, this field should not be exposed to users.
