`bash: /stackable/data/myid: Permission denied`

Nicklason commented 1 year ago

Affected version

23.7.0

Current and expected behavior

Steps to reproduce:

1. Start minikube: `minikube start`
2. Install the operators (hdfs-operator, zookeeper-operator, commons-operator, secret-operator)
3. Cordon the node: `kubectl cordon minikube`
4. Add a worker node: `minikube node add`
5. Apply the ZookeeperCluster resource (see below)
6. Pod status: `Init:CrashLoopBackOff`

I apply a basic ZookeeperCluster resource. It creates one pod but the init container crashes and the pod is stuck in Init:CrashLoopBackOff.

Possible solution

If the ZooKeeper pod runs on the "minikube" node (the control plane node), it works; if it runs on a worker node, it does not.
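
To check which node the pod actually landed on, something like the following could be used. This is a minimal sketch: `pod_node` is a hypothetical helper name, and the pod name `simple-zk-server-default-0` is taken from the log command further down in this report.

```shell
# Print the node a pod was scheduled on (empty output if it is not scheduled yet).
# pod_node is a hypothetical helper, not part of kubectl or the operator.
pod_node() {
  # $1: pod name, $2: namespace (defaults to "default")
  kubectl get pod "$1" --namespace "${2:-default}" -o jsonpath='{.spec.nodeName}'
}
```

Calling `pod_node simple-zk-server-default-0` should print the node name, which can then be compared against the output of `kubectl get nodes`.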

This issue may be related to #357.

Additional context

ZooKeeper cluster:

apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: simple-zk
spec:
  image:
    productVersion: 3.8.0
    stackableVersion: 23.7.0
  servers:
    roleGroups:
      default:
        replicas: 1

Error:

kubectl logs pods/simple-zk-server-default-0 prepare -f
copying /stackable/config to /stackable/rwconfig
bash: /stackable/data/myid: Permission denied

Environment

minikube cluster using docker driver with one control plane node and one worker node.

Kubernetes: v1.27.4
minikube: v1.31.2
zookeeper-operator: v23.7.0

Would you like to work on fixing this bug?

None

lfrancke commented 1 year ago

First of all thank you for the report and the detailed steps to reproduce this. We will take a look tomorrow.

soenkeliebau commented 1 year ago

Hi @Nicklason, this looks like it relates to a known limitation of the minikube storage implementation. I found https://github.com/kubernetes/minikube/issues/12360, which sounds closely related. Apparently the storage implementation in minikube is really simple and may not work for edge cases like multi-node clusters.

I was able to reproduce your issue and tried the fix mentioned in a comment and that seemed to fix it for me.

Basically, before deploying your ZooKeeper object run the following:

minikube addons disable storage-provisioner
minikube addons disable default-storageclass
minikube addons enable volumesnapshots
minikube addons enable csi-hostpath-driver
kubectl patch storageclass csi-hostpath-sc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
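
To confirm that the patch above took effect, the annotation can be read back. This is a minimal sketch: `is_default_storageclass` is a hypothetical helper name, and it assumes the storage class is called `csi-hostpath-sc` as in the patch command.

```shell
# Succeeds only if the given StorageClass carries the default-class annotation.
# is_default_storageclass is a hypothetical helper, not part of kubectl or minikube.
is_default_storageclass() {
  # $1: StorageClass name; dots in the annotation key are escaped for JSONPath
  value=$(kubectl get storageclass "$1" \
    -o jsonpath='{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}')
  [ "$value" = "true" ]
}
```

For example, `is_default_storageclass csi-hostpath-sc && echo "default storage class is set"`. Running `kubectl get storageclass` should also mark the default class with `(default)` next to its name.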

One thing to be aware of: for me, the CSI driver bound the volume to the cordoned node, which left the pod stuck in "Pending" because no node was available to schedule it on. Deleting the PVC fixed that for me, as it was recreated and placed on the non-cordoned node.
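
The PVC deletion described above could be scripted roughly as follows. This is only a sketch: `recreate_pvc` is a hypothetical helper name, and the actual PVC name is not given in this thread, so look it up with `kubectl get pvc` first.

```shell
# Delete a PVC so that it can be provisioned again on a schedulable node.
# recreate_pvc is a hypothetical helper; the PVC name must be looked up first.
recreate_pvc() {
  # $1: PVC name, $2: namespace (defaults to "default")
  # Abort if the PVC does not exist, so a typo does not go unnoticed.
  kubectl get pvc "$1" --namespace "${2:-default}" || return 1
  # --wait=false returns immediately instead of blocking on finalizers.
  kubectl delete pvc "$1" --namespace "${2:-default}" --wait=false
}
```

Usage would be `recreate_pvc <pvc-name>`, with the name taken from the `kubectl get pvc` output. Note that deleting a PVC may also remove the data on the underlying volume, which is acceptable on a fresh test cluster but not elsewhere.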

lfrancke commented 1 year ago

If this fixes it we should document this.

soenkeliebau commented 1 year ago

In principle I agree, and for this specific case I am happy to add it somewhere. However, this will probably be the start of an entire section, "known shortcomings of various Kubernetes distros", in our documentation, which can easily become a bottomless pit :)

maltesander commented 1 year ago

Yeah, a section like that could help but will be hard to maintain. How about we instead adapt the issue template and add a hint such as "Did you try to reproduce the issue locally on Kind / K3s?", since I assume this is what we mostly use locally to test? Then it's easier to determine whether it's a bug on our side or in a Kubernetes distro.

lfrancke commented 1 year ago

We already have this for other distros: https://docs.stackable.tech/home/nightly/secret-operator/installation.html#_huawei_cloud

Nicklason commented 1 year ago

@soenkeliebau Thanks a lot for the quick help. I just followed the steps you provided and that resulted in the pod starting properly. Just as a note: the patch command to make the CSI provisioner storage class the default did not work; it looks like GitHub formatted it as a link.

maltesander commented 1 year ago

Thanks for the hint about the GitHub link, @Nicklason. I adapted @soenkeliebau's comment. Closing this.

stackabletech / zookeeper-operator