hetznercloud / csi-driver

Kubernetes Container Storage Interface driver for Hetzner Cloud Volumes
MIT License

hcloud-csi-node stuck at "ContainerCreating" on k0s #357

Closed mhutter closed 1 year ago

mhutter commented 1 year ago

This was already reported in #260 but was never fixed.

Should I prepare a PR for this?

Steps to reproduce

  1. Set up a k0s cluster
  2. Install the hcloud-csi driver as described in the README (kubectl apply -f https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.1.0/deploy/kubernetes/hcloud-csi.yml)

Expected outcome

The driver starts up

Actual outcome

All hcloud-csi-node pods are stuck in ContainerCreating with the following event:

Warning FailedMount 14s (x8 over 77s) kubelet MountVolume.SetUp failed for volume "registration-dir" : hostPath type check failed: /var/lib/kubelet/plugins_registry/ is not a directory
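
The event says that the hostPath does not exist as a directory on the node. A minimal sketch for confirming this directly on a worker node follows; the helper name missing_dirs is made up for illustration, and the second path in the usage comment is the k0s location that appears further down in this issue.

```shell
# Print every argument that is not an existing directory on this node.
missing_dirs() {
  for dir in "$@"; do
    [ -d "$dir" ] || printf '%s\n' "$dir"
  done
}

# Usage on a worker node (not run here):
#   missing_dirs /var/lib/kubelet/plugins_registry/ /var/lib/k0s/kubelet/plugins_registry/
```

If only the first path is printed, the kubelet on that node is probably using a non-default root directory.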

Fix

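In the hcloud-csi-node DaemonSet, change hostPath.type of the registration-dir volume to DirectoryOrCreate.

Update: as discussed in the comments below, this alone is not enough on k0s. The following kustomization contains the patches that are actually required: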

resources:
  - ./token.json  # SealedSecret with the hcloud API token
  - https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.1.0/deploy/kubernetes/hcloud-csi.yml

patchesStrategicMerge:
  - |-
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: hcloud-csi-node
      namespace: kube-system
    spec:
      template:
        spec:
          containers:
            - name: csi-node-driver-registrar
              args:
                - --kubelet-registration-path=/var/lib/k0s/kubelet/plugins/csi.hetzner.cloud/socket
            - name: hcloud-csi-driver
              volumeMounts:
                - name: kubelet-dir
                  mountPath: /var/lib/k0s/kubelet
                  mountPropagation: Bidirectional
          volumes:
            - name: kubelet-dir
              hostPath:
                path: /var/lib/k0s/kubelet
            - name: plugin-dir
              hostPath:
                path: /var/lib/k0s/kubelet/plugins/csi.hetzner.cloud/
            - name: registration-dir
              hostPath:
                path: /var/lib/k0s/kubelet/plugins_registry/

mhutter commented 1 year ago

Digging into this, it turns out that while patching the DaemonSet allows the pods to come up, it is not the correct fix.

The reason is that k0s starts the kubelet with non-standard directories:

/var/lib/k0s/bin/kubelet \
  --cert-dir=/var/lib/k0s/kubelet/pki \
  --container-runtime-endpoint=unix:///run/k0s/containerd.sock \
  --config=/var/lib/k0s/kubelet-config.yaml \
  --kubeconfig=/var/lib/k0s/kubelet.conf \
  --v=1 \
  --containerd=/run/k0s/containerd.sock \
  --node-ip=10.42.0.2 \
  --runtime-cgroups=/system.slice/containerd.service \
  --root-dir=/var/lib/k0s/kubelet \
  --bootstrap-kubeconfig=/var/lib/k0s/kubelet-bootstrap.conf

So the actual fix for K0s would be to change all mounts from /var/lib/kubelet to /var/lib/k0s/kubelet.

However, I have no clue how this could be detected...
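
One possible way to detect it on a node, sketched under assumptions: read the kubelet's own command line and look for --root-dir. The function name kubelet_root_dir is hypothetical, only the --root-dir=VALUE form shown above is handled, and the fallback is the upstream default /var/lib/kubelet. This runs on the node, so the result would still have to be fed into the manifest by hand or by tooling.

```shell
# Print the kubelet root directory given the kubelet's command-line arguments.
# Only the --root-dir=VALUE form (as in the k0s command line above) is handled;
# without it, the upstream default /var/lib/kubelet is printed.
kubelet_root_dir() {
  for arg in "$@"; do
    case "$arg" in
      --root-dir=*)
        printf '%s\n' "${arg#--root-dir=}"
        return 0
        ;;
    esac
  done
  printf '%s\n' /var/lib/kubelet
}

# Usage on a node (assumes exactly one process named "kubelet" and no
# whitespace inside its arguments; not run here):
#   kubelet_root_dir $(tr '\0' ' ' < "/proc/$(pgrep -x kubelet)/cmdline")
```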

mhutter commented 1 year ago

My current workaround is to apply the following patch via kustomization:

resources:
  - ./token.json  # SealedSecret with the hcloud API token
  - https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.1.0/deploy/kubernetes/hcloud-csi.yml

patchesStrategicMerge:
  - |-
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: hcloud-csi-node
      namespace: kube-system
    spec:
      template:
        spec:
          volumes:
          - hostPath:
              path: /var/lib/k0s/kubelet
            name: kubelet-dir
          - hostPath:
              path: /var/lib/k0s/kubelet/plugins/csi.hetzner.cloud/
            name: plugin-dir
          - hostPath:
              path: /var/lib/k0s/kubelet/plugins_registry/
            name: registration-dir
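
An untested alternative to the patch, in case kustomize is not available: rewrite the default kubelet root in the upstream manifest before applying it. This is only a sketch. The function name rewrite_kubelet_root is made up, and it replaces every occurrence of /var/lib/kubelet, including container arguments and mount paths, which goes further than the volumes-only patch above.

```shell
# Read a manifest on stdin and write it to stdout with every occurrence of the
# default kubelet root /var/lib/kubelet replaced by the directory given as $1.
# Assumes the new root contains none of the characters "|", "&" or "\",
# which are special in this sed expression.
rewrite_kubelet_root() {
  sed "s|/var/lib/kubelet|$1|g"
}

# Usage (not run here):
#   curl -fsSL https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.1.0/deploy/kubernetes/hcloud-csi.yml \
#     | rewrite_kubelet_root /var/lib/k0s/kubelet \
#     | kubectl apply -f -
```

Running it twice is harmless, because /var/lib/k0s/kubelet does not itself contain the string /var/lib/kubelet.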

apricote commented 1 year ago

AFAICT we cannot just change it to DirectoryOrCreate, because the kubelet is actively monitoring the directory, and on k0s it is monitoring a different path. So even though the plugin would start up with DirectoryOrCreate, it would never get registered with the kubelet and you could not mount volumes.

> My current workaround is to apply the following patch via kustomization: [kustomization quoted from the previous comment]

I think this is a great solution. Maybe we should publish a Helm Chart to make configuring such things easier.

mhutter commented 1 year ago

> Maybe we should publish a Helm Chart to make configuring such things easier

While researching the issue, I found that some other CSI providers take this approach.

I have also opened https://github.com/k0sproject/k0s/issues/2599 to at least get some documentation on what else is special about k0s setups.

mhutter commented 1 year ago

Apparently it is a bit more involved to get this running.

I had to extend the patch as follows just to get the csi-node pods running:

patchesStrategicMerge:
  - |-
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: hcloud-csi-node
      namespace: kube-system
    spec:
      template:
        spec:
          containers:
            - name: csi-node-driver-registrar
              args:
                - --kubelet-registration-path=/var/lib/k0s/kubelet/plugins/csi.hetzner.cloud/socket
          volumes:
          - name: kubelet-dir
            hostPath:
              path: /var/lib/k0s/kubelet
          - name: plugin-dir
            hostPath:
              path: /var/lib/k0s/kubelet/plugins/csi.hetzner.cloud/
          - name: registration-dir
            hostPath:
              path: /var/lib/k0s/kubelet/plugins_registry/

However, it still does not work.

Given the following test manifest:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: hcloud-volumes

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
        - name: busybox
          image: docker.io/library/busybox
          ports:
            - containerPort: 80
          volumeMounts:
          - mountPath: "/data"
            name: my-csi-volume
          command: [ "sleep", "1000000" ]
      volumes:
        - name: my-csi-volume
          persistentVolumeClaim:
            claimName: csi-pvc

The PVC gets properly bound, the Pod comes up. The volume is attached to the correct server, and the CSI driver reports successful mounting of the volume:

level=info ts=2023-01-16T15:38:34.234862716Z component=linux-mount-service msg="formatting disk" disk=/dev/disk/by-id/scsi-0HC_Volume_26750481 fstype=ext4
level=info ts=2023-01-16T15:38:34.792342371Z component=linux-mount-service msg="publishing volume" target-path=/var/lib/k0s/kubelet/pods/1427828c-0c03-419a-bdd2-8cd1d31c86af/volumes/kubernetes.io~csi/pvc-01560db4-164f-4e1d-b24d-361890a5ff84/mount device-path=/dev/disk/by-id/scsi-0HC_Volume_26750481 fs-type=ext4 block-volume=false readonly=false mount-options= encrypted=false

Even the kernel log shows that the filesystem was mounted:

[root@worker-i1ht ~]# journalctl -xe | grep sdb
Jan 16 15:38:29 worker-i1ht kernel: sd 0:0:0:1: [sdb] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
Jan 16 15:38:29 worker-i1ht kernel: sd 0:0:0:1: [sdb] Write Protect is off
Jan 16 15:38:29 worker-i1ht kernel: sd 0:0:0:1: [sdb] Mode Sense: 63 00 00 08
Jan 16 15:38:29 worker-i1ht kernel: sd 0:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 16 15:38:29 worker-i1ht kernel: sd 0:0:0:1: [sdb] Attached SCSI disk
Jan 16 15:38:34 worker-i1ht kernel: EXT4-fs (sdb): mounted filesystem with ordered data mode. Quota mode: none.

But on the host, it is not:

[root@worker-i1ht ~]# mount | grep ^/
/dev/sda1 on / type ext4 (rw,relatime,seclabel)
/dev/sda14 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro)
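
The same check can be scripted. The sketch below reads mount-table text on stdin and succeeds only if the given device is listed as a mount source; the name device_is_mounted is made up for illustration. It only sees the mount namespace the table was read from, so a mount that exists solely inside a container would not show up in the host's table.

```shell
# Succeed if the device given as $1 appears as a mount source (first field)
# in mount-table text read from stdin, e.g. /proc/self/mounts or `mount` output.
device_is_mounted() {
  awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }'
}

# Usage on the worker node (not run here):
#   device_is_mounted /dev/sdb < /proc/self/mounts && echo mounted || echo "not mounted"
```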

Maybe related: #343

When exec'ing into the pod, df reports that the server's root disk is mounted at /data. Writing to /data works, but prevents the pod from being terminated properly: the cleanup process tries to remove /var/lib/k0s/kubelet/pods/1427828c-0c03-419a-bdd2-8cd1d31c86af/volumes/kubernetes.io~csi/pvc-01560db4-164f-4e1d-b24d-361890a5ff84/mount but fails because the directory is not empty (it contains the written data).

I'm a bit at a loss about how to troubleshoot this further, or what needs to be fixed to get this working with k0s.
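
One thing that might be worth checking, offered as a guess rather than a confirmed diagnosis: the target-path in the driver log lies under /var/lib/k0s/kubelet, so the driver container presumably needs that host directory mounted at the same path for the mount to become visible on the host. The sketch below only compares path strings; the helper name path_is_under is hypothetical.

```shell
# Succeed if path $1 is equal to, or located below, directory $2.
# Pure string comparison: no symlink resolution, no filesystem access.
path_is_under() {
  case "$1/" in
    "${2%/}/"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: compare the target-path from the driver log with the mountPath of the
# kubelet-dir volume in the driver container (not run here):
#   path_is_under "$target_path" /var/lib/kubelet || echo "mountPath does not cover target-path"
```

If the container's mountPath does not cover the target-path, the driver would be mounting into a directory that exists only inside its own container.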

mhutter commented 1 year ago

Turns out I messed up some mount paths & containers. Now that all is fixed, it works! I added the required patches to the issue description.

apricote commented 1 year ago

Opened #369 for the helm chart.

Thanks @mhutter for providing the required patches for current users of k0s!