kubernetes-csi / csi-driver-iscsi


iscsi csi driver fails to mount LUN in the right location of a replaced pod #274

Open jmrr opened 4 months ago

jmrr commented 4 months ago

What happened:

We're using the Bitnami PostgreSQL Helm chart (15.1.4) to run a PostgreSQL instance on a microk8s v1.29 cluster. I wanted to leverage this CSI driver for the database's storage, using an iSCSI LUN and target that I created on a QNAP NAS connected over a 10GbE network.

To connect to the LUN, I created a PV + PVC as in the examples and set the PVC as the primary.persistence.existingClaim value when deploying the Helm chart.
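For reference, a minimal sketch of the claim and the chart value wiring; the PVC name postgresql-pvc is illustrative, while postgresql-pv and postgresql-sc come from the PV manifest shown further down:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgresql-pvc        # illustrative name
    spec:
      storageClassName: postgresql-sc
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      volumeName: postgresql-pv   # bind explicitly to the pre-created PV

And in the chart's values.yaml:

    primary:
      persistence:
        existingClaim: postgresql-pvc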

This was working like a charm; at last we could move away from risky node-local storage or slower NFS. However, when I replaced the pods of the chart's StatefulSet to increase its resources, the csi-iscsi-node somehow didn't mount the target at the new pod volume's location.

The outcome (and how we realised): the new volume location, /var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount, wasn't actually a mount of the storage on the NAS, but the node's root filesystem itself! A parallel data-ingestion operation consumed the node's storage, degrading the node and, to some extent, the whole cluster, as many key workloads were evicted with [DiskPressure] and a taint was added to the node.
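A quick way to confirm what a pod volume path is actually backed by, from the node:

    # findmnt prints the filesystem containing the path; if the SOURCE column
    # shows the node's root device rather than the iSCSI block device (e.g.
    # /dev/sdX), writes are landing on the node's own disk.
    findmnt --target /var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount

    # Or, for a yes/no answer:
    mountpoint /var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount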

Logs that we encountered:

    I0530 23:43:45.773799       1 utils.go:64] GRPC request: {"target_path":"/var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount","volume_id":"iscsi-postgresql-id"}
    I0530 23:43:45.773861       1 mount_linux.go:164] Detected OS without systemd
    W0530 23:43:45.777225       1 iscsi_util.go:95] warning: Unmount skipped because path does not exist: /var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount

The "Detected OS without systemd" message is equally puzzling, as we're running Ubuntu 22.04 :thinking: ...
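If I'm reading the mount-utils code correctly, that detection probes for systemd-run inside the plugin's own environment, so a containerized node plugin whose image lacks systemd-run will log "Detected OS without systemd" even when the host runs systemd. A quick check (the DaemonSet and container names below are assumptions; adjust to your deployment):

    # Does the node plugin container have systemd-run available?
    kubectl -n kube-system exec ds/csi-iscsi-node -c iscsi -- which systemd-run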

What you expected to happen:

Say the original pod volume location was:

/var/snap/microk8s/common/var/lib/kubelet/pods/9cd76fee-cd41-4869-90d2-d46ffedddf68/volumes/kubernetes.io~csi/postgresql/mount -> This was actually the mount point of the filesystem used by the iscsi target.

And the new pod volume location was:

/var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount

I would expect the iSCSI CSI node driver to unmount the target from the first location and re-mount it at the second location, corresponding to the replacement pod, with no data loss.
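After the pod is replaced, this can be verified from the node; a minimal sketch:

    # The session to the target should still (or again) be logged in:
    sudo iscsiadm -m session

    # And the LUN's block device (TRAN column = iscsi) should list the new pod
    # volume path as its mountpoint:
    lsblk -o NAME,TRAN,MOUNTPOINT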

How to reproduce it:

  1. Create an iSCSI target + LUN on the NAS.
  2. Create a PV + PVC as in the driver's examples, e.g. the PersistentVolume manifest:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: postgresql-pv
      labels:
        name: postgresql
    spec:
      storageClassName: postgresql-sc
      accessModes:
        - ReadWriteOnce
      capacity:
        storage: 1Gi
      csi:
        driver: iscsi.csi.k8s.io
        volumeHandle: iscsi-postgresql-id
        volumeAttributes:
          targetPortal: "X.X.X.X"
          portals: "[]"
          iqn: "iqn.<redacted>:iscsi.csi.8136ad"
          lun: "1"
          iscsiInterface: "default"
          discoveryCHAPAuth: "true"
          sessionCHAPAuth: "false"
  3. Customise and deploy the Bitnami PostgreSQL Helm chart, selecting the existing claim created in step 2 in values.yaml.
  4. Scale the StatefulSet to 0 and then back to 1, or delete the pod, which will cause the StatefulSet controller to request a replacement pod via the kube API (a minimal command sequence follows this list).
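A sketch of steps 3 and 4 as commands; the release, claim, and pod names are illustrative:

    # Step 3: deploy the chart against the pre-created claim
    helm install postgresql oci://registry-1.docker.io/bitnamicharts/postgresql \
      --set primary.persistence.existingClaim=postgresql-pvc

    # Step 4: force the pod to be replaced
    kubectl scale statefulset postgresql --replicas=0
    kubectl scale statefulset postgresql --replicas=1
    # or simply delete the pod:
    kubectl delete pod postgresql-0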

Anything else we need to know?:

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 day ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten