kubernetes-csi / csi-driver-iscsi


iscsi csi driver fails to mount LUN in the right location of a replaced pod #274

Open jmrr opened 4 months ago

jmrr commented 4 months ago

What happened:

We're using the Bitnami PostgreSQL Helm chart (15.1.4) to run a PostgreSQL instance on a microk8s v1.29 cluster. I wanted to leverage this CSI driver for the database's storage, using an iSCSI LUN and target that I created on a QNAP NAS connected over a 10GbE network.

To connect to the LUN, I created a PV + PVC as in the examples and set the PVC as the primary.persistence.existingClaim value when deploying the Helm chart.
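For reference, a minimal sketch of the claim and the chart value wiring; the PVC name postgresql-pvc is illustrative, while postgresql-pv and postgresql-sc come from the PV manifest shown further down:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgresql-pvc        # illustrative name
    spec:
      storageClassName: postgresql-sc
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      volumeName: postgresql-pv   # bind explicitly to the pre-created PV

And in the chart's values.yaml:

    primary:
      persistence:
        existingClaim: postgresql-pvc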

This was working like a charm; at last we could move away from risky node-local storage or slower NFS. However, when I replaced the pods of the chart's StatefulSet to increase its resources, the csi-iscsi-node somehow didn't mount the target at the new pod volume's location.

The outcome (and how we realised): the new volume location, /var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount, wasn't actually a mount of the storage on the NAS, but the node's root filesystem itself! A parallel data-ingestion operation consumed the node's storage, degrading the node and, to some extent, the whole cluster, as many key workloads were evicted with [DiskPressure] and a taint was added to the node.
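A quick way to confirm what a pod volume path is actually backed by, from the node:

    # findmnt prints the filesystem containing the path; if the SOURCE column
    # shows the node's root device rather than the iSCSI block device (e.g.
    # /dev/sdX), writes are landing on the node's own disk.
    findmnt --target /var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount

    # Or, for a yes/no answer:
    mountpoint /var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount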

Logs that we encountered:

    I0530 23:43:45.773799       1 utils.go:64] GRPC request: {"target_path":"/var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount","volume_id":"iscsi-postgresql-id"}
    I0530 23:43:45.773861       1 mount_linux.go:164] Detected OS without systemd
    W0530 23:43:45.777225       1 iscsi_util.go:95] warning: Unmount skipped because path does not exist: /var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount

The "Detected OS without systemd" message is equally puzzling, as we're running Ubuntu 22.04 :thinking: ...
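If I'm reading the mount-utils code correctly, that detection probes for systemd-run inside the plugin's own environment, so a containerized node plugin whose image lacks systemd-run will log "Detected OS without systemd" even when the host runs systemd. A quick check (the DaemonSet and container names below are assumptions; adjust to your deployment):

    # Does the node plugin container have systemd-run available?
    kubectl -n kube-system exec ds/csi-iscsi-node -c iscsi -- which systemd-run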

What you expected to happen:

Say the original pod volume location was:

/var/snap/microk8s/common/var/lib/kubelet/pods/9cd76fee-cd41-4869-90d2-d46ffedddf68/volumes/kubernetes.io~csi/postgresql/mount -> This was actually the mount point of the filesystem used by the iscsi target.

And the new pod volume location was:

/var/snap/microk8s/common/var/lib/kubelet/pods/b88fdaea-a22e-42ac-90ae-d71f927dc300/volumes/kubernetes.io~csi/postgresql/mount

I would expect the iSCSI CSI node driver to unmount the target from the first location and re-mount it at the second location, corresponding to the replacement pod, with no data loss.
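After the pod is replaced, this can be verified from the node; a minimal sketch:

    # The session to the target should still (or again) be logged in:
    sudo iscsiadm -m session

    # And the LUN's block device (TRAN column = iscsi) should list the new pod
    # volume path as its mountpoint:
    lsblk -o NAME,TRAN,MOUNTPOINT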

How to reproduce it:

  1. Create an iSCSI target + LUN on the NAS.
  2. Create a PV + PVC as in the driver's examples, e.g. the PersistentVolume manifest:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: postgresql-pv
      labels:
        name: postgresql
    spec:
      storageClassName: postgresql-sc
      accessModes:
        - ReadWriteOnce
      capacity:
        storage: 1Gi
      csi:
        driver: iscsi.csi.k8s.io
        volumeHandle: iscsi-postgresql-id
        volumeAttributes:
          targetPortal: "X.X.X.X"
          portals: "[]"
          iqn: "iqn.<redacted>:iscsi.csi.8136ad"
          lun: "1"
          iscsiInterface: "default"
          discoveryCHAPAuth: "true"
          sessionCHAPAuth: "false"
  3. Customise and deploy the Bitnami PostgreSQL Helm chart, selecting the existing claim created in step 2 in values.yaml.
  4. Scale the StatefulSet to 0 and then back to 1, or delete the pod, which will cause the StatefulSet controller to request a replacement pod via the kube API (a minimal command sequence follows this list).
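A sketch of steps 3 and 4 as commands; the release, claim, and pod names are illustrative:

    # Step 3: deploy the chart against the pre-created claim
    helm install postgresql oci://registry-1.docker.io/bitnamicharts/postgresql \
      --set primary.persistence.existingClaim=postgresql-pvc

    # Step 4: force the pod to be replaced
    kubectl scale statefulset postgresql --replicas=0
    kubectl scale statefulset postgresql --replicas=1
    # or simply delete the pod:
    kubectl delete pod postgresql-0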

Anything else we need to know?:

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 day ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten