TheDevilOnLine opened 3 months ago
This sounds a lot like https://github.com/longhorn/longhorn/issues/7301. I'm checking the support bundle for overridden nfsOptions used in mounting. I don't recall the typo being built into a release, but if it is, a workaround might be to specify custom options, or possibly to upgrade.
The default longhorn storageclass, with no overridden nfsOptions parameter, is the only one in use, so the mount options are the code defaults.
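For reference, an override would look roughly like the sketch below. nfsOptions is the documented Longhorn StorageClass parameter for this; the class name and the exact option string here are illustrative assumptions, not values taken from the bundle:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-custom-nfs        # hypothetical name; this cluster uses the default "longhorn" class
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  # Overriding nfsOptions replaces the code defaults entirely, sidestepping the
  # truncated "sof" default in v1.5.2. The string below mirrors the options seen
  # in the mount log, with the typo corrected.
  nfsOptions: "vers=4.1,noresvport,soft,timeo=600,retrans=5"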
Looking at PR https://github.com/longhorn/longhorn-manager/pull/2293/files, there was a period when the typo was in the master branch. It was released in v1.5.2, but even in that release it should retry with "soft" on failure.
The typo was corrected in v1.5.3, with a default of "softerr" and a fallback to "soft" if that fails.
Looking at the CSI logs, the retry with "soft" is successful. First, the error:
2024-02-01T03:00:02.693703334+01:00 time="2024-02-01T02:00:02Z" level=warning msg="Failed to mount volume pvc-c777ad17-267c-47f2-9590-89ef39278d8e with default mount options, retrying with soft mount" func="csi.(*NodeServer).nodeStageSharedVolume" file="node_server.go:249" component=csi-node-server error="mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,sof 10.43.147.182:/pvc-c777ad17-267c-47f2-9590-89ef39278d8e /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/f2e56ac1fe3f80d41386c8918a9a60c805ee90be7a1689579f716cdba08735c8/globalmount\nOutput: mount.nfs: an incorrect mount option was specified\n"
And then the successful retry:
2024-02-01T03:00:02.737648592+01:00 time="2024-02-01T02:00:02Z" level=info msg="Mounted shared volume pvc-c777ad17-267c-47f2-9590-89ef39278d8e on node k3s-103 via share endpoint nfs://10.43.147.182/pvc-c777ad17-267c-47f2-9590-89ef39278d8e" func="csi.(*NodeServer).NodeStageVolume" file="node_server.go:401" component=csi-node-server function=NodeStageVolume
The CSI plugin then goes on to publish the volume and verify the mount point:
2024-02-01T03:00:02.737662412+01:00 time="2024-02-01T02:00:02Z" level=info msg="NodeStageVolume: rsp: {}" func=csi.logGRPC file="server.go:141"
2024-02-01T03:00:02.743293995+01:00 time="2024-02-01T02:00:02Z" level=info msg="NodePublishVolume: req: {\"staging_target_path\":\"/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/f2e56ac1fe3f80d41386c8918a9a60c805ee90be7a1689579f716cdba08735c8/globalmount\",\"target_path\":\"/var/lib/kubelet/pods/dba7f05d-f36d-425b-ae6d-31947e85cc87/volumes/kubernetes.io~csi/pvc-c777ad17-267c-47f2-9590-89ef39278d8e/mount\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":5}},\"volume_context\":{\"csi.storage.k8s.io/ephemeral\":\"false\",\"csi.storage.k8s.io/pod.name\":\"replay-28445880-shl5z\",\"csi.storage.k8s.io/pod.namespace\":\"onkp\",\"csi.storage.k8s.io/pod.uid\":\"dba7f05d-f36d-425b-ae6d-31947e85cc87\",\"csi.storage.k8s.io/serviceAccount.name\":\"default\",\"dataLocality\":\"disabled\",\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"share\":\"true\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1699442929278-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-c777ad17-267c-47f2-9590-89ef39278d8e\"}" func=csi.logGRPC file="server.go:132"
2024-02-01T03:00:02.747870728+01:00 time="2024-02-01T02:00:02Z" level=info msg="Trying to ensure mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/f2e56ac1fe3f80d41386c8918a9a60c805ee90be7a1689579f716cdba08735c8/globalmount" func=csi.ensureMountPoint file="util.go:288"
2024-02-01T03:00:02.747926483+01:00 time="2024-02-01T02:00:02Z" level=info msg="Mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/f2e56ac1fe3f80d41386c8918a9a60c805ee90be7a1689579f716cdba08735c8/globalmount try opening and syncing dir to make sure it's healthy" func=csi.ensureMountPoint file="util.go:296"
2024-02-01T03:00:02.842909102+01:00 time="2024-02-01T02:00:02Z" level=info msg="Trying to ensure mount point /var/lib/kubelet/pods/dba7f05d-f36d-425b-ae6d-31947e85cc87/volumes/kubernetes.io~csi/pvc-c777ad17-267c-47f2-9590-89ef39278d8e/mount" func=csi.ensureMountPoint file="util.go:288"
2024-02-01T03:00:02.890631276+01:00 time="2024-02-01T02:00:02Z" level=info msg="NodePublishVolume: rsp: {}" func=csi.logGRPC file="server.go:141"
And the volume stays mounted until it is unpublished hours later:
2024-02-01T08:24:59.777344266+01:00 time="2024-02-01T07:24:59Z" level=info msg="NodeUnpublishVolume: req: {\"target_path\":\"/var/lib/kubelet/pods/dba7f05d-f36d-425b-ae6d-31947e85cc87/volumes/kubernetes.io~csi/pvc-c777ad17-267c-47f2-9590-89ef39278d8e/mount\",\"volume_id\":\"pvc-c777ad17-267c-47f2-9590-89ef39278d8e\"}" func=csi.logGRPC file="server.go:132"
Describe the bug
The PVC fails to mount, with a mount error in the pod's events. This is caused by the NFS "soft" option being cut short to "sof".
To Reproduce
Install the latest version of Longhorn and mount an RWX PVC.
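A minimal RWX claim against the default class is enough to hit the NFS mount path; the name and size below are hypothetical:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rwx                   # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                # RWX is what triggers the share-manager NFS mount
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi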
Expected behavior
Have the "soft" option set correctly and mount the PVC.
Support bundle for troubleshooting
supportbundle_f1466c3d-47f3-4e9a-8b6e-f014a12bb1ca_2024-08-27T07-41-49Z.zip
Environment
Longhorn version: 1.5.2
Impacted volume (PV): img-rwm
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s
Node config
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
Number of Longhorn volumes in the cluster: 13