TheDevilOnLine opened 3 months ago
This sounds a lot like https://github.com/longhorn/longhorn/issues/7301. I'm checking the support bundle for overridden nfsOptions used in mounting. I don't recall the typo being built into a release, but if it is, a workaround might be to specify custom options, or possibly to upgrade.
The default longhorn storageclass, with no overridden nfsOptions parameter, is the only one in use, so the mount options are the code defaults.
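For reference, an override would look roughly like the sketch below. nfsOptions is the documented Longhorn StorageClass parameter for this; the class name and the exact option string here are illustrative assumptions, not values taken from the bundle:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-custom-nfs        # hypothetical name; this cluster uses the default "longhorn" class
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  # Overriding nfsOptions replaces the code defaults entirely, sidestepping the
  # truncated "sof" default in v1.5.2. The string below mirrors the options seen
  # in the mount log, with the typo corrected.
  nfsOptions: "vers=4.1,noresvport,soft,timeo=600,retrans=5"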
Looking at PR https://github.com/longhorn/longhorn-manager/pull/2293/files, there was a period when the typo was in the master branch. It was released in v1.5.2, but even in that release it should retry with "soft" on failure.
The typo was corrected in v1.5.3, with a default of "softerr" and a fallback to "soft" if that fails.
Looking at the CSI logs, the retry with "soft" is successful. First, the error:
2024-02-01T03:00:02.693703334+01:00 time="2024-02-01T02:00:02Z" level=warning msg="Failed to mount volume pvc-c777ad17-267c-47f2-9590-89ef39278d8e with default mount options, retrying with soft mount" func="csi.(*NodeServer).nodeStageSharedVolume" file="node_server.go:249" component=csi-node-server error="mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o vers=4.1,noresvport,timeo=600,retrans=5,sof 10.43.147.182:/pvc-c777ad17-267c-47f2-9590-89ef39278d8e /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/f2e56ac1fe3f80d41386c8918a9a60c805ee90be7a1689579f716cdba08735c8/globalmount\nOutput: mount.nfs: an incorrect mount option was specified\n"
And then the successful retry:
2024-02-01T03:00:02.737648592+01:00 time="2024-02-01T02:00:02Z" level=info msg="Mounted shared volume pvc-c777ad17-267c-47f2-9590-89ef39278d8e on node k3s-103 via share endpoint nfs://10.43.147.182/pvc-c777ad17-267c-47f2-9590-89ef39278d8e" func="csi.(*NodeServer).NodeStageVolume" file="node_server.go:401" component=csi-node-server function=NodeStageVolume
The CSI plugin then goes on to publish the volume and verify the mount point:
2024-02-01T03:00:02.737662412+01:00 time="2024-02-01T02:00:02Z" level=info msg="NodeStageVolume: rsp: {}" func=csi.logGRPC file="server.go:141"
2024-02-01T03:00:02.743293995+01:00 time="2024-02-01T02:00:02Z" level=info msg="NodePublishVolume: req: {\"staging_target_path\":\"/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/f2e56ac1fe3f80d41386c8918a9a60c805ee90be7a1689579f716cdba08735c8/globalmount\",\"target_path\":\"/var/lib/kubelet/pods/dba7f05d-f36d-425b-ae6d-31947e85cc87/volumes/kubernetes.io~csi/pvc-c777ad17-267c-47f2-9590-89ef39278d8e/mount\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":5}},\"volume_context\":{\"csi.storage.k8s.io/ephemeral\":\"false\",\"csi.storage.k8s.io/pod.name\":\"replay-28445880-shl5z\",\"csi.storage.k8s.io/pod.namespace\":\"onkp\",\"csi.storage.k8s.io/pod.uid\":\"dba7f05d-f36d-425b-ae6d-31947e85cc87\",\"csi.storage.k8s.io/serviceAccount.name\":\"default\",\"dataLocality\":\"disabled\",\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"share\":\"true\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1699442929278-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-c777ad17-267c-47f2-9590-89ef39278d8e\"}" func=csi.logGRPC file="server.go:132"
2024-02-01T03:00:02.747870728+01:00 time="2024-02-01T02:00:02Z" level=info msg="Trying to ensure mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/f2e56ac1fe3f80d41386c8918a9a60c805ee90be7a1689579f716cdba08735c8/globalmount" func=csi.ensureMountPoint file="util.go:288"
2024-02-01T03:00:02.747926483+01:00 time="2024-02-01T02:00:02Z" level=info msg="Mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/f2e56ac1fe3f80d41386c8918a9a60c805ee90be7a1689579f716cdba08735c8/globalmount try opening and syncing dir to make sure it's healthy" func=csi.ensureMountPoint file="util.go:296"
2024-02-01T03:00:02.842909102+01:00 time="2024-02-01T02:00:02Z" level=info msg="Trying to ensure mount point /var/lib/kubelet/pods/dba7f05d-f36d-425b-ae6d-31947e85cc87/volumes/kubernetes.io~csi/pvc-c777ad17-267c-47f2-9590-89ef39278d8e/mount" func=csi.ensureMountPoint file="util.go:288"
2024-02-01T03:00:02.890631276+01:00 time="2024-02-01T02:00:02Z" level=info msg="NodePublishVolume: rsp: {}" func=csi.logGRPC file="server.go:141"
And the volume stays mounted until it is unpublished hours later:
2024-02-01T08:24:59.777344266+01:00 time="2024-02-01T07:24:59Z" level=info msg="NodeUnpublishVolume: req: {\"target_path\":\"/var/lib/kubelet/pods/dba7f05d-f36d-425b-ae6d-31947e85cc87/volumes/kubernetes.io~csi/pvc-c777ad17-267c-47f2-9590-89ef39278d8e/mount\",\"volume_id\":\"pvc-c777ad17-267c-47f2-9590-89ef39278d8e\"}" func=csi.logGRPC file="server.go:132"
Describe the bug
The PVC fails to mount, with a mount error in the pod's events. This is caused by the NFS "soft" option being cut short to "sof".
To Reproduce
Install the latest version of Longhorn and mount an RWX PVC.
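A minimal RWX claim against the default class is enough to hit the NFS mount path; the name and size below are hypothetical:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rwx                   # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                # RWX is what triggers the share-manager NFS mount
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi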
Expected behavior
Have the "soft" option set correctly and mount the PVC.
Support bundle for troubleshooting
supportbundle_f1466c3d-47f3-4e9a-8b6e-f014a12bb1ca_2024-08-27T07-41-49Z.zip
Environment
Longhorn version: 1.5.2
Impacted volume (PV): img-rwm
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s
Node config
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
Number of Longhorn volumes in the cluster: 13