hpe-storage / csi-driver

A Container Storage Interface (CSI) driver from HPE
https://scod.hpedev.io
Apache License 2.0

Issues on node reboot - orphaned pods / unbound PVCs / Stale file handles #355

Closed nin0-0 closed 4 months ago

nin0-0 commented 10 months ago

We're running hpe-csi driver on openshift 4.12

Aug 31 08:51:06 bl9258 kubenswrapper[7086]: E0831 08:51:06.746462    7086 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"87bf952a-94d5-4c70-8ee3-67d2792947b7\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=3
Aug 31 10:13:44 bl9261 kubenswrapper[7619]: E0831 10:13:44.307781    7619 pod_workers.go:965] "Error syncing pod, skipping" err="unmounted volumes=[toolsvm], unattached volumes=[toolsvm podinfo kube-api-access-sgjhk]: error processing PVC mavenirvsreprod-tool-cnf/toolsvm-pvc: PVC is not bound" pod="mavenirvsreprod-tool-cnf/toolsvm-0" podUID=ad20e07c-df37-4141-899b-b47c5a8ffd5d

And last time I even got stale file handles from the NFS server; in the end I had to reboot again.

time="2023-08-30T17:59:02Z" level=error msg="GRPC error: rpc error: code = Internal desc = unknown error when stat on /var/lib/kubelet/pods/81922b78-9a40-4083-8f41-a588042a7c97/volumes/kubernetes.io~csi/pvc-8644abf2-460d-4e5c-ab7e-33ea1e9dfc28/mount: lstat /var/lib/kubelet/pods/81922b78-9a40-4083-8f41-a588042a7c97/volumes/kubernetes.io~csi/pvc-8644abf2-460d-4e5c-ab7e-33ea1e9dfc28/mount: stale NFS file handle" file="utils.go:73"

I'm not sure which additional logs to provide; these are the kubelet logs from the nodes. It's really paralysing when the nodes don't come up after a cluster reboot.

Is there an easy way to see which pod in the hpe-csi-driver namespace belongs to which node?

datamattsson commented 10 months ago

We're running hpe-csi driver on openshift 4.12

Aug 31 08:51:06 bl9258 kubenswrapper[7086]: E0831 08:51:06.746462    7086 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"87bf952a-94d5-4c70-8ee3-67d2792947b7\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=3

I haven't seen this in a long time. With these errors, the "directory" the kubelet is trying to remove is usually actually a file. I don't know the root cause, but you can safely remove the file and the errors stop.
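
For reference, a minimal cleanup sketch under that assumption (the pod UID and node name are taken from the log above; the stray entry name is a placeholder):

# open a debug shell on the affected node; the host filesystem is under /host
oc debug node/bl9258
chroot /host
# inspect the orphaned pod's volumes path from the error message
ls -la /var/lib/kubelet/pods/87bf952a-94d5-4c70-8ee3-67d2792947b7/volumes
# if an entry that should be a directory is actually a plain file, remove it
rm /var/lib/kubelet/pods/87bf952a-94d5-4c70-8ee3-67d2792947b7/volumes/<stray-file>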

Aug 31 10:13:44 bl9261 kubenswrapper[7619]: E0831 10:13:44.307781    7619 pod_workers.go:965] "Error syncing pod, skipping" err="unmounted volumes=[toolsvm], unattached volumes=[toolsvm podinfo kube-api-access-sgjhk]: error processing PVC mavenirvsreprod-tool-cnf/toolsvm-pvc: PVC is not bound" pod="mavenirvsreprod-tool-cnf/toolsvm-0" podUID=ad20e07c-df37-4141-899b-b47c5a8ffd5d

I have no idea what would cause a PVC to become unbound if it once was bound. Was the PVC ever bound?

And last time I even got stale file handles from the NFS server; in the end I had to reboot again.

time="2023-08-30T17:59:02Z" level=error msg="GRPC error: rpc error: code = Internal desc = unknown error when stat on /var/lib/kubelet/pods/81922b78-9a40-4083-8f41-a588042a7c97/volumes/kubernetes.io~csi/pvc-8644abf2-460d-4e5c-ab7e-33ea1e9dfc28/mount: lstat /var/lib/kubelet/pods/81922b78-9a40-4083-8f41-a588042a7c97/volumes/kubernetes.io~csi/pvc-8644abf2-460d-4e5c-ab7e-33ea1e9dfc28/mount: stale NFS file handle" file="utils.go:73"

This means the NFS client has a file handle that the NFS server no longer recognizes, or the server is unavailable.
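
If a node ends up in that state, one possible manual cleanup sketch (the path is taken from the log above; only do this once the pod no longer needs the mount):

# lazily force-unmount the stale NFS mount so the kubelet can clean up the pod directory
umount -f -l /var/lib/kubelet/pods/81922b78-9a40-4083-8f41-a588042a7c97/volumes/kubernetes.io~csi/pvc-8644abf2-460d-4e5c-ab7e-33ea1e9dfc28/mount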

I'm not sure which additional logs to provide; these are the kubelet logs from the nodes. It's really paralysing when the nodes don't come up after a cluster reboot.

Was this a planned or unplanned reboot? It's safest to drain nodes before rebooting.

Is there an easy way to see which pod in the hpe-csi-driver namespace belongs to which node?

oc get pods -o wide will show you which pod runs on which node. oc get volumeattachments will show you which PV is attached to which node.
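
For reference, a minimal sketch of those lookups (the hpe-csi-driver namespace is taken from the question above; <name> is a placeholder):

oc get pods -n hpe-csi-driver -o wide   # node placement of the CSI driver pods
oc get volumeattachments                # which PV is attached to which node
oc describe volumeattachment <name>     # attach/detach errors for one PV/node pair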

nin0-0 commented 10 months ago

We actually had a storage issue where the I/O stack crashed during an IOM replacement on the Alletra; I think that explains the unbound PVC.

We only reboot the cluster when applying MachineConfigs or SR-IOV config (NetworkAttachmentDefinitions, I believe); both trigger a cluster reboot, and each node is always drained before it is rebooted.

Today our vendor reported the following (beware, I did online OCR, some chars are missing)

2m43s   Warning  FailedAttachVolume  pod/config-service-0   AttachVolume.Attach failed for volume "pvc-fa467d34-0c42-4eca-9a9f-ccb0e1e8dbf2" : rpc error: code = Aborted desc = There is already an operation pending for the specified id ControllerPublishVolume:pvc-fa467d34-0c42-4eca-9a9f-ccb0e1e8dbf2:41c95b34-ba70-1ca5-cb74-1372c1810f05
2m43s   Warning  FailedAttachVolume  pod/etcd-0              AttachVolume.Attach failed for volume "pvc-55755dcf-3fbc-403e-ab42-dedb253ec4fc" : rpc error: code = Aborted desc = There is already an operation pending for the specified id ControllerPublishVolume:pvc-55755dcf-3fbc-403e-ab42-dedb253ec4fc:935d51ac-9e57-ca54-bc5d-656973f8d529
106s    Warning  FailedMount         pod/etcd-0              Unable to attach or mount volumes: unmounted volumes=[etcd-vol], unattached volumes=[etcd-vol etcd-config]: timed out waiting for the condition
103s    Warning  FailedMount         pod/config-service-0    Unable to attach or mount volumes: unmounted volumes=[config-service-vol], unattached volumes=[config-service-mgmt-cfg ramdisk-vol cim-config static-config kube-api-access-56cfp mnt-oam-sidecar config-service-vol]: timed out waiting for the condition
41s     Warning  FailedAttachVolume  pod/etcd-0              AttachVolume.Attach failed for volume "pvc-55755dcf-3fbc-403e-ab42-dedb253ec4fc" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
41s     Warning  FailedAttachVolume  pod/config-service-0    AttachVolume.Attach failed for volume "pvc-fa467d34-0c42-4eca-9a9f-ccb0e1e8dbf2" : rpc error: code = Internal desc = Failed to add ACL to volume pvc-fa467d34-0c42-4eca-9a9f-ccb0e1e8dbf2 for node &{ bl9296 41c95b34-ba70-1ca5-cb74-1372c1810f05 [0xc000716a80] [0xc000716ac0 0xc000716ad0 0xc000716ae0] [0xc000716b00 0xc000716b10] } via CSP, err: Put "http://

This happened when deploying a new service; they had to try multiple times before it finally worked.

That leaves the stale file handles from the NFS server, which I've encountered multiple times. I've also noticed that sometimes the operator fails to register with the Alletra. I'll need to back this up with logs; if you have any specific commands to filter for this, feel free to share them.

Thanks for your support !

datamattsson commented 10 months ago

41s Warning FailedAttachVolume pod/config-service-0 AttachVolume.Attach failed for volume "pvc-fa467d34-0c42-4eca-9a9f-ccb0e1e8dbf2" : rpc error: code = Internal desc = Failed to add ACL to volume pvc-fa467d34-0c42-4eca-9a9f-ccb0e1e8dbf2 for node &{ bl9296 41c95b34-ba70-1ca5-cb74-1372c1810f05 [0xc000716a80] [0xc000716ac0 0xc000716ad0 0xc000716ae0] [0xc000716b00 0xc000716b10] } via CSP, err: Put "http://

This needs to be investigated further. It could be a communications error between the CSI driver and CSP.

Is the array under a support contract? Can you work with your HPE rep to file a support case so we can prioritize this?

nin0-0 commented 10 months ago

case opened 5376115048

datamattsson commented 10 months ago

@nin0-0 thanks, I've seen it and we have support engaged.

nin0-0 commented 10 months ago

timed out waiting for external-attacher of csi.hpe.com CSI driver to attach volume pvc-7b71248b-3627-4e60-8b5a-b334419c9191 Volume is already exclusively attached to one node and can't be attached to another

FYI, these are our two outstanding issues.

Eventually the attachments work, but it takes up to 15 minutes.

datamattsson commented 10 months ago

Volume is already exclusively attached to one node and can't be attached to another

The RWO volumes can only be attached to one node at a time and don't support rolling upgrades.
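
For applications that keep their RWO PVC across redeployments, one common workaround is to avoid rolling updates for that workload; a minimal sketch, with a hypothetical deployment name:

# switch the Deployment to the Recreate strategy so the old replica releases its RWO
# volume before the new replica tries to attach it, possibly on another node
oc patch deployment/my-app --type merge -p '{"spec":{"strategy":{"rollingUpdate":null,"type":"Recreate"}}}'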

nin0-0 commented 9 months ago

The vendor is deleting the full deployment (helm uninstall) and recreating it, so no rolling update is being applied.

We're seeing the volumes getting attached eventually, but it takes a long time (up to 15 minutes) - is this expected behavior?

nin0-0 commented 9 months ago

We have now added the hostSeesVLUN parameter for RWO; do we also need it for RWX?

And on cluster restart, we're still seeing, e.g.:

time="2023-09-27T10:44:51Z" level=error msg="GRPC error: rpc error: code = Internal desc = unknown error when stat on /var/lib/kubelet/pods/eaf40870-019e-48a3-92b3-081f423f139d/volumes/kubernetes.io~csi/pvc-f616a530-93d3-4320-86c9-b7d219f90442/mount: lstat /var/lib/kubelet/pods/eaf40870-019e-48a3-92b3-081f423f139d/volumes/kubernetes.io~csi/pvc-f616a530-93d3-4320-86c9-b7d219f90442/mount: stale NFS file handle" file="utils.go:73"

datamattsson commented 9 months ago

hostSeesVLUN is set on the StorageClass that spawns both RWX and RWO volumes, but it only really applies to the RWO volume.

The vendor is deleting the full deployment (helm uninstall) and recreating it, so no rolling update is being applied.

I meant applications leveraging PVCs.

And on cluster restart, we're still seeing

Is this on existing or new NFS servers?

nin0-0 commented 9 months ago

We're still seeing stale file handle and orphaned pod errors on cluster reboots. And when redeploying a resource a couple of times (without removing the PVCs), we get a Multi-Attach error.

hostSeesVLUN is set on the StorageClass that spawns both RWX and RWO volumes, but it only really applies to the RWO volume.

Do you mean I also have to set it on my SC for RWX?

The RWO volumes can only be attached to one node at a time and don't support rolling upgrades. I meant applications leveraging PVCs.

I have seen that it takes a while for larger volumes to get created and bound. Is this maybe why we're having issues when rebooting/recreating?

Is this on existing or new NFS servers?

We only use the integrated NFS servers, but I guess there's only one NFS server on a given node at any point. Could that be why things seemingly start failing during a cluster reboot when one particular host is being drained? I thought this was mitigated by the Pod Monitor?

nin0-0 commented 9 months ago

Oct 11 17:15:06 bl9278 kubenswrapper[6996]: W1011 17:15:06.831079 6996 empty_dir.go:523] Warning: Failed to clear quota on /var/lib/kubelet/pods/ec92b1bf-fd1c-4b46-99e2-aab4cbbdb942/volumes/kubernetes.io~configmap/config-volume: clearQuota called, but quotas disabled

Is this a relevant error, or can we ignore it?

nin0-0 commented 9 months ago

Do we need to configure this label for each Deployment/StatefulSet that uses RWX / the NFS server?


  template:
    metadata:
      labels:
        monitored-by: hpe-csi
        app: my-app
datamattsson commented 9 months ago

I'm on vacation and unable to reply in the coming weeks. The Pod Monitor is mainly for apps using RWO PVCs, not NFS/RWX PVCs.


dileepds commented 9 months ago

We're still seeing stale file handle and orphaned pod errors on cluster reboots. And when redeploying a resource a couple of times (without removing the PVCs), we get a Multi-Attach error.

hostSeesVLUN is set on the StorageClass that spawns both RWX and RWO volumes, but it only really applies to the RWO volume.

Do you mean I also have to set it on my SC for RWX?

Yes, you can set it on the SC for RWX PVCs if they are using storage from the Alletra.
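
For reference, a sketch of what that could look like (the class name and accessProtocol are taken from this thread, the parameter spelling follows the SCOD docs, and the usual secret references are omitted; StorageClass parameters are immutable, so the class would have to be deleted and recreated, which only affects new volumes unless existing ones are mutated as discussed further below):

oc apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hpe-nfs
provisioner: csi.hpe.com
parameters:
  accessProtocol: fc
  nfsResources: "true"    # RWX PVCs are served by an NFS server backed by an RWO volume
  hostSeesVLUN: "true"    # applies to that backing RWO volume
reclaimPolicy: Delete
allowVolumeExpansion: true
EOF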

The RWO volumes can only be attached to one node at a time and don't support rolling upgrades. I meant applications leveraging PVCs.

I have seen that it takes a while for larger volumes to get created and bound. Is this maybe why we're having issues when rebooting/recreating?

Yes, it's possible.

Is this on existing or new NFS servers?

We only use the integrated NFS servers, but I guess there's only one NFS server on a given node at any point. Could that be why things seemingly start failing during a cluster reboot when one particular host is being drained? I thought this was mitigated by the Pod Monitor?

By default, if there are no explicit labels, the HPE CSI driver considers all available worker nodes when spinning up the NFS server pod: https://scod.hpedev.io/csi_driver/using.html#using_the_nfs_server_provisioner
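
If you want to pin the NFS servers to specific nodes, the page above describes labeling nodes; a sketch only, since the exact label key should be verified against those docs (bl9279 is just a node from this thread):

oc label node/bl9279 csi.hpe.com/hpe-nfs=true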

dileepds commented 9 months ago

Oct 11 17:15:06 bl9278 kubenswrapper[6996]: W1011 17:15:06.831079 6996 empty_dir.go:523] Warning: Failed to clear quota on /var/lib/kubelet/pods/ec92b1bf-fd1c-4b46-99e2-aab4cbbdb942/volumes/kubernetes.io~configmap/config-volume: clearQuota called, but quotas disabled

Is this a relevant error, or can we ignore it?

I think this error is not related to PVCs; it looks like an error about ConfigMaps.

nin0-0 commented 9 months ago
    The RWO volumes can only be attached to one node at a time and don't support rolling upgrades.
    I meant applications leveraging PVCs.

I have seen that it takes a while for larger volumes to get created and bound. Is this maybe why we're having issues when rebooting/recreating?

Yes, it's possible.

I've added hostSeesVLUN to the NFS SC now; I'll report back.

nin0-0 commented 8 months ago

@dileepds On cluster reboot we're now seeing:

Warning FailedMount 10m kubelet MountVolume.SetUp failed for volume "pvc-8fa08040-6717-4109-aff1-adb1da95902e" : rpc error: code = Internal desc = Error mounting nfs share 172.30.18.86:/export at /var/lib/kubelet/pods/0cc6caf6-1abf-48e6-86c1-ce97431fa5cb/volumes/kubernetes.io~csi/pvc-8fa08040-6717-4109-aff1-adb1da95902e/mount, err error command mount with pid: 257 killed as timeout of 60 seconds reached

And one pod's container is stuck in ContainerCreating:

hpe-nfs pod/hpe-nfs-8fa08040-6717-4109-aff1-adb1da95902e-6c46c8cf45-cdpcb 0/1 ContainerCreating 0 38m <none> bl9279 <none> <none>

Warning FailedAttachVolume 90s (x4 over 15m) attachdetach-controller (combined from similar events): AttachVolume.Attach failed for volume "pvc-d94e9949-f273-49b7-a02b-f14888723f7e" : rpc error: code = Internal desc = Failed to add ACL to volume pvc-d94e9949-f273-49b7-a02b-f14888723f7e for node &{ bl9279 782ef740-5fee-9b5c-ec27-77d233a22990 [0xc00050ad80] [0xc00050adc0 0xc00050add0 0xc00050ade0] [0xc00050ae40 0xc00050ae50] } via CSP, err: Request failed with status code 400 and errors Error code (Bad Request) and message (HTTP error response from backend {"code":43,"desc":"invalid input: number exceeds expected range","ref":"lun"})

Warning FailedMount 80s (x14 over 36m) kubelet Unable to attach or mount volumes: unmounted volumes=[hpe-nfs-8fa08040-6717-4109-aff1-adb1da95902e], unattached volumes=[hpe-nfs-8fa08040-6717-4109-aff1-adb1da95902e hpe-nfs-config kube-api-access-jphcc]: timed out waiting for the condition

Any ideas?

nin0-0 commented 8 months ago
Warning  FailedMount  5s (x4 over 23s)  kubelet            MountVolume.SetUp failed for volume "pvc-df6e71ca-1df7-4a5a-96cd-1a1c6454341d" : rpc error: code = Internal desc = Error mounting nfs share 172.30.55.198:/export at /var/lib/kubelet/pods/b665ee7f-3aa0-41ab-9242-4cf429d1d26b/volumes/kubernetes.io~csi

and seeing this:

[root@bl1459 ticams003]# oc get po -n hpe-nfs -o wide
NAME                                                            READY   STATUS              RESTARTS   AGE     IP              NODE     NOMINATED NODE   READINESS GATES
hpe-nfs-081ecb26-831f-422f-8644-8900114793f6-758d4ccc77-8fvzn   1/1     Running             0          8d      172.16.30.24    bl9279   <none>           <none>
hpe-nfs-3416f79d-60e1-4b7e-a65b-34506e69d9a3-546f94b57c-z2qlg   1/1     Running             0          8d      172.16.31.250   bl9279   <none>           <none>
hpe-nfs-342333d0-d01f-4eb3-bb95-fc09960b4778-5cb9886f5f-5w7zh   1/1     Running             0          8d      172.16.30.28    bl9279   <none>           <none>
hpe-nfs-3aee3a42-6564-4045-8921-dfc42c743139-7768fc886f-stxl5   1/1     Running             0          8d      172.16.30.51    bl9279   <none>           <none>
hpe-nfs-45040c1b-2ab0-473b-a3ec-fc6bc7bb157d-5d6848d48c-brrfp   1/1     Running             0          8d      172.16.30.39    bl9279   <none>           <none>
hpe-nfs-476fbea8-229f-4254-bef3-cf46d3a4f2cf-86c6cb9997-vqhh5   1/1     Running             0          8d      172.16.30.50    bl9279   <none>           <none>
hpe-nfs-57664a80-0921-462e-a827-248d64bd66cf-56c6ff7797-dhqrm   1/1     Running             0          8d      172.16.30.45    bl9279   <none>           <none>
hpe-nfs-66600fc8-4a48-440f-ab4b-8b5f780480f4-6b575cbf6f-sr22j   1/1     Running             0          8d      172.16.30.37    bl9279   <none>           <none>
hpe-nfs-7564d89c-c8e7-471c-b12d-97ea6d8599a4-6564f749f7-rj4n6   1/1     Running             0          8d      172.16.31.251   bl9279   <none>           <none>
hpe-nfs-8fa08040-6717-4109-aff1-adb1da95902e-6c46c8cf45-hskdx   0/1     ContainerCreating   0          3h31m   <none>          bl9273   <none>           <none>
hpe-nfs-9aeb8ba1-a70a-47cf-bfe1-c5b903591485-6d4b57f47f-fvbw4   1/1     Running             0          8d      172.16.30.32    bl9279   <none>           <none>
hpe-nfs-9b1dc917-49f8-46da-8e55-0da01e685330-d5fc5c9c7-65m8j    1/1     Running             0          8d      172.16.30.44    bl9279   <none>           <none>
hpe-nfs-aacd8ea9-ee20-4d46-b520-b1401f911df1-596f8b888f-8bhhd   1/1     Running             0          8d      172.16.30.55    bl9279   <none>           <none>
hpe-nfs-ac6f85cf-dc1c-4571-b510-4d4f02b12954-6fc55458f7-h2wfk   1/1     Running             0          8d      172.16.30.48    bl9279   <none>           <none>
hpe-nfs-ae1438b0-ecab-454a-a3c9-5403e73a600a-5d47db879f-tpnpl   1/1     Running             0          8d      172.16.30.34    bl9279   <none>           <none>
hpe-nfs-b1ae7789-6111-4d3f-ad20-ecfff97f4620-86979bfc85-j7fcv   1/1     Running             0          8d      172.16.30.33    bl9279   <none>           <none>
hpe-nfs-b76bdfb3-5e11-463e-bfd6-60d74115ab37-6877ff59b5-nmrf8   1/1     Running             0          8d      172.16.30.42    bl9279   <none>           <none>
hpe-nfs-b8438a6b-3dbc-44e4-ab4a-2d1853587233-54c4c698b7-nrbb9   1/1     Running             0          8d      172.16.30.35    bl9279   <none>           <none>
hpe-nfs-d9168e58-2a17-490a-9253-07d40999c5e5-57f5758c5f-54nlc   1/1     Running             0          8d      172.16.30.43    bl9279   <none>           <none>
hpe-nfs-df6e71ca-1df7-4a5a-96cd-1a1c6454341d-6f99f7558f-fbshr   1/1     Running             0          8d      172.16.30.40    bl9279   <none>           <none>
hpe-nfs-eaff5f8f-dedd-46a1-9ce0-63f8f75593c8-76bb66b8cf-9lf45   1/1     Running             0          8d      172.16.30.52    bl9279   <none>           <none>
hpe-nfs-f60b695f-bc6c-4864-80a1-df26e660fb81-56799bf6d7-wjskp   1/1     Running             0          8d      172.16.30.58    bl9279   <none>           <none>
hpe-nfs-fd0b1265-e869-44b3-87c1-9121da90c995-75476d8db9-pkd85   1/1     Running             0          8d      172.16.30.46    bl9279   <none>           <none>

One pod constantly gets scheduled on the wrong node.

datamattsson commented 8 months ago
CSP, err: Request failed with status code 400 and errors Error code (Bad Request) and message (HTTP error response from backend {"code":43,"desc":"invalid input: number exceeds expected range","ref":"lun"})

This is interesting, can we get the logs from the CSP to see the full context of this request?
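
For collecting them, a sketch (the deployment names match the pods shown in the next comment; add -n <namespace> for wherever the HPE CSI driver components are installed):

oc logs deploy/primera3par-csp --tail=200                       # CSP log around the failed VLUN creation
oc logs deploy/hpe-csi-controller -c hpe-csi-driver --tail=200  # controller side of the publish call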

nin0-0 commented 8 months ago
[root@bl1459 ticams003]# oc get po -n  mavenirvsreprod-ms   zookeeper-1 -o wide
NAME          READY   STATUS              RESTARTS   AGE    IP       NODE     NOMINATED NODE   READINESS GATES
zookeeper-1   0/1     ContainerCreating   0          118m   <none>   bl9273   <none>           <none>
[root@bl1459 ticams003]# oc logs hpe-nfs-8fa08040-6717-4109-aff1-adb1da95902e-6c46c8cf45-hskdx
Error from server (BadRequest): container "hpe-nfs" in pod "hpe-nfs-8fa08040-6717-4109-aff1-adb1da95902e-6c46c8cf45-hskdx" is waiting to start: ContainerCreating
Nov 02 18:15:27 bl9273 kubenswrapper[6959]: I1102 18:15:27.502390    6959 operation_generator.go:616] "MountVolume.WaitForAttach entering for volume \"pvc-df6e71ca-1df7-4a5a-96cd-1a1c6454341d\" (UniqueName: \"kubernetes.io/csi/csi.hpe.com^df6e71ca-1df7-4a5a-96cd-1a1c6454341d\") pod \"zookeeper-1\" (UID: \"b665ee7f-3aa0-41ab-9242-4cf429d1d26b\") DevicePath \"csi-929b2b1a8ea307d2fe217162be3fdbd792271f6e7288bbbafbd4afe00b3ca322\"" pod="mavenirvsreprod-ms/zookeeper-1"
Nov 02 18:15:27 bl9273 kubenswrapper[6959]: I1102 18:15:27.505695    6959 operation_generator.go:626] "MountVolume.WaitForAttach succeeded for volume \"pvc-df6e71ca-1df7-4a5a-96cd-1a1c6454341d\" (UniqueName: \"kubernetes.io/csi/csi.hpe.com^df6e71ca-1df7-4a5a-96cd-1a1c6454341d\") pod \"zookeeper-1\" (UID: \"b665ee7f-3aa0-41ab-9242-4cf429d1d26b\") DevicePath \"csi-929b2b1a8ea307d2fe217162be3fdbd792271f6e7288bbbafbd4afe00b3ca322\"" pod="mavenirvsreprod-ms/zookeeper-1"
Nov 02 18:15:27 bl9273 kubenswrapper[6959]: E1102 18:15:27.534311    6959 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/csi.hpe.com^df6e71ca-1df7-4a5a-96cd-1a1c6454341d podName: nodeName:}" failed. No retries permitted until 2023-11-02 18:15:28.034299437 +0000 UTC m=+885180.261145961 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "pvc-df6e71ca-1df7-4a5a-96cd-1a1c6454341d" (UniqueName: "kubernetes.io/csi/csi.hpe.com^df6e71ca-1df7-4a5a-96cd-1a1c6454341d") pod "zookeeper-1" (UID: "b665ee7f-3aa0-41ab-9242-4cf429d1d26b") : rpc error: code = Internal desc = Error mounting nfs share 172.30.55.198:/export at /var/lib/kubelet/pods/b665ee7f-3aa0-41ab-9242-4cf429d1d26b/volumes/kubernetes.io~csi/pvc-df6e71ca-1df7-4a5a-96cd-1a1c6454341d/mount, err fork/exec /chroot/mount: resource temporarily unavailable
Nov 02 18:15:27 bl9273 kubenswrapper[6959]: I1102 18:15:27.602797    6959 reconciler.go:211] "operationExecutor.UnmountVolume started for volume \"zookeeper-volume\" (UniqueName: \"kubernetes.io/csi/csi.hpe.com^df6e71ca-1df7-4a5a-96cd-1a1c6454341d\") pod \"89065e7a-9f57-4bf7-85a6-2857024b6fe8\" (UID: \"89065e7a-9f57-4bf7-85a6-2857024b6fe8\") "
Nov 02 18:15:33 bl9273 kubenswrapper[6959]: E1102 18:15:33.610291    6959 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/csi.hpe.com^df6e71ca-1df7-4a5a-96cd-1a1c6454341d podName:89065e7a-9f57-4bf7-85a6-2857024b6fe8 nodeName:}" failed. No retries permitted until 2023-11-02 18:15:34.110274202 +0000 UTC m=+885186.337120727 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "zookeeper-volume" (UniqueName: "kubernetes.io/csi/csi.hpe.com^df6e71ca-1df7-4a5a-96cd-1a1c6454341d") pod "89065e7a-9f57-4bf7-85a6-2857024b6fe8" (UID: "89065e7a-9f57-4bf7-85a6-2857024b6fe8") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = Error unmounting target path /var/lib/kubelet/pods/89065e7a-9f57-4bf7-85a6-2857024b6fe8/volumes/kubernetes.io~csi/pvc-df6e71ca-1df7-4a5a-96cd-1a1c6454341d/mount, err: Unable to get mounted device from mountpoint /var/lib/kubelet/pods/89065e7a-9f57-4bf7-85a6-2857024b6fe8/volumes/kubernetes.io~csi/pvc-df6e71ca-1df7-4a5a-96cd-1a1c6454341d/mount, err: fork/exec /chroot/mount: resource temporarily unavailable
 oc logs pod/primera3par-csp-6bd884b69f-pjvl2 | tail -100
[..]
time="2023-11-02T18:19:57Z" level=info msg="Response status and status code of POST on vluns is: 400 Bad Request 400" file="client.go:89"
time="2023-11-02T18:19:57Z" level=info msg="=== CREATE VLUN RESP FROM ARRAY: ===\n &{400 Bad Request 400 HTTP/2.0 2 0 map[Cache-Control:[no-cache, no-store, private, max-age=-1, must-revalidate] Content-Language:[en-US] Content-Length:[77] Content-Security-Policy:[default-src 'self' https://www.google-analytics.com/ ; script-src 'unsafe-eval' 'self' https://www.google-analytics.com/analytics.js 'sha256-v0Q8X6xSNmwWcfldSEWqW+3imFpJurfHYdywm3hw6+w=' ; style-src   'nonce-d29tZSBjb29sIHN0cmluZyB3aWxsIHBvcCB1cCAxMjM=' 'self' blob:; frame-src 'self'; object-src 'self';  media-src 'self' data: blob:; font-src 'self' data:; frame-ancestors 'self'] Content-Type:[application/json] Date:[Thu, 02 Nov 2023 18:19:58 GMT] Pragma:[no-cache] Server:[hp3par-wsapi] Strict-Transport-Security:[max-age=31536000; includeSubDomains;] Vary:[Accept-Encoding Origin] X-Content-Type-Options:[nosniff] X-Frame-Options:[SAMEORIGIN] X-Xss-Protection:[1; mode=block]] {0xc0006b9200} 77 [] false false map[] 0xc0006c6200 0xc00052e420}" file="managed_client.go:583"
time="2023-11-02T18:19:57Z" level=info msg="Request: action=POST path=https://10.244.89.6:443/api/v1/vluns" file="http.go:91"
time="2023-11-02T18:19:57Z" level=info msg="HTTP Response: {\n    \"code\": 43,\n    \"desc\": \"invalid input: number exceeds expected range\",\n    \"ref\": \"lun\"\n }" file="utils.go:46"
time="2023-11-02T18:19:57Z" level=info msg="Response status 400 Bad Request" file="rest_client.go:38"
time="2023-11-02T18:19:57Z" level=info msg="Response status and status code of POST on vluns is: 400 Bad Request 400" file="client.go:89"
time="2023-11-02T18:19:57Z" level=info msg="=== CREATE VLUN RESP FROM ARRAY: ===\n &{400 Bad Request 400 HTTP/2.0 2 0 map[Cache-Control:[no-cache, no-store, private, max-age=-1, must-revalidate] Content-Language:[en-US] Content-Length:[77] Content-Security-Policy:[default-src 'self' https://www.google-analytics.com/ ; script-src 'unsafe-eval' 'self' https://www.google-analytics.com/analytics.js 'sha256-v0Q8X6xSNmwWcfldSEWqW+3imFpJurfHYdywm3hw6+w=' ; style-src   'nonce-d29tZSBjb29sIHN0cmluZyB3aWxsIHBvcCB1cCAxMjM=' 'self' blob:; frame-src 'self'; object-src 'self';  media-src 'self' data: blob:; font-src 'self' data:; frame-ancestors 'self'] Content-Type:[application/json] Date:[Thu, 02 Nov 2023 18:19:58 GMT] Pragma:[no-cache] Server:[hp3par-wsapi] Strict-Transport-Security:[max-age=31536000; includeSubDomains;] Vary:[Accept-Encoding Origin] X-Content-Type-Options:[nosniff] X-Frame-Options:[SAMEORIGIN] X-Xss-Protection:[1; mode=block]] {0xc0004a4900} 77 [] false false map[] 0xc0001b0200 0xc0006ce580}" file="managed_client.go:583"
time="2023-11-02T18:19:57Z" level=info msg="Raising panic with following message: error while creating luns" file="utils.go:166"
time="2023-11-02T18:19:57Z" level=info msg="[ REQUEST-ID 103220 ] -- <<<<< Create VLUN Cmd: [Array: 10.244.89.6] - Host bl9273 and Volume pvc-d94e9949-f273-49b7-a02b-f14888723f7e" file="panic.go:884"
time="2023-11-02T18:19:57Z" level=trace msg="Releasing mutex lock for PublishVolume:pvc-d94e9949-f273-49b7-a02b-f14888723f7e:2fb30bf8-4c25-ba05-187c-c82b4e3c58f6" file="concurrent.go:67"
time="2023-11-02T18:19:57Z" level=info msg="[ REQUEST-ID 103220 ] -- <<<<< Publish Volume Cmd for volume pvc-d94e9949-f273-49b7-a02b-f14888723f7e and host 2fb30bf8-4c25-ba05-187c-c82b4e3c58f6" file="panic.go:884"
time="2023-11-02T18:19:57Z" level=info msg="CspManager: Recovering from panic..." file="csp_manager.go:42"
time="2023-11-02T18:19:57Z" level=info msg="Sending following message to CSI driver: error while creating luns" file="utils.go:173"
time="2023-11-02T18:19:57Z" level=info msg="Error response to be sent is &models.ErrorsPayload{Errors:[]models.ErrorObject{models.ErrorObject{Code:\"Bad Request\", Message:\"HTTP error response from backend {\\\"code\\\":43,\\\"desc\\\":\\\"invalid input: number exceeds expected range\\\",\\\"ref\\\":\\\"lun\\\"}\"}}}" file="utils.go:181"

Could the API be down or something? I get "404 page not found" on https://10.244.89.6/api/v1 and {"code":6,"desc":"invalid session key"} on https://10.244.89.6/api/v1/vluns.

oc logs pod/hpe-csi-controller-6c6bd854d7-hk8r8 -c hpe-csi-driver | tail -200
time="2023-11-02T17:49:52Z" level=info msg="GRPC call: /csi.v1.Controller/ControllerPublishVolume" file="utils.go:69"
time="2023-11-02T17:49:52Z" level=info msg="GRPC request: {\"node_id\":\"2fb30bf8-4c25-ba05-187c-c82b4e3c58f6\",\"secrets\":\"***stripped***\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"xfs\"}},\"access_mode\":{\"mode\":1}},\"volume_context\":{\"accessProtocol\":\"fc\",\"compression\":\"false\",\"cpg\":\"\",\"description\":\"Volume created by the HPE CSI Driver for Kubernetes\",\"fsMode\":\"0777\",\"fsType\":\"xfs\",\"hostEncryption\":\"false\",\"nfsPVC\":\"true\",\"nfsResources\":\"true\",\"provisioningType\":\"tpvv\",\"snapCpg\":\"\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1694638608831-8081-csi.hpe.com\",\"volumeAccessMode\":\"mount\"},\"volume_id\":\"pvc-d94e9949-f273-49b7-a02b-f14888723f7e\"}" file="utils.go:70"
time="2023-11-02T17:49:54Z" level=error msg="status code was 400 Bad Request for request: action=PUT path=http://alletra9000-csp-svc:8080/containers/v1/volumes/pvc-d94e9949-f273-49b7-a02b-f14888723f7e/actions/publish, attempting to decode error response." file="client.go:184"
time="2023-11-02T17:49:54Z" level=error msg="HTTP error 400.  Error payload: Error code (Bad Request) and message (HTTP error response from backend {\"code\":43,\"desc\":\"invalid input: number exceeds expected range\",\"ref\":\"lun\"})" file="container_storage_provider.go:956"
time="2023-11-02T17:49:54Z" level=error msg="Failed to publish volume pvc-d94e9949-f273-49b7-a02b-f14888723f7e, err: Request failed with status code 400 and errors Error code (Bad Request) and message (HTTP error response from backend {\"code\":43,\"desc\":\"invalid input: number exceeds expected range\",\"ref\":\"lun\"})" file="controller_server.go:868"
time="2023-11-02T17:49:54Z" level=error msg="Error controller publishing volume pvc-d94e9949-f273-49b7-a02b-f14888723f7e, err: rpc error: code = Internal desc = Failed to add ACL to volume pvc-d94e9949-f273-49b7-a02b-f14888723f7e for node &{ bl9273 2fb30bf8-4c25-ba05-187c-c82b4e3c58f6 [0xc0005d6a80] [0xc0005d6ac0 0xc0005d6ad0 0xc0005d6ae0] [0xc0005d6b00 0xc0005d6b10]  } via CSP, err: Request failed with status code 400 and errors Error code (Bad Request) and message (HTTP error response from backend {\"code\":43,\"desc\":\"invalid input: number exceeds expected range\",\"ref\":\"lun\"})" file="controller_server.go:726"
[root@bl1459 ticams003]# oc get pv  | grep pvc-df6e71ca-1df7-4a5a-96cd-1a1c6454341d
pvc-df6e71ca-1df7-4a5a-96cd-1a1c6454341d   6Gi           RWX            Delete           Bound    mavenirvsreprod-ms/zookeeper-volume-claim                                      hpe-nfs                 49d
[root@bl1459 ticams003]# oc get pv  | grep pvc-d94e9949-f273-49b7-a02b-f14888723f7e
pvc-d94e9949-f273-49b7-a02b-f14888723f7e   60Gi          RWO            Delete           Bound    hpe-nfs/hpe-nfs-8fa08040-6717-4109-aff1-adb1da95902e                           hpe-nfs                 49d
[root@bl1459 ticams003]# oc get pvc -n mavenirvsreprod-ms | grep adb1da95902e
influxdb-volume-claim         Bound    pvc-8fa08040-6717-4109-aff1-adb1da95902e   60Gi       RWX            hpe-nfs        49d

Anything more I can provide?

nin0-0 commented 8 months ago

Seemingly, allowMutations is not consistently adding hostSeesVLUN to the existing PVCs.

I've now (with the help of HPE support) added it for all PVCs in all namespaces with a little script.

Should I remove allowMutations?

datamattsson commented 8 months ago

Did you follow the steps and caveats here: https://scod.hpedev.io/container_storage_provider/hpe_alletra_9000/index.html#vlun_templates ?
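
For reference, a sketch of the mutation flow that page describes (the csi.hpe.com/<parameter> annotation convention and the allowMutations value are assumptions to verify against the docs; names in <> are placeholders):

# 1. list the parameter as mutable in the StorageClass:
#      parameters:
#        allowMutations: hostSeesVLUN
# 2. then annotate each existing PVC to mutate its backing volume:
oc annotate pvc/<pvc-name> -n <namespace> csi.hpe.com/hostSeesVLUN="true"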

nin0-0 commented 8 months ago

I hadn't read this correctly (and not at all at first; I was following HPE support's instructions). Thanks for the pointer!

nin0-0 commented 4 months ago

No issues anymore, thanks a lot for the help.