hpe-storage / csi-driver

A Container Storage Interface (CSI) driver from HPE
https://scod.hpedev.io
Apache License 2.0

HPE CSI driver NFS mount errors #385

Closed: nin0-0 closed this issue 4 months ago

nin0-0 commented 4 months ago

OCP 4.12, HPE CSI driver 2.4.0

We're trying to deploy IBM TNCP (proviso) and only some pods of the deployment are failing with:

  Warning  FailedMount  80m (x2 over 88m)   kubelet            Unable to attach or mount volumes: unmounted volumes=[logs pack-content], unattached volumes=[logs pack-content kube-api-access-pzsdp keystore-security sessions-security work-pack]: timed out waiting for the condition
  Warning  FailedMount  31m (x6 over 62m)   kubelet            Unable to attach or mount volumes: unmounted volumes=[pack-content], unattached volumes=[kube-api-access-pzsdp keystore-security sessions-security work-pack logs pack-content]: timed out waiting for the condition
  Warning  FailedMount  10m (x4 over 55m)   kubelet            Unable to attach or mount volumes: unmounted volumes=[pack-content], unattached volumes=[logs pack-content kube-api-access-pzsdp keystore-security sessions-security work-pack]: timed out waiting for the condition
  Warning  FailedMount  6m31s               kubelet            Unable to attach or mount volumes: unmounted volumes=[pack-content], unattached volumes=[pack-content kube-api-access-pzsdp keystore-security sessions-security work-pack logs]: timed out waiting for the condition
  Warning  FailedMount  21s (x60 over 88m)  kubelet            (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[pack-content], unattached volumes=[sessions-security work-pack logs pack-content kube-api-access-pzsdp keystore-security]: timed out waiting for the condition
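Since the StorageClass has nfsResources: "true", each of these claims should be backed by an NFS server pod and service created by the driver (in the hpe-nfs namespace by default, if we read the docs right). This is roughly what we're checking; namespace and names are illustrative:

  # NFS server deployments/services the driver spins up for NFS-backed claims
  oc get pods,svc -n hpe-nfs -o wide
  # does the ClusterIP from the mount error (e.g. 172.30.57.231) belong to one of them?
  oc get svc -n hpe-nfs | grep 172.30.57.231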

In the related hpe-csi-driver pod we see this:

time="2024-03-06T10:53:33Z" level=error msg="GRPC error: rpc error: code = Internal desc = Error mounting nfs share 172.30.57.231:/export at /var/lib/kubelet/pods/a22c9293-363a-40ba-9f05-fdc975e776f9/volumes/kubernetes.io~csi/pvc-259d76de-990e-43b7-b8de-4c28d87580c7/mount, err error command mount with pid: 2393 killed as timeout of 60 seconds reached" file="utils.go:73"
time="2024-03-06T10:54:19Z" level=error msg="\n Error in GetSecondaryBackends unexpected end of JSON input" file="volume.go:87"
time="2024-03-06T10:54:19Z" level=error msg="\n Passed details " file="volume.go:88"
time="2024-03-06T10:55:50Z" level=error msg="command mount with pid: 2424 killed as timeout of 60 seconds reached" file="cmd.go:60"
time="2024-03-06T10:55:50Z" level=error msg="GRPC error: rpc error: code = Internal desc = Error mounting nfs share 172.30.25.60:/export at /var/lib/kubelet/pods/29023752-be2c-499d-b92d-72373b423188/volumes/kubernetes.io~csi/pvc-69fe50cd-2f74-44b4-bb13-dc8c51add505/mount, err error command mount with pid: 2424 killed as timeout of 60 seconds reached" file="utils.go:73"
time="2024-03-06T10:56:35Z" level=error msg="command mount with pid: 2429 killed as timeout of 60 seconds reached" file="cmd.go:60"

Please help us debug this and/or point us to the relevant issue.

# oc get sc -o yaml hpe-nfs
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "4"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"argocd.argoproj.io/sync-wave":"4"},"labels":{"app.kubernetes.io/instance":"bru11-nprod-be-01-infra"},"name":"hpe-nfs"},"parameters":{"accessProtocol":"fc","allowMutations":"hostSeesVLUN","csi.storage.k8s.io/controller-expand-secret-name":"hpe-backend","csi.storage.k8s.io/controller-expand-secret-namespace":"hpe-storage","csi.storage.k8s.io/controller-publish-secret-name":"hpe-backend","csi.storage.k8s.io/controller-publish-secret-namespace":"hpe-storage","csi.storage.k8s.io/fstype":"xfs","csi.storage.k8s.io/node-publish-secret-name":"hpe-backend","csi.storage.k8s.io/node-publish-secret-namespace":"hpe-storage","csi.storage.k8s.io/node-stage-secret-name":"hpe-backend","csi.storage.k8s.io/node-stage-secret-namespace":"hpe-storage","csi.storage.k8s.io/provisioner-secret-name":"hpe-backend","csi.storage.k8s.io/provisioner-secret-namespace":"hpe-storage","description":"Volume created by the HPE CSI Driver for Kubernetes","fsMode":"0777","hostSeesVLUN":"true","nfsResources":"true"},"provisioner":"csi.hpe.com","reclaimPolicy":"Delete","volumeBindingMode":"Immediate"}
  creationTimestamp: "2023-11-23T17:31:40Z"
  labels:
    app.kubernetes.io/instance: bru11-nprod-be-01-infra
  name: hpe-nfs
  resourceVersion: "62518251"
  uid: afc44a3f-6ba2-4ac7-864a-9106dbd01173
parameters:
  accessProtocol: fc
  allowMutations: hostSeesVLUN
  csi.storage.k8s.io/controller-expand-secret-name: hpe-backend
  csi.storage.k8s.io/controller-expand-secret-namespace: hpe-storage
  csi.storage.k8s.io/controller-publish-secret-name: hpe-backend
  csi.storage.k8s.io/controller-publish-secret-namespace: hpe-storage
  csi.storage.k8s.io/fstype: xfs
  csi.storage.k8s.io/node-publish-secret-name: hpe-backend
  csi.storage.k8s.io/node-publish-secret-namespace: hpe-storage
  csi.storage.k8s.io/node-stage-secret-name: hpe-backend
  csi.storage.k8s.io/node-stage-secret-namespace: hpe-storage
  csi.storage.k8s.io/provisioner-secret-name: hpe-backend
  csi.storage.k8s.io/provisioner-secret-namespace: hpe-storage
  description: Volume created by the HPE CSI Driver for Kubernetes
  fsMode: "0777"
  hostSeesVLUN: "true"
  nfsResources: "true"
provisioner: csi.hpe.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
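For what it's worth, a minimal RWX claim against this class should exercise the same NFS path outside of the TNCP deployment (the name is only for illustration):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: hpe-nfs-mount-test
  spec:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 1Gi
    storageClassName: hpe-nfs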

thanks

nin0-0 commented 4 months ago

We've noticed that automatic upgrades were enabled for the operator, and it's now in a broken state.

      Name:       certified-operators
      Namespace:  openshift-marketplace
    Identifier:   hpe-csi-operator.v2.4.1
    Path:         registry.connect.redhat.com/hpestorage/csi-driver-operator-bundle@sha256:b5f87d6a9c7ec3a4d53204e86d0fa57d10d4aba2eeb1882f0a1b1caa19c7d9fd
    Properties:   {"properties":[{"type":"olm.gvk","value":{"group":"storage.hpe.com","kind":"HPECSIDriver","version":"v1"}},{"type":"olm.package","value":{"packageName":"hpe-csi-operator","version":"2.4.1"}}]}
    Replaces:     hpe-csi-operator.v2.4.0
  Catalog Sources:
  Conditions:
    Last Transition Time:  2024-03-04T17:51:58Z
    Last Update Time:      2024-03-04T17:51:58Z
    Message:               error validating existing CRs against new CRD's schema for "hpecsidrivers.storage.hpe.com": error validating custom resource against new schema for HPECSIDriver hpe-csi-driver/csi-driver: [].spec.disable.alletraStorageMP: Required value

Can you advise what to do to resolve this? 2.4.0 is in the Replacing state and 2.4.1 is Pending.
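This is roughly how we're looking at the stuck install (sketch; <operator-namespace> is wherever the operator is installed):

  oc get subscription,csv,installplan -n <operator-namespace>
  oc describe csv hpe-csi-operator.v2.4.1 -n <operator-namespace>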

nin0-0 commented 4 months ago

We only found this: https://github.com/hpe-storage/csi-driver/blob/master/release-notes/v2.4.1.md (it's not even listed under releases).

Should we edit the old CRD/CR to include disable.alletraStorageMP (e.g. something like the patch sketched below)?

Or should we follow this: https://scod.hpedev.io/partners/redhat_openshift/index.html#upgrading?
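The patch we had in mind for the first option, based purely on the validation error above (untested; it assumes the field is a plain boolean, and uses the CR name/namespace from the error message):

  oc patch hpecsidriver csi-driver -n hpe-csi-driver --type=merge \
    -p '{"spec":{"disable":{"alletraStorageMP":false}}}'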

datamattsson commented 4 months ago

The operator is currently broken. We're working with Red Hat to have it resolved.

datamattsson commented 4 months ago

The 2.4.0 release has been restored.

nin0-0 commented 4 months ago

How do we get out of the broken state, though?
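Is it just a matter of deleting the pending CSV and any failed install plan and letting OLM retry against the restored 2.4.0 catalog? Guessing at something like this (namespace placeholder again):

  oc delete csv hpe-csi-operator.v2.4.1 -n <operator-namespace>
  oc get installplan -n <operator-namespace>
  oc delete installplan <pending-install-plan> -n <operator-namespace>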