hpe-storage / truenas-csp

TrueNAS Container Storage Provider for HPE CSI Driver for Kubernetes
https://scod.hpedev.io
MIT License

CSI driver not able to find the device on the K8s node #37

Closed: klacus closed this issue 1 year ago

klacus commented 1 year ago

Hi, I am posting the issue here with the log files, as asked by datamattsson on Slack.

I am having issues using Persistent Volumes hosted on TrueNAS SCALE. I hope this is the right forum to ask this question and that someone can tell me what I am doing wrong. The volume gets provisioned on TrueNAS and the PV gets created in K8s too. However, when the pod starts, kubelet reports an error about failing to mount the volume.

Node OS: Linux k8sw2 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Release:

root@k8sw2:/home/pgsadmin# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

TrueNAS SCALE version: TrueNAS-SCALE-22.02.4

K8s version:

╴  dev on  K8sLab(myns) at …/deployment/k8s/helm is 📦 v1.0.0 via  v3.10.2
╰─ k version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:38:15Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}

HPE CSI Version:

╭╴  dev on  K8sLab(hpe-storage) at …/deployment/k8s/helm is 📦 v1.0.0 via  v3.10.2
╰─ k logs hpe-csi-controller-65f4f97cb7-jrr5g
Defaulted container "csi-provisioner" out of: csi-provisioner, csi-attacher, csi-snapshotter, csi-resizer, hpe-csi-driver, csi-volume-mutator, csi-volume-group-snapshotter, csi-volume-group-provisioner, csi-extensions
I1125 14:59:01.456330       1 feature_gate.go:245] feature gates: &{map[]}
I1125 14:59:01.456373       1 csi-provisioner.go:139] Version: v3.1.0
I1125 14:59:01.456377       1 csi-provisioner.go:162] Building kube configs for running in cluster...
I1125 14:59:01.517832       1 connection.go:154] Connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

TrueNAS CSP: quay.io/datamattsson/truenas-csp:v2.2.0

The CSI driver was installed with the TrueNAS CSP Helm chart 1.1.2 from https://artifacthub.io/packages/helm/truenas-csp/truenas-csp. The error message from the output of describing the pod:

Volumes:
  datavolume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datapvc
    ReadOnly:   false
  kube-api-access-cp9h5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               37m                   default-scheduler        Successfully assigned myns/mysts-0 to k8sw2
  Normal   SuccessfulAttachVolume  37m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-2fed2d40-e721-4846-8671-2d7040e8a7ca"
  Warning  FailedMount             10m (x17 over 36m)    kubelet                  MountVolume.MountDevice failed for volume "pvc-2fed2d40-e721-4846-8671-2d7040e8a7ca" : rpc error: code = Internal desc = Failed to stage volume k8spool_k8s_pvc-2fed2d40-e721-4846-8671-2d7040e8a7ca, err: rpc error: code = Internal desc = Error creating device for volume k8spool_k8s_pvc-2fed2d40-e721-4846-8671-2d7040e8a7ca, err: device not found with serial 6589cfc000000110a23aaee10b0dd1b4 or target
  Warning  FailedMount             5m56s (x12 over 35m)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[datavolume], unattached volumes=[datavolume kube-api-access-cp9h5]: timed out waiting for the condition

And I noticed an error on the console of the worker node:

[   24.257961] device-mapper: multipath service-time: version 0.3.0 loaded
[   24.258668] device-mapper: table: 253:0: multipath: error getting device
[   24.258696] device-mapper: ioctl: error adding target to table

I also noticed a warning about blkio when adding a new node (with kubeadm) to the existing cluster:

[preflight] Running pre-flight checks
        [WARNING SystemVerification]: missing optional cgroups: blkio
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

The StatefulSet definition:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: myrelease
    meta.helm.sh/release-namespace: myns
  labels:
    app: myapp
    app.kubernetes.io/managed-by: Helm
  name: mysts
  namespace: myns
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: myapp
  serviceName: ""
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: myapp
    spec:
      containers:
      - image: adminserver:5000/sampleapp:1.0.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 10
          periodSeconds: 20
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 5
        name: myapp
        ports:
        - containerPort: 8080
          name: server
          protocol: TCP
        resources:
          limits:
            memory: 1124Mi
          requests:
            cpu: 500m
            memory: 1Gi
        securityContext:
          allowPrivilegeEscalation: false
        startupProbe:
          failureThreshold: 10
          initialDelaySeconds: 50
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/data
          name: datavolume
      dnsPolicy: Default
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsUser: 1000
      serviceAccount: service1
      serviceAccountName: service1
      terminationGracePeriodSeconds: 30
      volumes:
      - name: datavolume
        persistentVolumeClaim:
          claimName: datapvc
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: datapvc
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: hpe-storageclass
      volumeMode: Filesystem
    status:
      phase: Pending

While the container is being created, the TrueNAS side looks OK. The Targets, Extents, and Associated Extents are there, along with the Initiator. The Portal configuration looks OK too, and communication with TrueNAS seems fine. Screenshots attached for reference.
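For completeness, the iSCSI session and the SCSI device it exposed can also be confirmed from the worker node with standard open-iscsi tooling (a diagnostic sketch, not part of the original report):

# List active iSCSI sessions; -P 3 includes per-session details such as
# which /dev/sdX device each attached LUN produced:
iscsiadm -m session -P 3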

The StorageClass definition (API key redacted, although the device is not exposed to the Internet):

---
apiVersion: v1
kind: Secret
metadata:
  name: truenas-secret
  namespace: hpe-storage
stringData:
  serviceName: truenas-csp-svc
  servicePort: "8080"
  username: k8sapikey
  password: <my TrueNAS API key>
  backend: 192.168.1.230
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: hpe-storageclass
provisioner: csi.hpe.com
parameters:
  csi.storage.k8s.io/controller-expand-secret-name: truenas-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: hpe-storage
  csi.storage.k8s.io/controller-publish-secret-name: truenas-secret
  csi.storage.k8s.io/controller-publish-secret-namespace: hpe-storage
  csi.storage.k8s.io/node-publish-secret-name: truenas-secret
  csi.storage.k8s.io/node-publish-secret-namespace: hpe-storage
  csi.storage.k8s.io/node-stage-secret-name: truenas-secret
  csi.storage.k8s.io/node-stage-secret-namespace: hpe-storage
  csi.storage.k8s.io/provisioner-secret-name: truenas-secret
  csi.storage.k8s.io/provisioner-secret-namespace: hpe-storage
  csi.storage.k8s.io/fstype: xfs
  # csi.storage.k8s.io/fstype: ext4
  allowOverrides: sparse,compression,deduplication,volblocksize,sync,description
  root: k8spool/k8s
  description: "Volume created by the HPE CSI Driver for Kubernetes"
  accessProtocol: iscsi
reclaimPolicy: Delete
allowVolumeExpansion: true

The PV and PVC were created:

╭╴  dev on  K8sLab(myns) at …/deployment/k8s/helm is 📦 v1.0.0 via  v3.10.2 took 3s
╰─ k get pv,pvc
NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM          STORAGECLASS       REASON   AGE
persistentvolume/pvc-2fed2d40-e721-4846-8671-2d7040e8a7ca   1Gi        RWO            Delete           Bound    myns/datapvc   hpe-storageclass            2d16h

NAME                                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
persistentvolumeclaim/datapvc   Bound    pvc-2fed2d40-e721-4846-8671-2d7040e8a7ca   1Gi        RWO            hpe-storageclass   2d16h

Multipath seems to be OK on the node too:

root@k8sw2:/home/pgsadmin# multipath -l
mpathf (36589cfc000000110a23aaee10b0dd1b4) dm-0 TrueNAS,iSCSI Disk
size=1.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  `- 1:0:0:0 sdb 8:16 active undef running

iSCSI discovery is OK too:

root@k8sw2:/home/pgsadmin# iscsiadm -m discovery -t sendtargets -p 192.168.1.230
192.168.1.230:3260,1 iqn.2011-08.org.truenas.ctl:pvc-2fed2d40-e721-4846-8671-2d7040e8a7ca
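The FailedMount event above complains that no device was found with serial 6589cfc000000110a23aaee10b0dd1b4. That serial can be cross-checked against the node's view of the disk; a quick check, assuming sdb is the path reported by multipath -l above:

# Print the SCSI identifier udev derives for the device; multipath WWIDs
# prefix the unit serial with an NAA type digit, so the expected output is
# 36589cfc000000110a23aaee10b0dd1b4:
/lib/udev/scsi_id --whitelisted --device=/dev/sdb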

I am out of ideas as to why the volume cannot be mounted. Any help is appreciated!

Attachments: five screenshots (referenced above).

CSI driver log file from the node where the volume should be mounted: hpe-csi-node-99kgz.log

Log files from containers of the CSI controller and CSI node:

hpe-storage-hpe-csi-controller-65f4f97cb7-jrr5g-1672261246107108993.log
hpe-storage-hpe-csi-controller-65f4f97cb7-jrr5g-1672261253968234771.log
hpe-storage-hpe-csi-controller-65f4f97cb7-jrr5g-1672261262104117602.log
hpe-storage-hpe-csi-controller-65f4f97cb7-jrr5g-1672261269954362084.log
hpe-storage-hpe-csi-controller-65f4f97cb7-jrr5g-1672261277590043596.log
hpe-storage-hpe-csi-controller-65f4f97cb7-jrr5g-1672261288083010354.log
hpe-storage-hpe-csi-controller-65f4f97cb7-jrr5g-1672261296357269650.log
hpe-storage-hpe-csi-controller-65f4f97cb7-jrr5g-1672261301719055280.log
hpe-storage-hpe-csi-controller-65f4f97cb7-jrr5g-1672261313704105071.log
hpe-storage-hpe-csi-node-99kgz-1672260886767868056.log
hpe-storage-hpe-csi-node-99kgz-1672260898900143158.log
hpe-storage-hpe-csi-node-99kgz-1672260937175006234.log
hpe-storage-hpe-csi-node-99kgz-1672260945656099461.log

datamattsson commented 1 year ago

Thanks for reporting this. There's a LOG_LEVEL environment variable on the CSI node driver that needs to be set to "trace".

kubectl -n hpe-storage edit ds/hpe-csi-node
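If editing the DaemonSet interactively is awkward, kubectl set env makes the same change; a sketch, assuming the node plugin's main container is named hpe-csi-driver:

# Set LOG_LEVEL on the node plugin container; the DaemonSet then rolls
# its pods automatically (-c hpe-csi-driver is an assumed container name):
kubectl -n hpe-storage set env ds/hpe-csi-node -c hpe-csi-driver LOG_LEVEL=trace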

Restart the experiment and post the log file.

Could you also double-check that your "Global Target Configuration" base name is set to iqn.2011-08.org.truenas.ctl as prescribed here: https://github.com/hpe-storage/truenas-csp/blob/master/INSTALL.md#configure-truenasfreenas
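For a quick check from the TrueNAS shell, something like this should print the configured base name (a sketch; midclt ships with SCALE, and iscsi.global.config is the middleware method behind that screen):

# Print the iSCSI global configuration; the "basename" field should read
# iqn.2011-08.org.truenas.ctl:
midclt call iscsi.global.config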

klacus commented 1 year ago

Hi Michael,

Thanks for looking into this!

Yes, the "Base Name" under the "Target Global Configuration" on TrueNAS SCALE is set to "iqn.2011-08.org.truenas.ctl".

Attached are the log files after setting LOG_LEVEL to "trace" and restarting the entire K8s cluster. The "hpe-storage/truenas-csp-6f9bb9b94f-m99sq" pod does not have any log entries.

hpe-storage-hpe-csi-node-4qlx9-1673485064248579168 (1).log
hpe-storage-hpe-csi-controller-fbdf874d7-lblpf-1673484953953757288.log

Four screenshots attached.

datamattsson commented 1 year ago

Thanks for the update. I find this very strange. Can you post what your /etc/multipath.conf file looks like on the node? You can also delete /etc/multipath.conf to have it rewritten by the node driver (this requires a node driver restart).
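A minimal sketch of that procedure (the app=hpe-csi-node label is an assumption based on the chart defaults):

# On the affected worker node, move the old config out of the way:
sudo mv /etc/multipath.conf /etc/multipath.conf.bak

# Delete the node plugin pod so the DaemonSet recreates it and the
# driver regenerates /etc/multipath.conf on startup:
kubectl -n hpe-storage delete pod -l app=hpe-csi-node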

klacus commented 1 year ago

After letting the node driver recreate the /etc/multipath.conf, the PV mounting started to work as expected.

Thanks for steering me in the right direction to fix my lab configuration!

It was indeed a misconfiguration in my lab system.

This was the /etc/multipath.conf on the worker node originally (presumably the Ubuntu default); note that it lacks the TrueNAS/FreeNAS entries seen in the regenerated file below:

root@k8sw1:/etc# cat multipath.conf
defaults {
    find_multipaths     no
    user_friendly_names yes
}
devices {
    device {
        hardware_handler     "1 alua"
        dev_loss_tmo         infinity
        path_checker         tur
        product              "Server"
        prio                 alua
        vendor               "Nimble"
        path_grouping_policy group_by_prio
        fast_io_fail_tmo     5
        failback             immediate
        path_selector        "service-time 0"
        no_path_retry        30
    }
    device {
        path_grouping_policy group_by_prio
        path_checker         tur
        failback             immediate
        hardware_handler     "1 alua"
        product              "VV"
        vendor               "3PARdata"
        getuid_callout       "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        features             "0"
        checker              tur
        path_selector        "round-robin 0"
        rr_min_io            100
        no_path_retry        18
        prio                 alua
    }
}

After deleting it, the node driver rewrote it as follows at the next reboot of the node:

root@k8sw1:/etc# cat ./multipath.conf
defaults {
    user_friendly_names yes
    find_multipaths     no
    uxsock_timeout      10000
}
blacklist {
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^hd[a-z]"
    device {
        product ".*"
        vendor  ".*"
    }
}
blacklist_exceptions {
    property "(ID_WWN|SCSI_IDENT_.*|ID_SERIAL)"
    device {
        vendor  "Nimble"
        product "Server"
    }
    device {
        vendor  "3PARdata"
        product "VV"
    }
    device {
        vendor  "TrueNAS"
        product "iSCSI Disk"
    }
    device {
        product "iSCSI Disk"
        vendor  "FreeNAS"
    }
}
devices {
    device {
        vendor               "Nimble"
        path_checker         tur
        no_path_retry        30
        prio                 alua
        product              "Server"
        rr_weight            uniform
        dev_loss_tmo         infinity
        path_grouping_policy group_by_prio
        hardware_handler     "1 alua"
        fast_io_fail_tmo     5
        rr_min_io_rq         1
        path_selector        "service-time 0"
        failback             immediate
    }
    device {
        getuid_callout       "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        checker              tur
        vendor               "3PARdata"
        path_selector        "round-robin 0"
        path_grouping_policy group_by_prio
        no_path_retry        18
        failback             immediate
        rr_min_io            100
        path_checker         tur
        features             "0"
        product              "VV"
        hardware_handler     "1 alua"
        prio                 alua
    }
    device {
        rr_weight            priorities
        uid_attribute        ID_SERIAL
        vendor               "TrueNAS"
        product              "iSCSI Disk"
        path_grouping_policy group_by_prio
        path_selector        "queue-length 0"
    }
    device {
        path_grouping_policy group_by_prio
        path_selector        "queue-length 0"
        hardware_handler     "1 alua"
        rr_weight            priorities
        uid_attribute        ID_SERIAL
        vendor               "FreeNAS"
        product              "iSCSI Disk"
    }
}