qnap-dev / QNAP-CSI-PlugIn

Failed to create backend - media is not ready or Auth is not passed #15

louhisuo commented 2 weeks ago

I have a QNAP TS-673A running QuTS hero h5.2.0.2860. The QNAP CSI plugin version is 1.3.0 and the Kubernetes version is 1.30.3 (Talos Linux).

I am trying to initialize the QNAP CSI plugin against the following backend (screenshot attached), using this Trident backend configuration:

apiVersion: v1
kind: Secret
metadata:
  name: qnap-iscsi-backend
  namespace: trident
type: Opaque
stringData:
  username: qnap
  password: qnapqnapqnap
  storageAddress: 172.16.1.21
---
apiVersion: trident.qnap.io/v1
kind: TridentBackendConfig
metadata:
  name: qnap-iscsi-backend
  namespace: trident
spec:
  version: 1
  storageDriverName: qnap-iscsi
  storagePrefix: talos-
  backendName: talos-cluster
  credentials:
    name: qnap-iscsi-backend
  debugTraceFlags:
    method: true
  storage:
    - serviceLevel: Any
      labels:
        performance: any
    - serviceLevel: RAID0-SSDCache
      labels:
        performance: premium
      features:
        ssdCache: "true"
    - serviceLevel: RAID0
      labels:
        performance: standard
      features:
        raidLevel: "0"
    - serviceLevel: RAID1
      labels:
        performance: basic
      features:
        raidLevel: "1"

When describing the TridentBackendConfig I see the following errors, and there are also errors in the trident-controller pod's storage-api-server and trident-main container logs.

  Warning  Failed  6m56s (x9 over 11m)    trident-crd-controller  Failed to create backend: problem initializing storage driver 'qnap-iscsi': rpc error: code = Unknown desc = login failed; please check if storage is online
  Warning  Failed  117s (x8 over 12m)     trident-crd-controller  Failed to create backend: problem initializing storage driver 'qnap-iscsi': rpc error: code = Unknown desc = media is not ready or Auth is not passed

... and the IP address of the Kubernetes cluster gets added to the NAS's IP block list.

Note also that Talos Linux implements some Kubernetes security hardening by default, and I get the following type of warnings when deploying the plugin as well as when restarting the deployments and daemonset:

Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "storage-api-server", "trident-main", "csi-provisioner", "csi-attacher", "csi-resizer", "csi-snapshotter" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "storage-api-server", "trident-main", "csi-provisioner", "csi-attacher", "csi-resizer", "csi-snapshotter" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (container "trident-main" must not set securityContext.runAsNonRoot=false), seccompProfile (pod or containers "storage-api-server", "trident-main", "csi-provisioner", "csi-attacher", "csi-resizer", "csi-snapshotter" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/trident-controller restarted
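
These warnings come from the Kubernetes Pod Security admission that Talos enables by default; they are non-blocking as long as the restricted profile is only warned about rather than enforced. As a hedged workaround (not an official recommendation from this project), the trident namespace could be labelled to relax Pod Security admission so the privileged CSI pods are admitted, for example:

apiVersion: v1
kind: Namespace
metadata:
  name: trident
  labels:
    # Relax Pod Security admission for the CSI components, which need
    # privileged access to the host (iSCSI sessions, device mounts).
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged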

Please advise whether this is a fault or a configuration mistake.

LeonaChen2727 commented 2 weeks ago

Please use the NAS's account and password, not the iSCSI CHAP account/password.
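
In other words, the username/password in the backend Secret must be a NAS login account with storage administration rights, not the CHAP credentials. A minimal sketch with placeholder values:

apiVersion: v1
kind: Secret
metadata:
  name: qnap-iscsi-backend
  namespace: trident
type: Opaque
stringData:
  username: nas-admin        # placeholder: NAS login account, not the CHAP user
  password: nas-admin-pass   # placeholder: NAS login password, not the CHAP secret
  storageAddress: 172.16.1.21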

Thanks.

louhisuo commented 2 weeks ago

Thank you. Based on the above I managed to make progress; however, I am now hitting another issue which looks very similar to "pvc is created but pod is unable to mount the volume" (#13). I am also running a single-node Talos Linux cluster.

louhisuo commented 2 weeks ago

I made some progress and can configure the backend with the previously defined TridentBackendConfig. I am now facing an issue where a pod is not able to consume its PVC and gets stuck in status ContainerCreating.

I have the following StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: quts-hero-ssd-raid1
provisioner: csi.trident.qnap.io
parameters:
  selector: "performance=basic"
allowVolumeExpansion: true

the following PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: quts-hero-test-pvc
spec:
  storageClassName: quts-hero-ssd-raid1
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

and the following Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-one
spec:
  replicas: 1
  selector:
    matchLabels:
      app: multi-deployment
  template:
    metadata:
      labels:
        app: multi-deployment
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - name: storage
          mountPath: /tmp/k8s
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: quts-hero-test-pvc

I see the following event on the pod

Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  41s (x115 over 26h)  kubelet  MountVolume.MountDevice failed for volume "pvc-b271b1cd-03f6-4c32-a0cb-33a5edf2a7c7" : rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: exit status 2

and the following is logged in the trident-node-linux pod.

time="2024-08-27T12:35:57Z" level=debug msg="<<<< devices.getDeviceInfoForLUN" iSCSINodeName="iqn.2004-04.com.qnap:ts-673a:iscsi.iscsi-talos--pvc-b271b1cd-03f6-4c32-a0cb-33a5edf2a7c7.82aad6" logLayer=csi_frontend lunID=1 needFSType=false requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg="Found device." devices="[sda]" iqn="iqn.2004-04.com.qnap:ts-673a:iscsi.iscsi-talos--pvc-b271b1cd-03f6-4c32-a0cb-33a5edf2a7c7.82aad6" logLayer=csi_frontend multipathDevice= requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI scsiLun=1 workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg=">>>> devices.waitForDevice" device=/dev/sda logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg="Device found." device=/dev/sda logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg="<<<< devices.waitForDevice" device=/dev/sda logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg=">>>> devices.getDeviceFSType" device=/dev/sda logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg=">>>> devices.waitForDevice" device=/dev/sda logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg="Device found." device=/dev/sda logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg="<<<< devices.waitForDevice" device=/dev/sda logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg=">>>> command.ExecuteWithTimeout." args="[/dev/sda]" command=blkid logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI timeout=5s workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg="<<<< command.ExecuteWithTimeout." logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=info msg="Could not get FSType for device; err: exit status 2." device=/dev/sda logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg="<<<< devices.getDeviceFSType" logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg=">>>> devices.isDeviceUnformatted" device=/dev/sda logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg=">>>> command.ExecuteWithTimeout." args="[if=/dev/sda bs=4096 count=512 status=none]" command=dd logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI timeout=5s workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=debug msg="<<<< command.ExecuteWithTimeout." logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"
time="2024-08-27T12:35:57Z" level=error msg="failed to read the device" device=/dev/sda error="exit status 2" logLayer=csi_frontend requestID=078c660a-a0ab-4333-8520-9a9720e229ff requestSource=CSI workflow="node_server=stage"

Do I have a configuration problem, or is this a fault?

davidcheng0716 commented 2 weeks ago

hi @louhisuo,

Talos is a minimal Linux OS, and it lacks some basic utilities (like dd and others) that are typically found in most Linux systems.

Our service assumes that these tools are available on the node, so if they are missing, attempting to use them could lead to errors.

We are aware that Talos might support Linux utility extensions, which could potentially help you install the required utilities.

Thank you.

Ref :

https://github.com/siderolabs/extensions?tab=readme-ov-file

https://github.com/siderolabs/extensions/tree/main/tools/util-linux
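
For reference, when building the Talos image with the Image Factory, system extensions are requested through a schematic roughly like the sketch below (extension names taken from the siderolabs/extensions repository; adjust to your setup):

# Image Factory schematic (sketch): request the iscsi-tools and
# util-linux-tools system extensions when generating the Talos image.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
      - siderolabs/util-linux-tools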

louhisuo commented 2 weeks ago

I have added the util-linux-tools Talos extension to the cluster (see below)

% talosctl get extensions
NODE           NAMESPACE   TYPE              ID   VERSION   NAME               VERSION
172.16.1.244   runtime     ExtensionStatus   0    1         iscsi-tools        v0.1.4
172.16.1.244   runtime     ExtensionStatus   1    1         qemu-guest-agent   8.2.2
172.16.1.244   runtime     ExtensionStatus   2    1         util-linux-tools   2.39.3
172.16.1.244   runtime     ExtensionStatus   3    1         schematic          88d1f7a5c4f1d3aba7df787c448c1d3d008ed29cfb34af53fa0df4336a56040b

The issue still remains (logs from the trident-node-linux pod):

time="2024-08-28T12:09:38Z" level=debug msg="Found device." devices="[sda]" iqn="iqn.2004-04.com.qnap:ts-673a:iscsi.iscsi-talos--pvc-b4bad894-6ae3-438c-815c-6d7649c6ed54.82aad6" logLayer=csi_frontend multipathDevice= requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI scsiLun=1 workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg=">>>> devices.waitForDevice" device=/dev/sda logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg="Device found." device=/dev/sda logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg="<<<< devices.waitForDevice" device=/dev/sda logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg=">>>> devices.getDeviceFSType" device=/dev/sda logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg=">>>> devices.waitForDevice" device=/dev/sda logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg="Device found." device=/dev/sda logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg="<<<< devices.waitForDevice" device=/dev/sda logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg=">>>> command.ExecuteWithTimeout." args="[/dev/sda]" command=blkid logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI timeout=5s workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg="<<<< command.ExecuteWithTimeout." logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=info msg="Could not get FSType for device; err: exit status 2." device=/dev/sda logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg="<<<< devices.getDeviceFSType" logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg=">>>> devices.isDeviceUnformatted" device=/dev/sda logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg=">>>> command.ExecuteWithTimeout." args="[if=/dev/sda bs=4096 count=512 status=none]" command=dd logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI timeout=5s workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg="<<<< command.ExecuteWithTimeout." logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=error msg="failed to read the device" device=/dev/sda error="exit status 2" logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg="<<<< devices.isDeviceUnformatted" logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=error msg="Unable to identify if the device is unformatted; err: exit status 2" device=/dev/sda logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg="<<<< iscsi.AttachISCSIVolume" logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"
time="2024-08-28T12:09:38Z" level=debug msg="Attach iSCSI volume is not complete, waiting." error="exit status 2" increment=5.533169717s logLayer=csi_frontend requestID=e68df392-5823-4884-83d6-ec5539266468 requestSource=CSI workflow="node_server=stage"

Are you expecting some specific Linux tool to be available on the node? If it is dd, then my understanding is that dd is not part of util-linux but of coreutils, and Talos does not have an extension which delivers the coreutils package.

louhisuo commented 2 weeks ago

Is the issue "Support for Talos" (#806) perhaps the reason why the QNAP CSI plugin does not work with Talos Linux?

brunnels commented 1 week ago

@louhisuo looks like you arrived at the same point I did. The next thing I was going to do was build a Talos extension for coreutils, similar to the util-linux one. It doesn't look that difficult to get going, and it should be possible to deploy it as a GitHub package; I just haven't had time to do it yet.

LeonaChen2727 commented 1 week ago

Is this issue Support for Talos (#806) perhaps a reason why QNAP CSI Plugin does not work with Talos Linux?

Yes, this issue has the same root cause as ours. The unavailability of certain utilities like dd on the node causes the plugin to be unusable.

You can refer to the documentation we provided earlier for the Linux utility extensions, or seek help from Talos.

louhisuo commented 1 week ago

Yes, it looks like we are both hitting the same issue, @brunnels. I am looking forward to a coreutils extension for Talos OS, which would make the dd command available there. My concern is which other tools, unknown to us, are missing in Talos OS, as its design principle has been to remove everything from the OS that is not required to run Kubernetes.

louhisuo commented 1 week ago

@davidcheng0716, if QNAP is serious about positioning their NAS products as Kubernetes storage, QNAP needs to consider making investments in this area.

(1) Refactor the QNAP CSI driver to be OS-agnostic by including all needed tools in the CSI driver itself. With this approach it will be easier for QNAP to support a wide range of operating systems and Kubernetes distributions with minimal effort.
(2) Create user documentation which describes how the driver should be configured to work with QNAP NAS boxes. This will reduce the support effort for QNAP engineers and increase adoption of QNAP as Kubernetes storage.
(3) Add support for the other storage technologies available in QNAP NAS boxes (Samba, NFS, S3).

QNAP is way behind Synology in this regard (see below), and to be very direct, it is very hard to recommend QNAP as Kubernetes storage when compared with what Synology can offer: Synology CSI Driver for Kubernetes, iSCSI Storage with Synology CSI.