qnap-dev / QNAP-CSI-PlugIn


pvc is created but pod is unable to mount the volume #13

Open brunnels opened 3 months ago

brunnels commented 3 months ago

Latest QTS5 version

Events:
  Type     Reason                  Age               From                     Message
  ----     ------                  ----              ----                     -------
  Normal   Scheduled               105s              default-scheduler        Successfully assigned default/write-ntswf to tc1
  Normal   SuccessfulAttachVolume  104s              attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884"
  Warning  FailedMount             3s (x5 over 81s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884" : rpc error: code = Internal desc = failed to stage volume: exit status 2

Logs from the Trident controller don't appear to show anything wrong:

time="2024-06-08 23:40:25" level=info msg="---> Receive Storage Add Request"
time="2024-06-08 23:40:25" level=info msg="Add Storage Successfully" Host=10.60.0.5 Name=qts-nasf586af
time="2024-06-08 23:40:29" level=info msg="---> Receive Qvolume Create Request"
time="2024-06-08 23:40:37" level=info msg="Create Volume Successfully" Capacity=5 Index=0 Name=trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884
time="2024-06-08 23:40:37" level=info msg="---> Receive Iscsi Create Request"
time="2024-06-08 23:40:39" level=info msg="Create Iscsi Target Successfully" Alias=iscsi-trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884 IQN="iqn.2004-04.com.qnap:tvs-871u-rp:iscsi.iscsi-trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884.f586af" Index=0 Name=iscsi-trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884
time="2024-06-08 23:40:39" level=info msg="---> Receive Iscsi Map Request"
time="2024-06-08 23:40:41" level=info msg="Map Volume and Iscsi Target Successfully" Iscsi Target Index=0 Volume Index=0
time="2024-06-08 23:40:41" level=info msg="---> Receive Qvolume FullGet Request"
time="2024-06-08 23:43:00" level=info msg="---> Receive Qvolume List Request"
time="2024-06-08 23:43:00" level=info msg="List Volume Successfully"
time="2024-06-08 23:43:00" level=info msg="---> Receive Qvolume FullGet Request"
time="2024-06-08 23:45:41" level=info msg="---> Receive Qvolume List Request"
time="2024-06-08 23:45:41" level=info msg="List Volume Successfully"
time="2024-06-08 23:45:41" level=info msg="---> Receive Qvolume FullGet Request"

The trident-node-linux logs do contain errors:

time="2024-06-08T23:50:28Z" level=debug msg="<<<< devices.getDeviceInfoForLUN" iSCSINodeName="iqn.2004-04.com.qnap:tvs-871u-rp:iscsi.iscsi-trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884.f586af" logLayer=csi_frontend lunID=0 needFSType=false requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="Found device." devices="[sdb]" iqn="iqn.2004-04.com.qnap:tvs-871u-rp:iscsi.iscsi-trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884.f586af" logLayer=csi_frontend multipathDevice= requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI scsiLun=0 workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> devices.waitForDevice" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="Device found." device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< devices.waitForDevice" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> devices.getDeviceFSType" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> devices.waitForDevice" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="Device found." device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< devices.waitForDevice" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> osutils.execCommandWithTimeout." args="[/dev/sdb]" command=blkid logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI timeout=5s workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> osutils.execCommandWithTimeoutAndInput." args="[/dev/sdb]" command=blkid logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI timeoutSeconds=5s workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< osutils.execCommandWithTimeoutAndInput." command=blkid error="exit status 2" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< osutils.execCommandWithTimeout." logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=info msg="Could not get FSType for device; err: exit status 2." device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< devices.getDeviceFSType" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> devices.isDeviceUnformatted" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> osutils.execCommandWithTimeout." args="[if=/dev/sdb bs=4096 count=512 status=none]" command=dd logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI timeout=5s workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> osutils.execCommandWithTimeoutAndInput." args="[if=/dev/sdb bs=4096 count=512 status=none]" command=dd logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI timeoutSeconds=5s workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< osutils.execCommandWithTimeoutAndInput." command=dd error="exit status 2" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< osutils.execCommandWithTimeout." logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=error msg="failed to read the device" device=/dev/sdb error="exit status 2" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< devices.isDeviceUnformatted" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=error msg="Unable to identify if the device is unformatted; err: exit status 2" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< iscsi.AttachISCSIVolume" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="Attach iSCSI volume is not complete, waiting." error="exit status 2" increment=5.51696103s logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
davidcheng0716 commented 2 months ago

Hi @brunnels,

Can you exec the command "dd if=/dev/sdb bs=4096 count=512 status=none" on the node?

The function just executes that command and checks the result.
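
Since the function only shells out to dd, the same probe can also be run from the node plugin pod itself, for example (a sketch; the namespace and pod name below are placeholders, not confirmed by this thread):

    # Find the actual node-plugin pod name first
    kubectl -n qnap-trident get pods | grep trident-node-linux
    # Run the same read test inside it; <pod-name> is a placeholder
    kubectl -n qnap-trident exec -it <pod-name> -- \
      dd if=/dev/sdb bs=4096 count=512 status=none | wc -c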

brunnels commented 2 months ago

The node is Talos, so there's no SSH access. I can exec into the trident-node-linux pod if that would help, but the logs indicate it would probably fail. All my Talos nodes have the iscsi-tools system extension.
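
For reference, the iscsi-tools extension is baked into the Talos image at build time; with Image Factory the schematic looks roughly like this (a sketch based on the Talos docs, not taken from this thread):

    # Image Factory schematic enabling the iscsi-tools system extension
    customization:
      systemExtensions:
        officialExtensions:
          - siderolabs/iscsi-tools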

brunnels commented 2 months ago

@davidcheng0716 It looks like synology-csi, which uses iSCSI for RWO and CIFS for RWX, works well under Talos. https://www.talos.dev/v1.7/kubernetes-guides/configuration/synology-csi/

Perhaps qnap-csi needs some additional configuration steps, like OpenEBS Jiva as detailed here?

brunnels commented 2 months ago

@davidcheng0716 I found the talos-debug-tools project, which creates a DaemonSet similar to the QNAP CSI driver's. I changed its namespace to match the QNAP CSI driver's namespace, and I am able to shell into the pod and run the dd command without error.

Maybe the debug DaemonSet could provide some insight into something the QNAP CSI DaemonSet might be missing?

brunnels commented 2 months ago

Comparing daemonset-trident-node-linux with the debug DaemonSet, the only things that stand out are the security contexts. The debug DaemonSet has this container securityContext:

        securityContext:
          allowPrivilegeEscalation: true
          capabilities:
            add:
              - SYS_ADMIN
          privileged: true

and this for spec.template.spec.securityContext:

      securityContext:
        runAsNonRoot: false
        seccompProfile:
          type: RuntimeDefault
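
As an experiment, the node DaemonSet could be patched to match (a hypothetical patch; the DaemonSet, namespace, and container names here are guesses, and the operator may reconcile the change away):

    kubectl -n qnap-trident patch daemonset trident-node-linux --type=strategic -p '
    spec:
      template:
        spec:
          containers:
            - name: trident-main
              securityContext:
                privileged: true
                allowPrivilegeEscalation: true
                capabilities:
                  add: ["SYS_ADMIN"]
    '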

LeonaChen2727 commented 2 months ago

Hi @brunnels, we have received your report and suspect the issue may be caused by a specific Linux distribution. We are actively working to resolve it.

brunnels commented 2 months ago

@davidcheng0716 Do you plan to release the source for the operator? If the source were available, I could make the necessary changes and test them.

brunnels commented 1 month ago

@davidcheng0716 Any movement on this? I'm happy to assist with testing.

davidcheng0716 commented 1 month ago

@brunnels Sorry for the late reply; I'll take a look at this issue. To reproduce it, I'm going to build the same environment. What kind of Talos installation do you have? Bare metal, VirtualBox, or something else?

brunnels commented 1 month ago

@davidcheng0716 I have bare metal. I'm using the https://github.com/onedr0p/cluster-template template, which makes it very quick to get a repeatable, secure, best-practices cluster up in a few minutes.

davidcheng0716 commented 1 month ago

@brunnels I guess the problem is that the node doesn't have the 'dd' tool; our service doesn't ship 'dd' in the container.

The Talos debug DaemonSet installs the tools in the container instead of calling the tool on the node. That is why it works when you shell into that pod.

I'm wondering whether the node itself has 'dd' available, outside of any container.
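
That could be checked directly from a privileged debug pod (a sketch assuming hostPID is set, so PID 1 is the host init; on a stock Talos node this is expected to fail, since the host rootfs ships essentially no userland tools):

    # Try to run dd inside the host's mount namespace
    nsenter --target 1 --mount -- dd --version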

brunnels commented 1 month ago

@davidcheng0716 No, that's not the problem. As in my prior comment, I can use the Talos debug DaemonSet to get into a pod in the same namespace on the same node and run dd against the iSCSI PVC that QNAP created. I think the issue lies with the pod securityContext.

By default, the Talos security policy prevents privilege escalation in pods, but it can be overridden with a namespace annotation plus the added capabilities: https://www.talos.dev/v1.7/kubernetes-guides/configuration/pod-security/

I have the annotation on the qnap-trident namespace, but I cannot change the pod securityContext set by the qnap-trident operator.
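
For reference, the exemption described in the linked guide is applied per namespace; the standard Pod Security admission control is a label along these lines (a sketch, not a command from this thread):

    # Allow privileged pods in the driver's namespace
    kubectl label namespace qnap-trident pod-security.kubernetes.io/enforce=privileged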

brunnels commented 1 month ago

@davidcheng0716 I just re-read your comment and looked at the debug-tools Dockerfile, and you are correct: it is installing dd into the pod. My apologies.

The node won't have dd. Talos is the bare minimum, so anything the pod needs, the pod itself must provide.
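
In other words, a driver image targeting Talos would need to bundle its own userland, along the lines of this illustrative Dockerfile fragment (an assumption about what a fix could look like, not this project's actual build):

    FROM alpine:3.20
    # Bundle the tools the node plugin shells out to; a Talos host provides none of them
    RUN apk add --no-cache coreutils util-linux e2fsprogs   # dd, blkid, mkfs.ext4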