siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev

Unable to mount PVC in a pod after provisioning zfs-generic-nvmeof volume with democratic-csi #9255

Closed linucksrox closed 1 month ago

linucksrox commented 1 month ago

Sorry if this is the wrong place to post this; it may not be a bug in Talos Linux, but I'm not sure where else to look.

Bug Report

I'm using democratic-csi with ZFS on Linux and can successfully create a PVC bound to a PV. But when I attempt to mount the volume in a pod, it gets stuck with the error: MountVolume.MountDevice failed for volume "pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7" : rpc error: code = Unknown desc = unable to attach any nvme devices

Description

I believe Talos nodes already have the NVMe modules built in to connect to an external NVMe over TCP target. In my case I'm running a separate TrueNAS Scale instance, and I manually configured NVMe-oF on it as the root user. I tested manually from a separate Linux VM and was able to add the NVMe share and mount it, along with reading/writing files to it.

With democratic-csi, I can create a PVC and it successfully provisions a PV and binds it:

root@localvm:~/# k get pvc
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     VOLUMEATTRIBUTESCLASS   AGE
testpvc   Bound    pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7   5Gi        RWO            truenas-nvmeof   <unset>                 45m
root@localvm:~/# k get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS     VOLUMEATTRIBUTESCLASS   REASON   AGE
pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7   5Gi        RWO            Delete           Bound    default/testpvc   truenas-nvmeof   <unset>                          45m

But when I run a test pod, it's stuck on the error message above. Here's the pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: testlogger
spec:
  containers:
  - name: testlogger
    image: alpine:3.20
    command: ["/bin/ash"]
    args: ["-c", "while true; do echo \"$(date) - test log\" >> /mnt/test.log && sleep 1; done"]
    volumeMounts:
    - name: testvol
      mountPath: /mnt
  volumes:
  - name: testvol
    persistentVolumeClaim:
      claimName: testpvc

Logs

MountVolume.MountDevice failed for volume "pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7" : rpc error: code = Unknown desc = unable to attach any nvme devices

Environment

smira commented 1 month ago

I wonder if it's a duplicate of https://github.com/siderolabs/talos/issues/9214

linucksrox commented 1 month ago

I saw that issue, but in my case I would need the nvme_tcp module. How can I verify it's enabled, and if not, how can I get it added?
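
As a rough sketch of how one might check from outside the node (the node IP below is a placeholder), the Talos API can read the module list directly:

# Loaded kernel modules on the node (built-in functionality will not show up here):
talosctl -n 10.0.50.10 read /proc/modules | grep nvme

# Kernel log messages can also hint at missing transport support:
talosctl -n 10.0.50.10 dmesg | grep -i nvme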

smira commented 1 month ago

They were missing in v1.7 and were added in the latest release, 1.8.0-alpha.2.
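
For reference, a sketch of moving a single node to that release, assuming the stock installer image (the node IP is a placeholder):

# Upgrade one node to the 1.8.0-alpha.2 installer image:
talosctl -n 10.0.50.10 upgrade --image ghcr.io/siderolabs/installer:v1.8.0-alpha.2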

linucksrox commented 1 month ago

I just added a worker node with 1.8.0-alpha.2 and moved my test pod to that node. I'm still seeing the same error: rpc error: code = Unknown desc = unable to attach any nvme devices

Are there any logs I could be checking to get more details on why this isn't connecting? Update: I just noticed logs on the dashboard that say nvme nvme0: failed to connect socket: -111, so maybe I'm on the right path.

smira commented 1 month ago

Try loading that module via the machine config, but I certainly don't know much about democratic-csi or what logs you should be checking.

linucksrox commented 1 month ago

I also don't know a ton about democratic-csi, but at this point I've determined that the node can't connect due to that kernel error -111, which means connection refused. I tested this exact PV by mounting it manually somewhere else; the steps there are:

# install the NVMe userspace tooling
sudo apt install nvme-cli
# load the NVMe over TCP transport module
sudo modprobe nvme_tcp
# query the target for the subsystems it exports
sudo nvme discover -t tcp -a 10.0.50.99 -s 4420
# attach the subsystem; this creates the local /dev/nvme0 controller and its namespace
sudo nvme connect -t tcp -n nqn.2003-1.org.linux-nvme:default-testpvc -a 10.0.50.99 -s 4420
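
If the connect succeeds, the namespace shows up as a local block device; a quick sanity check might look like:

# List attached NVMe controllers/namespaces:
sudo nvme list
# Confirm the block device was created:
lsblk /dev/nvme0n1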

Any way I can attempt the same process in a shell or something on the Talos node, so I can troubleshoot in a more familiar way?

smira commented 1 month ago

Everything in your list should just work from a debug pod, except for the module loading (which Talos restricts).

See my response above: load the kernel module by patching machine config with this:

machine:
  kernel:
    modules:
      - name: nvme_tcp
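
Assuming the snippet above is saved as nvme-tcp-patch.yaml (a hypothetical filename) and a placeholder node IP, applying it might look like:

# Apply the patch to one node (a reboot may be needed before the module is loaded):
talosctl -n 10.0.50.10 patch machineconfig --patch @nvme-tcp-patch.yaml

# Confirm the module is loaded:
talosctl -n 10.0.50.10 read /proc/modules | grep nvme_tcp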

Run a debug pod (not a Talos thing, but Kubernetes), and do your other stuff:

$ kubectl debug -it node/talos-default-worker-1 --image alpine -n kube-system --profile sysadmin

linucksrox commented 1 month ago

Thank you for all the help, and I fully realize this might not be a Talos issue. I forgot to mention in my last response that I had added the nvme_tcp kernel module via machine config, and rebooted to be sure it was applied. When I now go into the Alpine container, install nvme-cli, and try the discover command, it returns:

/ # nvme discover -t tcp -a 10.0.50.99 -s 4420
Failed to open ctrl nvme0, errno 2
failed to identify controller, error Bad file descriptor
failed to add controller, error Bad file descriptor

I don't think this means anything is wrong in Talos, but I just don't know how to troubleshoot at this point. I looked up the first line in the error message but can't find anything.

I'm also wondering if this means it's unable to add the device at /dev/nvme0n1 since that's what would normally happen. Is it possible that Talos could be preventing the device from being added? I'm grasping at straws here :)
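
One way to test that theory from the debug pod, as a sketch (placeholder node IP for the host-side check):

# Inside the debug container: does the controller device node exist?
ls -l /dev/nvme*

# And on the host itself, via the Talos API:
talosctl -n 10.0.50.10 list /dev | grep nvme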

smira commented 1 month ago

I don't know, but errno 2 should be "doesn't exist" (ENOENT). Not sure what "doesn't exist" means in this sense.

You can try running strace to see what exactly nvme fails on.
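
A sketch of doing that from inside the Alpine debug pod, using the same target address as above:

# Install tooling inside the debug container:
apk add strace nvme-cli

# Trace file-open syscalls to see exactly which path fails:
strace -f -e trace=open,openat nvme discover -t tcp -a 10.0.50.99 -s 4420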

Talos shouldn't prevent any operations from happening at this point, but I'm not exactly sure.

linucksrox commented 1 month ago

What stands out to me in the strace that exists on my test machine is this line: open("/sys/class/nvme/nvme0/tls_key", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)

I added these 3 kernel modules in Talos: nvme_core, nvme_tcp, and nvme_fabrics

On my test machine there are 2 additional modules, nvme_auth and nvme_keyring, which are not currently available in Talos: "error": "error loading module \"nvme_auth\": module not found\nerror loading module \"nvme_keyring\": module not found". Any chance that's why the tls_key can't be found?

smira commented 1 month ago

Nobody requested that, I guess... (no real reason)

smira commented 1 month ago

In fact, NVME_AUTH should be compiled in (it doesn't need a module), and NVME_KEYRING seems to be a 6.7+ thing; it's not available in Linux 6.6.

smira commented 1 month ago

Ref: https://cateee.net/lkddb/web-lkddb/NVME_KEYRING.html

https://github.com/siderolabs/pkgs/blob/4ce5bc6bbb87f1feeabadc90ef304e4f16c6da8f/kernel/build/config-amd64#L2055
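
To check how these options are set without a Talos node at hand, the pinned kernel config linked above can be grepped directly (the raw URL is derived from the blob link):

# Inspect the relevant NVMe options in the amd64 kernel config:
curl -s https://raw.githubusercontent.com/siderolabs/pkgs/4ce5bc6bbb87f1feeabadc90ef304e4f16c6da8f/kernel/build/config-amd64 | grep -E 'CONFIG_NVME_(AUTH|KEYRING|TCP|FABRICS)'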

linucksrox commented 1 month ago

Interesting. How do we make /sys/class/nvme/nvme0/tls_key show up? Or maybe that's completely unrelated to my issue. I have no idea at this point.

smira commented 1 month ago

My guess so far is that it's something new and not yet available in the Linux 6.6 kernel that Talos is using.

linucksrox commented 1 month ago

I set up an Ubuntu VM and installed kernel 6.6. I loaded nothing but the nvme_tcp kernel module, and the discover command worked as expected. I confirmed that neither the nvme_auth nor the nvme_keyring module exists there, and neither does /sys/class/nvme/nvme0/tls_key, so that is not what's causing my issue.

Any other ideas how we can make Talos connect to NVMe over TCP? I don't think it has anything to do specifically with democratic-csi (but I could be wrong). I'm also assuming this already worked in the past, based on the recommendation to use Mayastor, but maybe that was SPDK or something different from my use case.

I'm thinking about testing in a Talos dev environment but need to find time for that.

smira commented 1 month ago

I can't say that I know for sure; perhaps you can try an older Alpine image with an older version of the nvme tools?

linucksrox commented 1 month ago

I tried with an Alpine 3.14 container and it gave a slightly different error, but I think it's basically the same failure. strace on either version reveals that the errno 2 comes from being unable to open /dev/nvme0.

It seems like the debug container is not allowed to add the block device /dev/nvme0, which is a requirement for working with NVMe-oF. From everything I've tested, this appears to be a limitation in Talos, and I'm hoping you can help determine how to make this work. Is it possible this is partly related to https://github.com/siderolabs/talos/issues/8367?

I even tested with an Alpine 3.19 VM (with kernel 6.6) and don't experience this issue there.

frezbo commented 1 month ago

You might need a super privileged container with /dev mounted.
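
A minimal sketch of such a pod (the pod name and node name are placeholders): privileged, with the host's /dev bind-mounted so device nodes created by the kernel are visible inside the container.

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvme-debug
  namespace: kube-system
spec:
  nodeName: talos-default-worker-1   # pin to the node under test
  containers:
  - name: debug
    image: alpine:3.20
    command: ["/bin/sh", "-c", "sleep 2147483647"]
    securityContext:
      privileged: true
    volumeMounts:
    - name: dev
      mountPath: /dev
  volumes:
  - name: dev
    hostPath:
      path: /dev
EOF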

linucksrox commented 1 month ago

Thanks, that actually works when mounting /dev and running a privileged container. How does that relate back to the fact that Talos is not able to attach the PVC? Does that mean the relevant container needs to mount /dev or run as privileged, or something else?

smira commented 1 month ago

It doesn't relate in any way; it just shows there is no issue with Talos itself.

Mounting /dev in a container is always required for this; it's not related to Talos.

The issue should be tracked from the CSI side - what exactly fails there.
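
For completeness, a sketch of where to start looking on that side; the namespace, daemonset, and container names below are placeholders for whatever the democratic-csi chart was installed with:

# Find the node-plugin pod scheduled on the affected worker:
kubectl get pods -n democratic-csi -o wide

# Follow the driver container's logs while retrying the mount:
kubectl logs -n democratic-csi daemonset/zfs-nvmeof-node -c csi-driver -f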

linucksrox commented 1 month ago

Thanks for all the help! I have learned a lot, and I really appreciate you working through how to troubleshoot and prove where the issue is. I'll dig in on the CSI side and see if I can find a solution.

I also tested and confirmed this is working on Talos 1.7.5