Closed: linucksrox closed this issue 3 months ago.
I wonder if it's a duplicate of https://github.com/siderolabs/talos/issues/9214
I saw that issue, but in my case I would need the nvme_tcp module. How can I verify it's enabled, and if not, how can I get it added?
They were missing in v1.7, and added in the latest release 1.8.0-alpha.2
I just added a worker node with 1.8.0-alpha.2 and moved my test pod to that node. I'm still seeing the same error: rpc error: code = Unknown desc = unable to attach any nvme devices
Are there any logs I could be checking to get more details on why this isn't connecting?
Update: I just noticed logs on the dashboard that say nvme nvme0: failed to connect socket: -111
so maybe I'm on the right path.
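As an aside, kernel log error codes like -111 are negated errno values, so they can be decoded with a quick lookup. A small sketch (Python is used here only as a portable errno table; the mapping shown is for Linux):

```shell
# -111 in a kernel message is errno 111, which on Linux is ECONNREFUSED
python3 -c 'import errno, os; print(errno.errorcode.get(111), "-", os.strerror(111))'
```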
Try loading that module via machine config. I certainly don't know much about democratic-csi, though, or which logs you should be checking.
I also don't know a ton about democratic-csi, but at this point I've determined that the node can't connect due to that kernel error -111 which means connection refused. I tested this exact PV by mounting it manually somewhere else, and the steps there are:
sudo apt install nvme-cli
sudo modprobe nvme_tcp
sudo nvme discover -t tcp -a 10.0.50.99 -s 4420
sudo nvme connect -t tcp -n nqn.2003-1.org.linux-nvme:default-testpvc -a 10.0.50.99 -s 4420
Any way I can attempt the same process in a shell or something on the Talos node, so I can troubleshoot in a more familiar way?
Everything in your list should just work from a debug pod, except for the module loading (which Talos restricts).
See my response above: load the kernel module by patching machine config with this:
machine:
  kernel:
    modules:
      - name: nvme_tcp
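For reference, one way to write and apply that patch as a sketch. The node address 10.0.50.10 and the file name are placeholders, and the talosctl commands are commented out because they need a live cluster:

```shell
# Write the machine config patch described above to a file
cat > nvme-modules-patch.yaml <<'EOF'
machine:
  kernel:
    modules:
      - name: nvme_tcp
EOF

# Apply it to a node and confirm the module is loaded (requires a live cluster):
# talosctl -n 10.0.50.10 patch machineconfig --patch @nvme-modules-patch.yaml
# talosctl -n 10.0.50.10 read /proc/modules | grep nvme

# Sanity check on the patch file itself (prints 1)
grep -c 'name: nvme_tcp' nvme-modules-patch.yaml
```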
Run a debug pod (not a Talos thing, but Kubernetes), and do your other stuff:
$ kubectl debug -it node/talos-default-worker-1 --image alpine -n kube-system --profile sysadmin
Thank you for all the help, and I fully realize this might not be a Talos issue. I forgot to mention that I added the nvme_tcp kernel module via machine config in my last response, and rebooted to be sure it was applied. When I go into the alpine container now, install nvme-cli, and try the discover command it returns
/ # nvme discover -t tcp -a 10.0.50.99 -s 4420
Failed to open ctrl nvme0, errno 2
failed to identify controller, error Bad file descriptor
failed to add controller, error Bad file descriptor
I don't think this means anything is wrong in Talos, but I just don't know how to troubleshoot at this point. I looked up the first line in the error message but can't find anything.
I'm also wondering if this means it's unable to add the device at /dev/nvme0n1 since that's what would normally happen. Is it possible that Talos could be preventing the device from being added? I'm grasping at straws here :)
I don't know, but errno 2 should be "doesn't exist". Not sure what "doesn't exist" means in this sense.
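For what it's worth, errno 2 can be decoded the same way as the earlier kernel error; it is ENOENT, i.e. "No such file or directory":

```shell
# errno 2 is ENOENT: some path the tool tried to open does not exist
python3 -c 'import errno, os; print(errno.errorcode[2], "-", os.strerror(2))'
```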
You can try running strace to see what exactly nvme fails on?
Talos shouldn't prevent any operations from happening at this point, but I'm not exactly sure.
What stands out to me in the strace output on my test machine is this line: open("/sys/class/nvme/nvme0/tls_key", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
I added these 3 kernel modules in Talos: nvme_core, nvme_tcp, and nvme_fabrics.
On my test machine there are 2 additional modules, nvme_auth and nvme_keyring, which are not available in Talos currently: "error": "error loading module \"nvme_auth\": module not found\nerror loading module \"nvme_keyring\": module not found"}
Any chance that's why the tls_key can't be found?
Nobody requested them, I guess... (no real reason)
In fact, NVME_AUTH should be compiled in (it doesn't need a module), and NVME_KEYRING seems to be a 6.7+ thing; it's not available in Linux 6.6.
Interesting. How do we make /sys/class/nvme/nvme0/tls_key show up? Or maybe that's completely unrelated to my issue; I have no idea at this point.
My guess so far is that it's something new and not available yet in the Linux 6.6 that Talos is using.
I set up an Ubuntu VM and installed kernel 6.6. I was able to load nothing but the nvme_tcp kernel module, and then the discover command worked as expected. I confirmed that neither the nvme_auth nor nvme_keyring modules exist, and neither does /sys/class/nvme/nvme0/tls_key, so that is not causing my issue.
Any other ideas how we can make Talos connect to NVMe over TCP? I don't think it has anything to do specifically with democratic-csi (but I could be wrong). I'm also assuming this already worked in the past, based on the recommendation to use Mayastor, but maybe that was SPDK or something different from my use case.
I'm thinking about testing in a Talos dev environment but need to find time for that.
I can't say that I know for sure; perhaps you could try an older Alpine image with an older version of the nvme tools?
I tried with an Alpine 3.14 container and it gave a slightly different error, but I think it's basically the same failure. strace on either version reveals that the error 2 comes from being unable to open /dev/nvme0.
It seems like the debug container is not allowed to add the block device /dev/nvme0, which is a requirement for working with NVMe-oF. From everything I've tested, this appears to be a limitation in Talos, and I'm hoping you can help determine how to make this work. Is it possible this is partly related to https://github.com/siderolabs/talos/issues/8367 ?
I even tested with an Alpine 3.19 VM (with kernel 6.6) and don't experience this issue there.
You might need a super-privileged container with /dev mounted.
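A minimal sketch of such a pod, assuming an Alpine image and a worker named talos-default-worker-1 (both placeholders); the kubectl command is commented out because it needs a running cluster:

```shell
# Hypothetical manifest: privileged pod with the host's /dev bind-mounted
cat > nvme-debug-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvme-debug
  namespace: kube-system
spec:
  nodeName: talos-default-worker-1
  hostNetwork: true
  containers:
    - name: debug
      image: alpine
      command: ["tail", "-f", "/dev/null"]
      securityContext:
        privileged: true
      volumeMounts:
        - name: dev
          mountPath: /dev
  volumes:
    - name: dev
      hostPath:
        path: /dev
EOF

# kubectl apply -f nvme-debug-pod.yaml   # requires a running cluster

# Sanity check on the manifest (prints 1)
grep -c 'privileged: true' nvme-debug-pod.yaml
```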
Thanks, that actually works when mounting /dev and running a privileged container. How does that relate back to the fact that Talos is not able to attach to the PVC? Does that mean the Talos container needs to mount /dev or run as privileged, or something else?
It doesn't relate in any way; it just shows there is no issue with Talos itself. Mounting /dev in a container is always required for this; it's not related to Talos.
The issue should be tracked from the CSI side - what exactly fails there.
Thanks for all the help! I have learned a lot, and I really appreciate the help working through how to troubleshoot and prove where the issue is. I'll dig in on the CSI side and see if I can find a solution.
I also tested and confirmed this is working on Talos 1.7.5
Sorry if this is the wrong place to post this, it may not be a bug in Talos Linux but I'm not sure where to look.
Bug Report
I'm using democratic-csi with ZFS on Linux and can successfully create a PVC bound to a PV. Now when I attempt to mount the volume in a pod, it's stuck with the error
MountVolume.MountDevice failed for volume "pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7" : rpc error: code = Unknown desc = unable to attach any nvme devices
Description
I believe Talos nodes already have the nvme modules built in to connect to an external nvme over TCP mount. In my case I'm running a separate TrueNAS Scale instance and I manually configured it as the root user for nvmeof. I tested manually from a separate Linux VM and was able to add the nvme share and mount it, along with reading/writing files to it.
With democratic-csi, I can create a PVC and it successfully provisions a PV and binds it:
But running a test pod, it's stuck on the error message above. Here's the pod yaml:
Logs
MountVolume.MountDevice failed for volume "pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7" : rpc error: code = Unknown desc = unable to attach any nvme devices
Environment