Running

sudo kernelstub -a systemd.unified_cgroup_hierarchy=0

will revert to cgroups v1 and get the behavior nvidia-container-toolkit currently expects. That's the only change necessary at the moment, until NVIDIA releases 1.8.0 with the cgroupv2 fixes.
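If you want to confirm which cgroup version is active before and after, one quick check (a generic systemd sketch, not specific to this thread) is:

stat -fc %T /sys/fs/cgroup/
# prints "cgroup2fs" under cgroups v2 and "tmpfs" under cgroups v1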
Thanks for the insight! I put the config back and ran the kernelstub command that you recommended. That did the trick.
Hello, I apologize, I am very naive, but I wonder at what stage I should run the kernelstub command above. I just tried to run a Docker container with the command
tensorman run --gpu --python3 --jupyter bash
but I am faced with the error
Error response from daemon: failed to create shim: OCI runtime create failed:
and nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
Apologies for my naivety; I'm pretty new to Linux and Docker.
Solved.
I am using the GRUB bootloader rather than kernelstub, because I am dual-booting Pop!_OS on a Razer laptop with an RTX 3070.
sudo nano /etc/default/grub
# add this line: GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"
sudo update-grub
sudo reboot
This solved it for me.
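For anyone following along: after the reboot you can verify the parameter took effect (a generic check, not GRUB-specific) with:

cat /proc/cmdline
# the output should now contain systemd.unified_cgroup_hierarchy=0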
I have a workaround for this issue and it took some doing to get right. I felt the need to report this because Tensorman no longer works as expected out of the box. At a minimum, I hope reporting this issue might help someone else.
Hardware specs:
OS specs:
Linux shai-hulud 5.15.15-76051515-generic #202201160435~1642693824~21.10~97db1bb SMP Thu Jan 20 17:35:05 U x86_64 x86_64 x86_64 GNU/Linux
Description of problem: I upgraded from Pop!_OS 21.04 to 21.10 in December. Two weeks ago, I noticed that Pop!_Shop would not let me upgrade everything. It gave me the following error:
I removed nvidia-container-toolkit and nvidia-container-runtime and installed nvidia-container-toolkit and nvidia-docker2 in their place, as suggested by apt install. This got rid of the error from Pop!_Shop.

This is my Tensorman test script:
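A minimal sketch of such a GPU smoke test, assuming TensorFlow inside the Tensorman container (the original script is not reproduced here):

# hello-world.py (illustrative sketch; the original script is not shown above)
import tensorflow as tf

# The usual smoke test: does TensorFlow see any CUDA device?
print("GPUs:", tf.config.list_physical_devices("GPU"))

# A trivial computation to confirm the device is actually usable.
print(tf.reduce_sum(tf.random.normal([1000, 1000])))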
I ran the Tensorman test script and got the following output:
I reviewed this nvidia-docker issue and decided to change # no-cgroups = false to no-cgroups = true in /etc/nvidia-container-runtime/config.toml.
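For reference, the relevant part of that file after the edit looks roughly like this (excerpt only; the surrounding keys from the stock config are omitted):

[nvidia-container-cli]
# flipped from the commented-out default "#no-cgroups = false" so the CLI
# stops looking for the cgroup v1 devices controller
no-cgroups = true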
This got me closer.
However, it still states no CUDA-capable device is detected.
From that same nvidia-docker issue I found other docker commands that executed nvidia-smi in a container. I tried them on my Thelio and they did return information about the GPU from the container, so I surmised that perhaps Tensorman was not communicating this device information.
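One such invocation looks roughly like this (the image tag is illustrative; with no-cgroups = true the NVIDIA device nodes have to be passed to the container explicitly):

docker run --rm --gpus all \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi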
Workaround: I placed a Tensorman.toml in the same directory as hello-world.py with the following content:
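A sketch of the shape such a file takes, assuming a docker_flags field for passing extra arguments through to docker run (that field name is an assumption, as is the device list):

# Tensorman.toml (sketch; field names and values are assumptions)
tag = "latest"
variants = ["gpu"]
# with no-cgroups = true, the NVIDIA device nodes must be handed to the
# container explicitly
docker_flags = [
    "--device=/dev/nvidia0",
    "--device=/dev/nvidiactl",
    "--device=/dev/nvidia-uvm",
]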
and now I get this output:
which, in spite of the numerous "successful NUMA node read from SysFS had negative value (-1)" warnings, is the same output I was getting before the upgrade, and it contains the expected information about the GPU.
I tried uninstalling and reinstalling tensorman, and that did not fix the issue. I don't mind specifying this in Tensorman.toml, but it was not required previously.

At a minimum, the documentation at https://support.system76.com/articles/tensorman/ should be updated for the new NVIDIA packages.
Has anyone else noticed this? Is this the intended workaround going forward, or do we need to fix the defaults to match what I've put in my Tensorman.toml?