Open k-e-i-z-a-i opened 2 years ago
I have the same issue. Unfortunately, the nvidia-docker2 package from the Pop OS repo doesn't seem to work. I installed nvidia-docker2 from NVIDIA's official repo instead and it worked fine.
If you don't know how to install it, follow these instructions: https://github.com/pop-os/pop/issues/1708#issuecomment-877830843
Broken on 22.04 too. Installing the latest *nvidia-container* packages from the official repo fixes it.
There's an issue on the nvidia-docker repo referencing this exact problem. @mmstick @elezar you appear to be the recent maintainers for this repo, would it be possible to implement a fix? The last response on that issue seems to identify the cause as a compile-time option in libnvidia-container, which I have also opened an issue on.
@berkgercek which issue do you mean in the libnvidia-container repo? Note that I am a maintainer of the upstream (NVIDIA) repo and not the Pop fork.
@berkgercek @elezar NVCGO is disabled because it fails to compile when enabled.
libnvidia-container/src/cgroup.c:31:16: error: variable ‘res’ has initializer but incomplete type
31 | struct nvcgo_get_device_cgroup_version_res res = {0};
libnvidia-container/src/cgroup.c:31:52: error: storage size of ‘res’ isn’t known
31 | struct nvcgo_get_device_cgroup_version_res res = {0};
libnvidia-container/src/cgroup.c:37:46: warning: implicit declaration of function ‘nvcgo_get_device_cgroup_version_1’; did you mean ‘get_device_cgroup_version’? [-Wimplicit-function-declaration]
37 | if (call_rpc(err, &nvcgo->rpc, &res, nvcgo_get_device_cgroup_version_1, (char*)proc_root, cnt->cfg.pid) < 0)
libnvidia-container/src/error.h:28:9: error: static assertion failed: "incompatible alignment"
28 | static_assert(alignof(*err) == alignof(*xdr), "incompatible alignment"); \
libnvidia-container/src/cgroup.c:182:91: error: unknown type name ‘nvcgo_setup_device_cgroup_res’
182 | nvcgo_setup_device_cgroup_1_svc(ptr_t ctxptr, int dev_cg_version, char *dev_cg, dev_t id, nvcgo_setup_device_cgroup_res *res, maybe_unused struct svc_req *req)
I followed the NVIDIA Container Toolkit installation guide on Pop OS 21.10, but when I run
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
I get the following error message:
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
My understanding is that, because I'm running Pop OS, it's this project's version of the Container Toolkit that was installed on my machine.
How can this issue be fixed?
For additional background, here's the first line of what I get when I run nvidia-smi on my machine:
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4
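
For anyone trying to confirm they are hitting the same problem: the "cgroup subsystem devices not found" error is likely tied to the host using the unified cgroup v2 hierarchy while the installed libnvidia-container build lacks cgroup v2 support (the NVCGO option discussed in this thread). The sketch below checks which hierarchy the host uses by reading the filesystem type of /sys/fs/cgroup. The helper name classify_cgroup_fs is made up for illustration and is not part of any NVIDIA or Docker tool; the mapping assumes the usual convention that cgroup2fs means v2 and tmpfs means v1 (or hybrid).

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version label.
# classify_cgroup_fs is a hypothetical helper, not part of any NVIDIA tool.
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "v2" ;;       # unified hierarchy
        tmpfs)     echo "v1" ;;       # legacy or hybrid hierarchy
        *)         echo "unknown" ;;  # anything else, or no output from stat
    esac
}

# Only probe the host when the cgroup mount point exists (Linux).
if [ -d /sys/fs/cgroup ]; then
    fs_type=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "")
    echo "cgroup hierarchy: $(classify_cgroup_fs "$fs_type")"
fi
```

If this reports v2 and the docker run command above still fails with the same message, that is consistent with the diagnosis in this thread, and installing the nvidia-container packages from the official NVIDIA repo is the workaround other commenters reported.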