pop-os / nvidia-docker

Packaging for https://github.com/NVIDIA/nvidia-docker
Apache License 2.0

Failed to Create Shim Task: OCI Runtime Create Failed #3

Open k-e-i-z-a-i opened 2 years ago

k-e-i-z-a-i commented 2 years ago

I followed the NVIDIA Container Toolkit installation guide to install it on Pop OS 21.10. After completing the guide, running sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi fails with the following error message:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

My understanding is that, because I'm running Pop OS, it's this project's version of the Container Toolkit that was installed on my machine.

How can this issue be fixed?

For additional background, here's the first line of what I get when I run nvidia-smi on my machine:

NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4
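
One plausible reading of the "cgroup subsystem devices not found" message, not confirmed in this thread, is that the host exposes only the unified cgroup v2 hierarchy, so a lookup for a cgroup v1 devices controller finds nothing. The sketch below is a minimal way to check that assumption from text in /proc/mounts format. The function name cgroup_layout and its return labels are hypothetical, chosen only for illustration.

```python
def cgroup_layout(mounts_text: str) -> str:
    """Classify cgroup mounts found in /proc/mounts-format text.

    Returns "v1-devices" if a cgroup v1 hierarchy with the devices
    controller is mounted, "v2-only" if only a cgroup2 mount is found,
    and "none" otherwise. The labels are illustrative, not an official API.
    """
    has_v1_devices = False
    has_v2 = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed mount entry
        fstype = fields[2]
        options = fields[3].split(",")
        # On cgroup v1, the controller name appears among the mount options.
        if fstype == "cgroup" and "devices" in options:
            has_v1_devices = True
        elif fstype == "cgroup2":
            has_v2 = True
    if has_v1_devices:
        return "v1-devices"
    if has_v2:
        return "v2-only"
    return "none"


# Example entries in /proc/mounts format (made up for illustration).
V1_SAMPLE = "cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0\n"
V2_SAMPLE = "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0\n"

print(cgroup_layout(V1_SAMPLE))
print(cgroup_layout(V2_SAMPLE))
```

On a Linux host, passing the contents of /proc/mounts to cgroup_layout would show which case applies. A "v2-only" result would be consistent with the assumption above, but it does not prove it is the cause of this error.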

bassemkaroui commented 2 years ago

I have the same issue. Unfortunately, the nvidia-docker2 package from the Pop OS repo doesn't seem to work. I installed nvidia-docker2 from NVIDIA's official repo and it worked fine. If you don't know how to install it, follow these instructions: https://github.com/pop-os/pop/issues/1708#issuecomment-877830843

marksumm commented 2 years ago

Broken on 22.04 too. Installing the latest *nvidia-container* packages from the official repo fixes it.

berkgercek commented 1 year ago

There's an issue on the nvidia-docker repo referencing this exact problem. @mmstick @elezar you appear to be the recent maintainers for this repo. Would it be possible to implement a fix? The last response on that issue seems to identify the cause as a compile-time option in libnvidia-container, where I have also opened an issue.

elezar commented 1 year ago

@berkgercek which issue do you mean in the libnvidia-container repo? Note that I am a maintainer of the upstream (NVIDIA) repo and not the pop fork.

mmstick commented 1 year ago

@berkgercek @elezar NVCGO is disabled because it fails to compile when enabled.

libnvidia-container/src/cgroup.c:31:16: error: variable ‘res’ has initializer but incomplete type
   31 |         struct nvcgo_get_device_cgroup_version_res res = {0};
libnvidia-container/src/cgroup.c:31:52: error: storage size of ‘res’ isn’t known
   31 |         struct nvcgo_get_device_cgroup_version_res res = {0};
libnvidia-container/src/cgroup.c:37:46: warning: implicit declaration of function ‘nvcgo_get_device_cgroup_version_1’; did you mean ‘get_device_cgroup_version’? [-Wimplicit-function-declaration]
   37 |         if (call_rpc(err, &nvcgo->rpc, &res, nvcgo_get_device_cgroup_version_1, (char*)proc_root, cnt->cfg.pid) < 0)
libnvidia-container/src/error.h:28:9: error: static assertion failed: "incompatible alignment"
   28 |         static_assert(alignof(*err) == alignof(*xdr), "incompatible alignment");  \
libnvidia-container/src/cgroup.c:182:91: error: unknown type name ‘nvcgo_setup_device_cgroup_res’
  182 | nvcgo_setup_device_cgroup_1_svc(ptr_t ctxptr, int dev_cg_version, char *dev_cg, dev_t id, nvcgo_setup_device_cgroup_res *res, maybe_unused struct svc_req *req)
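
For triaging a log like the one above, a small script can tally diagnostics per file and severity. This is only a sketch: the regular expression assumes the usual file:line:col: severity: message layout seen in this output, and the names DIAG and summarize are hypothetical.

```python
import re
from collections import Counter

# Matches diagnostic header lines such as
#   path/to/file.c:31:16: error: some message
# Source-context lines ("   31 |   ...") do not match and are skipped.
DIAG = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):(?P<col>\d+): "
    r"(?P<severity>error|warning|note): (?P<message>.*)$"
)


def summarize(log: str) -> Counter:
    """Count diagnostics keyed by (file, severity)."""
    counts = Counter()
    for raw in log.splitlines():
        match = DIAG.match(raw.strip())
        if match:
            counts[(match["file"], match["severity"])] += 1
    return counts


# Excerpt copied from the compiler output in this thread.
SAMPLE = """\
libnvidia-container/src/cgroup.c:31:16: error: variable ‘res’ has initializer but incomplete type
   31 |         struct nvcgo_get_device_cgroup_version_res res = {0};
libnvidia-container/src/cgroup.c:31:52: error: storage size of ‘res’ isn’t known
libnvidia-container/src/error.h:28:9: error: static assertion failed: "incompatible alignment"
"""

for (path, severity), count in sorted(summarize(SAMPLE).items()):
    print(f"{path}: {count} {severity}(s)")
```

Run over the full log, a tally like this makes it easy to see which source files account for the failures when NVCGO is enabled.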