woensug-choi opened this issue 1 year ago
On your fresh install, do you have the NVIDIA drivers installed? And you should make sure that you've installed and set up nvidia-docker, or now the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
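A quick way to sanity-check that outside of rocker is to run nvidia-smi both on the host and through a plain docker container (the CUDA image tag below is just an example; any CUDA base image available to you should do):

# On the host: confirm the driver itself works
nvidia-smi

# Through Docker: confirm the NVIDIA Container Toolkit is wired in
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi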
Yes. I had the toolkit installed.
As far as I can tell your device isn't being mounted correctly. Your solution of mounting all of /dev tells me that the device is available; it's just a matter of understanding what your device is and making sure to mount it. Mounting all devices is too broad a brush. There's more specific feedback in #258, which I closed as too broad a solution, but with a more targeted fix we could add a solution.
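For example, listing the device nodes on the host shows which ones the NVIDIA driver actually created and which would be candidates for a targeted mount (names vary with driver version and GPU count):

# Device nodes created by the NVIDIA driver on the host
ls -l /dev/nvidia*
# Plus the DRI nodes used for Intel integrated graphics
ls -l /dev/dri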
I'll preface this by saying that I'm a Docker noob, so it's entirely possible I'm doing something wrong... but I've been having this issue with my fresh install as well. I've spent several hours over the past five days on this, so I've ruled out many of the common points of advice and came here to discover this thread.
Similar to @woensug-choi, I'm using Ubuntu 22.04 and NVIDIA driver version 535. I have an RTX 4070 and understand that 535 is not a tested driver version, but my GPU does not support the maximum tested driver version, 470.
Before coming to rocker, I had been experimenting with mounting individual devices in /dev instead of mounting the whole directory as @woensug-choi suggested in his solution. Booting up some example containers to inspect the issue, I noticed two things: (1) I need to add --device /dev/nvidiactl. I'm uncertain why, because a simple ls of /dev shows that this device is present before adding it to the docker line... but without this step I get the notorious "Failed to initialize NVML: Unknown Error" when I try running nvidia-smi. After adding nvidiactl, I get "No devices were found", but at least nvidia-smi runs. This leads to (2) I need to add --device /dev/nvidia0:/dev/nvidia0 to the docker line. After this, it works as expected.
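Concretely, the kind of plain docker invocation this amounts to looks roughly like the following (the CUDA image tag is just a stand-in for whatever image you are testing with):

docker run --rm --gpus all \
  --device /dev/nvidiactl \
  --device /dev/nvidia0:/dev/nvidia0 \
  nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi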
My solution is a bit less general than @woensug-choi's, since it directly resolves the pain points I discovered, but I don't think it's quite where it needs to be to merge into rocker, since I imagine it will fail for users who have more than one GPU (see the sketch below). Maybe this info will help guide this issue.
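A rough, untested sketch of how the same idea could cover multi-GPU machines would be to enumerate whatever NVIDIA device nodes exist on the host instead of hard-coding nvidia0 (the image tag is again just an example):

# Build --device flags for every NVIDIA device node present on the host
DEVICE_ARGS=""
for dev in /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia[0-9]*; do
  [ -e "$dev" ] && DEVICE_ARGS="$DEVICE_ARGS --device $dev"
done
docker run --rm --gpus all $DEVICE_ARGS nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi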
FWIW, I do not believe this is a rocker-specific issue. It appears to be either a docker issue or an nvidia-docker issue. I think --gpus all is what should make all of this work, but for whatever reason it has just led to broken mounts. I have yet to dive deeper into docker's code to understand what that flag is actually doing, so I can't comment on it further beyond a hunch that it is the root of the issue.
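One low-effort way to test that hunch is to look at which device nodes actually show up inside a container started with only --gpus all (the glob has to be expanded inside the container, hence the sh -c):

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 sh -c 'ls -l /dev/nvidia*'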
# nvidia_extension.py
# ...
class X11(RockerExtension):
    @staticmethod
    def get_name():
        return 'x11'

    def __init__(self):
        self.name = X11.get_name()
        self._env_subs = None
        self._xauth = None

    def get_docker_args(self, cliargs):
        assert self._xauth, 'xauth not initialized, get_docker_args must be called after precondition_environment'
        xauth = self._xauth.name
        # The two --device lines for /dev/nvidiactl and /dev/nvidia0 below are my
        # changes; the rest of the argument string is unchanged from upstream.
        return " -e DISPLAY -e TERM \
  -e QT_X11_NO_MITSHM=1 \
  -e XAUTHORITY=%(xauth)s -v %(xauth)s:%(xauth)s \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  --device /dev/nvidiactl \
  --device /dev/nvidia0:/dev/nvidia0 \
  -v /etc/localtime:/etc/localtime:ro " % locals()
# ...
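For anyone trying this patch, the rocker invocation I'd expect to exercise it is the usual x11/nvidia combination (the image and command here are just placeholders):

rocker --nvidia --x11 ubuntu:22.04 nvidia-smi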
Thanks for the extra info and debugging. That sounds parallel to the need for the Intel integrated graphics /dev/dri/card0. It seems like the different cards/drivers for NVIDIA may need different devices mounted.
On a freshly installed Ubuntu 22.04 Jammy LTS, without doing anything else, I installed rocker with,
ran the example in the README,
and got an error saying
I was able to fix the problem by adding --volume /dev:/dev to the rocker arguments, which adds -v /dev:/dev to the docker arguments (spelled out as a full command line below).
Related articles: https://github.com/osrf/rocker/issues/206 and https://github.com/kinu-garage/hut_10sqft/issues/819
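Spelled out, the workaround looks roughly like this (the image and command are placeholders for the README example; the -- separator is there because rocker's --volume can take multiple paths):

rocker --nvidia --x11 --volume /dev:/dev -- ubuntu:22.04 nvidia-smi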