libGL error: MESA-LOADER: failed to retrieve device information

woensug-choi commented 1 year ago

On a freshly installed Ubuntu 22.04 (Jammy) LTS, without any other setup, I installed rocker with:

pip3 install rocker
pip3 install --force-reinstall git+https://github.com/osrf/rocker.git@main
rocker --version
# rocker 0.2.12

and ran the example from the README:

rocker --nvidia --x11 osrf/ros:noetic-desktop-full gazebo

and got this error:

libGL error: MESA-LOADER: failed to retrieve device information

I was able to fix the problem by adding --volume /dev:/dev to the rocker arguments, which adds -v /dev:/dev to the docker arguments.

rocker --volume /dev:/dev --nvidia --x11 osrf/ros:noetic-desktop-full gazebo

tfoote commented 1 year ago

On your fresh install, do you have the NVIDIA drivers installed? You should also make sure that you've installed and set up nvidia-docker, or now the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

woensug-choi commented 1 year ago

Yes, I had the toolkit installed.

tfoote commented 9 months ago

As far as I can tell, your device isn't being mounted correctly. Your solution of mounting all of /dev tells me that the device is available; it's just a matter of understanding what your device is and making sure to mount it. Mounting all devices is too broad a brush. There is more specific feedback in #258, which I closed as too broad a solution. With a more targeted fix, we could add a solution.
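
For anyone trying to work out which device it is, here is a minimal sketch that lists the character device nodes on the host that are plausible GPU candidates. The glob patterns (/dev/nvidia* and /dev/dri/*) and the function name are assumptions for illustration, not something rocker does today.

```python
import glob
import os
import stat

# Assumed patterns: NVIDIA nodes and DRI nodes commonly used for GPU access.
CANDIDATE_PATTERNS = ("/dev/nvidia*", "/dev/dri/*")


def list_char_devices(patterns=CANDIDATE_PATTERNS):
    """Return (path, major, minor) for each character device matching the patterns."""
    found = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                st = os.stat(path)
            except OSError:
                continue  # node vanished or is not readable; skip it
            if stat.S_ISCHR(st.st_mode):  # ignore directories and regular files
                found.append((path, os.major(st.st_rdev), os.minor(st.st_rdev)))
    return found


if __name__ == "__main__":
    for path, major, minor in list_char_devices():
        print(f"{path}  major={major} minor={minor}")
```

Running it once on the host and once inside the container, then comparing the two lists, may help narrow down which node is missing.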

noah-curran commented 7 months ago

I'll preface this by saying that I'm a Docker noob, so it's entirely possible I'm doing something wrong... but I've been having this issue with my fresh install as well. I've spent several hours over the past 5 days on this, so I've ruled out many of the common points of advice, and I came here to discover this chain.

Similar to @woensug-choi, I'm using Ubuntu 22.04 and NVIDIA driver version 535. I have an RTX 4070 and understand that 535 is not a tested driver version, but my GPU does not support the maximum driver version tested, 470.

Before coming to rocker, I had been experimenting with mounting individual devices from /dev instead of mounting the whole directory as @woensug-choi suggested in his solution. While booting up some example Docker containers to inspect the issue, I noticed two things:

(1) I need to add --device /dev/nvidiactl. I'm not sure why, because a simple ls of /dev shows that this device is already present before I add it to the docker line. Without this step, though, I get the notorious "Failed to initialize NVML: Unknown Error" when I run nvidia-smi. After adding nvidiactl, I get "No devices were found", but at least nvidia-smi runs.

(2) I then need to add --device /dev/nvidia0:/dev/nvidia0 to the docker line. After this, it works as expected.
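
A node showing up in ls does not by itself show that a process is allowed to open it, so one way to dig into the nvidiactl puzzle is to try opening the nodes directly. The sketch below is only a diagnostic under that assumption; probe_device is a hypothetical helper name, and the two paths are the ones mentioned in this comment.

```python
import errno
import os


def probe_device(path):
    """Try to open a device node read/write and report the outcome as a string."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Return the symbolic errno name, e.g. ENOENT or EPERM.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"


if __name__ == "__main__":
    for node in ("/dev/nvidiactl", "/dev/nvidia0"):
        print(node, probe_device(node))
```

If a node is listed by ls but the probe reports EPERM or EACCES, that would point toward a permission or device-access restriction rather than a missing mount. That is only one possible explanation and has not been confirmed in this thread.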

My solution is a bit less general than @woensug-choi's, since it directly resolves the pain points that I discovered, but I don't think it's quite where it needs to be to merge into rocker, since I imagine it will fail for users who have more than one GPU. Maybe this info will help guide this issue.

FWIW, I do not believe this is a rocker-specific issue. It appears to be either a docker issue or an nvidia-docker issue. I think --gpus all is what should make all of this work, but for whatever reason it has just led to broken mounts. I have yet to dive deeper into the docker code to understand what that flag is actually doing, so I can't comment on it further beyond having a hunch that it is the root of the issue.

# nvidia_extension.py
# ...
class X11(RockerExtension):
    @staticmethod
    def get_name():
        return 'x11'

    def __init__(self):
        self.name = X11.get_name()
        self._env_subs = None
        self._xauth = None

    def get_docker_args(self, cliargs):
        assert self._xauth, 'xauth not initialized, get_docker_args must be called after precondition_environment'
        xauth = self._xauth.name
        # Here is where I have my changes: the two --device lines below.
        return "  -e DISPLAY -e TERM \
  -e QT_X11_NO_MITSHM=1 \
  -e XAUTHORITY=%(xauth)s -v %(xauth)s:%(xauth)s \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  --device /dev/nvidiactl \
  --device /dev/nvidia0:/dev/nvidia0 \
  -v /etc/localtime:/etc/localtime:ro " % locals()
# ...
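
The two hard-coded --device lines in the snippet above only cover a single GPU. As a rough sketch of how the multi-GPU case might be handled, the helper below builds the same kind of argument string for nvidiactl plus every nvidiaN entry it is given. The function name and its parameters are hypothetical and not part of rocker, and it deliberately covers only the two kinds of nodes mentioned in this thread; other setups may need additional nodes.

```python
import os
import re

_GPU_NODE = re.compile(r"^nvidia\d+$")  # matches nvidia0, nvidia1, ... only


def nvidia_device_args(dev_entries, dev_dir="/dev"):
    """Build '--device' args for nvidiactl and each nvidiaN name in dev_entries.

    dev_entries is a list of names found in dev_dir, e.g. os.listdir('/dev').
    """
    wanted = [n for n in dev_entries if n == "nvidiactl" or _GPU_NODE.match(n)]
    # Control node first, then GPU nodes in numeric order.
    wanted.sort(key=lambda n: -1 if n == "nvidiactl" else int(n[len("nvidia"):]))
    return " ".join(
        "--device %s/%s:%s/%s" % (dev_dir, n, dev_dir, n) for n in wanted
    )


if __name__ == "__main__":
    print(nvidia_device_args(os.listdir("/dev")))
```

The returned string could be spliced into the argument string in place of the two fixed lines. Whether that is the right fix depends on why the NVIDIA runtime is not exposing these nodes in the first place.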

tfoote commented 6 months ago

Thanks for the extra info and debugging. That sounds parallel to the need for /dev/dri/card0 with Intel integrated graphics. It seems like different NVIDIA cards/drivers may need different devices mounted.

osrf / rocker

libGL error: MESA-LOADER: failed to retrieve device information #257