osrf / rocker

A tool to run docker containers with overlays and convenient options for things like GUIs etc.
Apache License 2.0

Add `-v /dev:/dev` to X11 default argument to fix libGL error: MESA-LOADER #258

Closed · woensug-choi closed this 4 months ago

woensug-choi commented 7 months ago

On a freshly installed Ubuntu 22.04 Jammy LTS, without doing anything else, I installed rocker with

pip3 install rocker
pip3 install --force-reinstall git+https://github.com/osrf/rocker.git@main
rocker --version
# rocker 0.2.12

and ran Example in README

rocker --nvidia --x11 osrf/ros:noetic-desktop-full gazebo

and got an error saying

libGL error: MESA-LOADER: failed to retrieve device information

I was able to fix the problem by adding `--volume /dev:/dev` to the rocker arguments, which adds `-v /dev:/dev` to the docker arguments.

rocker --volume /dev:/dev --nvidia --x11 osrf/ros:noetic-desktop-full gazebo

I believe the right place to add `-v /dev:/dev` is the `--x11` argument, since it wouldn't break anything even if `/dev` doesn't exist.

Related issues: https://github.com/osrf/rocker/issues/257 https://github.com/osrf/rocker/issues/206 https://github.com/kinu-garage/hut_10sqft/issues/819
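To illustrate the proposal, here is a minimal sketch of how the extra mount could be folded into the arguments the `--x11` extension emits, guarded so it stays harmless when `/dev` is absent. The function name and argument list are illustrative, not rocker's actual extension API:

```python
# Hypothetical sketch of the proposed change: append the /dev mount to the
# docker args produced for X11 support, but only when the path exists.
# This is NOT rocker's real extension interface, just an illustration.
import os

def x11_docker_args(dev_path="/dev"):
    """Return extra `docker run` arguments for X11 support."""
    args = ["-e", "DISPLAY", "-v", "/tmp/.X11-unix:/tmp/.X11-unix"]
    if os.path.exists(dev_path):  # guard: skip the mount if /dev is absent
        args += ["-v", f"{dev_path}:{dev_path}"]
    return args
```

With the guard, `x11_docker_args()` only adds `-v /dev:/dev` on hosts where `/dev` actually exists, which is the "wouldn't break" property claimed above.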

tfoote commented 7 months ago

That's an improperly broad solution for getting NVIDIA to work. In particular, you say this is on a fresh installation. Do you have your NVIDIA drivers set up, and nvidia-docker set up as well? The fact that it's trying to use MESA suggests that it's not detecting NVIDIA. The fix for a non-NVIDIA GPU is usually to use `--device /dev/dri/card0`. I would suggest that you try that instead of mounting the full `/dev`. If that doesn't fix it, you should find out the specific device that you need.

woensug-choi commented 7 months ago

> That's an improperly broad solution for getting NVIDIA to work. In particular, you say this is on a fresh installation. Do you have your NVIDIA drivers set up, and nvidia-docker set up as well? The fact that it's trying to use MESA suggests that it's not detecting NVIDIA. The fix for a non-NVIDIA GPU is usually to use `--device /dev/dri/card0`. I would suggest that you try that instead of mounting the full `/dev`. If that doesn't fix it, you should find out the specific device that you need.

I will try `--device /dev/dri/card0`. Meanwhile, I did install the NVIDIA driver and toolkit: `nvidia-smi` runs fine, `glxinfo` shows no particular error message, and `glxgears` runs fast.

woensug-choi commented 7 months ago

Yes, `--device /dev/dri/card0` solves the problem! Without it I get the same error. Does this mean we still need a PR to add `--device /dev/dri/card0`? I thought `--gpus all` would deal with this.

woensug-choi commented 7 months ago

Hmm. More notes,

Both with and without `--device /dev/dri/card0`, running `nvidia-smi` inside the container detects the correct NVIDIA GPU. But opening gazebo doesn't work without `--device /dev/dri/card0`.

Tested with

rocker --nvidia --x11 osrf/ros:noetic-desktop-full /bin/bash
nvidia-smi

prints

Wed Nov 29 20:51:02 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 3000 with Max...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P0              22W /  65W |      5MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

when running gazebo,

gazebo
# libGL error: MESA-LOADER: failed to retrieve device information

It's a fresh install of Ubuntu 22.04 with the recommended NVIDIA driver, which was selected automatically during the installation process.

tfoote commented 7 months ago

> I thought `--gpus all` would deal with this.

My understanding is that `--gpus all` only catches discrete GPUs. `/dev/dri/card0` is the integrated Intel graphics; if you're using that, I don't believe you're actually using the NVIDIA card.
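The distinction above is between DRM "card" nodes (full display nodes such as `card0`) and render-only nodes (such as `renderD128`) under `/dev/dri`. A small helper to see which nodes a host exposes, written as a pure function over a directory listing so it is easy to test (names and layout are a sketch, not anything rocker ships):

```python
# Minimal sketch: classify DRM device nodes under /dev/dri into "card"
# (full display nodes, e.g. card0) and "render" (render-only nodes, e.g.
# renderD128). Pure function over a listing, so no real devices needed.
import os
import re

def classify_dri_nodes(names):
    nodes = {"card": [], "render": []}
    for name in sorted(names):
        if re.fullmatch(r"card\d+", name):
            nodes["card"].append(f"/dev/dri/{name}")
        elif re.fullmatch(r"renderD\d+", name):
            nodes["render"].append(f"/dev/dri/{name}")
    return nodes

def list_dri_nodes(path="/dev/dri"):
    # On a real host, inspect the actual directory (empty result if absent).
    names = os.listdir(path) if os.path.isdir(path) else []
    return classify_dri_nodes(names)
```

On a laptop with hybrid graphics, `list_dri_nodes()` will typically show more than one entry, and which `cardN` belongs to which GPU varies between machines.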

> with recommended NVIDIA driver installation

This may be a generic Ubuntu recommendation, not necessarily what you want.

I see that you're using the 535 driver. I have only ever tested up to the 470 NVIDIA driver (see the README). There's a level of required compatibility between the internal and external drivers.

Are you using the 535-open drivers? https://forums.linuxmint.com/viewtopic.php?t=401149

I see issues with 535 reported here too: https://github.com/NVIDIA/nvidia-docker/issues/1767

woensug-choi commented 7 months ago

The 535 NVIDIA driver I used wasn't the open version; it's listed separately from the open driver variants in Additional Drivers on Ubuntu 22.04. I've also tested with 470.223.02, the proprietary version of the NVIDIA driver (I did reboot, by the way), but it didn't work either.

(Screenshot from 2023-11-30 13-16-55 attached)

tfoote commented 7 months ago

I'm not sure what I can do to help you. I can't reproduce your issue. Does Gazebo run on the host machine with NVIDIA support?

tfoote commented 4 months ago

https://github.com/containers/podman/issues/7801#issuecomment-722574489

I found another pattern of potential devices that might be a solution instead of `/dev/dri/card0`: `/dev/dri/renderD128`. Also, I see that the p16s can also have NVIDIA chips, so that might mount up in a different location. Because this is too generic, I don't want to merge it as proposed, so I'm going to close this. But if there's a more specific device that we can detect and mount if needed, that would be a good reason to reopen this.
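A detect-and-mount approach along those lines could look like the sketch below: emit `--device` flags only for DRI nodes that actually exist on the host, preferring render-only nodes (e.g. `/dev/dri/renderD128`) over full card nodes. The function and its preference rule are hypothetical, not an implemented rocker feature:

```python
# Hypothetical detect-and-mount sketch: build `docker run --device` flags
# from whatever DRI nodes the host actually has, preferring render nodes
# (renderD*) over full card nodes (card*). Illustration only.
def device_args(existing_nodes):
    """existing_nodes: iterable of device paths present on the host."""
    nodes = list(existing_nodes)
    preferred = [n for n in nodes if "/renderD" in n] or nodes
    args = []
    for node in preferred:
        args += ["--device", node]
    return args
```

Mounting a render node rather than all of `/dev` keeps the container's device exposure narrow, which addresses the "too generic" objection that led to closing this issue.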