No --device=/dev/dri, glxgears is not running on nvidia gpu

LiangHuangBC commented 3 years ago

My host machine runs Ubuntu 18.04 with 4 NVIDIA GPUs. I don't have a /dev/dri node; with the NVIDIA driver I have /dev/nvidia0, /dev/nvidia1, /dev/nvidia2, and /dev/nvidia3 instead. I can docker run the container and connect over VNC, but I cannot use the GPU.

I can run glxgears or vglrun glxgears, but not on the GPU. The command to start the container is:
docker run --gpus 1 -it -e SIZEW=1920 -e SIZEH=1080 -e CDEPTH=24 -e SHARED=TRUE -e VNCPASS=vncpasswd -p 5901:5901 -v "/tmp/.docker.xauth:/tmp/.docker.xauth" -e XAUTHORITY=/tmp/.docker.xauth --name vv --rm --runtime=nvidia ehfd/nvidia-egl-desktop:latest

ehfd commented 3 years ago

Yeah. llvmpipe means you are not using the GPU. You need the /dev/dri DRM devices, which should be available if kernel modesetting was set up correctly during the NVIDIA driver install. You probably need to reinstall the NVIDIA driver and blacklist the Nouveau open source driver, or install the NVIDIA driver with your OS's own package manager, then enable the Direct Rendering Manager. This is not a container issue. I will keep the issue open for more questions. NOTE: https://download.nvidia.com/XFree86/Linux-x86_64/470.42.01/README/kms.html
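To confirm which renderer is in use, the OpenGL renderer string can be inspected. The following is a minimal sketch, assuming glxinfo (from mesa-utils) is installed in the container; is_software_renderer is a hypothetical helper name, not part of the image.

```shell
#!/bin/sh
# Hypothetical helper: reads glxinfo output on stdin and succeeds when the
# renderer string names a software rasterizer such as llvmpipe.
is_software_renderer() {
    grep 'OpenGL renderer string' | grep -qiE 'llvmpipe|softpipe|swrast'
}

# Only query the display when glxinfo is actually available.
if command -v glxinfo >/dev/null 2>&1; then
    if glxinfo -B 2>/dev/null | is_software_renderer; then
        echo "software rendering: the GPU is not in use"
    else
        echo "no software renderer reported (or the display could not be queried)"
    fi
fi
```

Running the same check through vglrun (vglrun glxinfo -B) shows what VirtualGL applications will see.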

LiangHuangBC commented 3 years ago

Thanks for the quick reply. Your other image, ehfd/docker-nvidia-glx-desktop, is working well, so we are using it now. Thanks for your work.

ehfd commented 3 years ago

The ehfd/docker-nvidia-egl-desktop container is guaranteed to work on multiple desktops, compared with ehfd/docker-nvidia-glx-desktop (which uses quite hacky configurations to run Xorg inside containers), and is the recommended container to use. But ehfd/docker-nvidia-egl-desktop doesn't support Vulkan. If you need Vulkan, use https://github.com/mviereck/x11docker, but it requires Docker on a single node with full root control and cannot be used on other container orchestration platforms. The ehfd/docker-nvidia-glx-desktop container is a limited functionality container designed for multi-node clusters with container orchestration, or for when root access is not available, until Xwayland support for the NVIDIA proprietary drivers is available.

TROUBLESHOOTING: With this problem, you likely did not disable Nouveau properly, or the kernel modules for your NVIDIA drivers are set up incorrectly. It is recommended to install the drivers with your package manager instead of with .run files.

Uninstall the NVIDIA drivers completely and reinstall them (if you don't know how to uninstall, install the whole OS again), preferably with your Linux distribution's package manager (using sudo or as root), and then reboot. Check whether the NVIDIA DRM module is loaded with lsmod | grep nvidia.drm. The instructions below are for newer cards; discontinued GPUs may need a different package version (e.g. 390.xx).
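The check after reboot can be sketched as a small script. This is an illustration only: check_drm is a hypothetical helper name, and its optional arguments exist solely so the paths can be overridden.

```shell
#!/bin/sh
# Hypothetical helper: reports whether DRM device nodes exist and whether the
# nvidia_drm kernel module is loaded.
#   $1: DRM device directory (default /dev/dri)
#   $2: loaded-module list   (default /proc/modules, the file lsmod reads)
check_drm() {
    dri_dir="${1:-/dev/dri}"
    modules_file="${2:-/proc/modules}"

    # DRM nodes are named card0, card1, ... (plus renderD128 and so on).
    if [ -d "$dri_dir" ] && ls "$dri_dir" 2>/dev/null | grep -q '^card'; then
        echo "DRM devices: present in $dri_dir"
    else
        echo "DRM devices: missing"
    fi

    # The module appears as nvidia_drm at the start of a line.
    if grep -q '^nvidia_drm ' "$modules_file" 2>/dev/null; then
        echo "nvidia_drm: loaded"
    else
        echo "nvidia_drm: not loaded"
    fi
}

check_drm
```

If the module is loaded but the devices are missing, kernel modesetting is the likely culprit.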

For Ubuntu, use ubuntu-drivers autoinstall to install the recommended drivers automatically, or, for example, apt-get install nvidia-driver-460 if you want a specific NVIDIA driver version. Debian is similar (https://wiki.debian.org/NvidiaGraphicsDrivers#Installation).

For Arch Linux, use pacman -S nvidia lib32-nvidia-utils, or pacman -S nvidia-lts lib32-nvidia-utils for linux-lts kernels. For Manjaro, use mhwd -a pci nonfree 0300. If lsmod | grep nvidia.drm shows after reboot that DRM is not already enabled, enable it: https://wiki.archlinux.org/index.php/NVIDIA#DRM_kernel_mode_setting.

For openSUSE or SLES, go to "Software Repositories" in YaST2. Click "ADD" on the bottom left and "Next" after selecting "Community Repositories". Select "NVIDIA" and click "OK". Trust the repository and go to "Online Update". Click on "Extras" and select "Install All Matching Recommended Packages". This pulls the recommended NVIDIA drivers automatically. Check https://en.opensuse.org/SDB:NVIDIA_drivers for command-line installation.

For RHEL, CentOS, or Fedora, use https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#driver-installation. You don't need to install CUDA on the host if you don't want to: containers do not use the host's CUDA library, and you need the nvidia/cuda image or containers based on it instead.

If you really need to install drivers using the .run file (and you are prepared to reinstall your operating system very frequently), consult https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile, except you just have to follow the steps after downloading NVIDIA-Linux-x86_64-<version>.run instead of cuda_<version>_linux.run as you don't have to install CUDA.

On any distribution, select "yes" ESPECIALLY when you see something similar to the prompt below. During driver installation, also select "yes" or "default" for any other options you are unsure about:

Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later.

You should have the DRM devices available after reboot if you selected yes for this.

If the above option was not applied, or you installed the NVIDIA drivers with the .run file, you might need this (NOT RECOMMENDED because you must rebuild the initramfs every time the kernel is updated, https://wiki.archlinux.org/index.php/NVIDIA#DRM_kernel_mode_setting):

nvidia 364.16 adds support for DRM (Direct Rendering Manager) kernel mode setting. To enable this feature, add the nvidia-drm.modeset=1 kernel parameter. For basic functionality that should suffice, if you want to ensure it's loaded at the earliest possible occasion, or are noticing startup issues (such as the nvidia kernel module being loaded after the Display manager) you can add nvidia, nvidia_modeset, nvidia_uvm and nvidia_drm to the initramfs according to Mkinitcpio#MODULES. If added to the initramfs do not forget to run mkinitcpio every time there is a nvidia driver update.

mkinitcpio in Arch and Manjaro is equivalent to update-initramfs in Ubuntu.
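One common way to set modeset=1 persistently, as an alternative to the kernel command line, is a modprobe options file. The sketch below assumes an Ubuntu-style host; the file name nvidia-drm.conf and the helper name write_modeset_conf are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: writes a modprobe options file that enables DRM kernel
# mode setting for the nvidia-drm module.
#   $1: target file (default /etc/modprobe.d/nvidia-drm.conf, needs root)
write_modeset_conf() {
    conf="${1:-/etc/modprobe.d/nvidia-drm.conf}"
    printf 'options nvidia-drm modeset=1\n' > "$conf"
}

# Typical use, as root (deliberately not executed here):
#   write_modeset_conf
#   update-initramfs -u        # Ubuntu; mkinitcpio -P on Arch and Manjaro
#   reboot
# After reboot, this file should read Y when modesetting is enabled:
#   cat /sys/module/nvidia_drm/parameters/modeset
```

The function only writes the file; rebuilding the initramfs and rebooting are left as manual steps because they differ between distributions.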

NOTE: https://download.nvidia.com/XFree86/Linux-x86_64/470.42.01/README/kms.html

Note that you also need the NVIDIA Container Toolkit to use NVIDIA GPUs in containers.

LiangHuangBC commented 3 years ago

Thanks a lot for your help. I tried your glx image, which is good, but only one person can use it. The thing is, after running the glx image, /dev/dri appeared on the host machine (a side effect of --privileged?), so now we can use this egl image. We are now running Gazebo and ROS on it, lucky~~. Another thing is that I need Vulkan for the LGSVL simulator, so we ordered new machines for it. Hopefully NVIDIA will release a new driver soon so we can go back to the container world.

LeehanLee commented 2 years ago

Hi @LiangHuangBC, you mentioned that you have run Gazebo on this docker-nvidia-egl-desktop container. Did you see the render area of Gazebo keep flickering?

https://user-images.githubusercontent.com/14244974/140729589-3852fbf1-b4e9-41af-ba13-235382339eeb.mp4

ehfd commented 2 years ago

/dev/dri is no longer required for NVIDIA GPUs. Set VGL_DISPLAY=egl[n] or leave it at the default.
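As a rough illustration of the VGL_DISPLAY setting, the wrapper below starts a program through vglrun with the EGL back end selected. run_on_egl and the EGL_DEVICE variable are hypothetical names introduced for this sketch, and it assumes a VirtualGL version with EGL back end support is installed in the container.

```shell
#!/bin/sh
# Hypothetical wrapper: runs a command through vglrun with VGL_DISPLAY set to
# an EGL device. EGL_DEVICE (an assumed variable name) picks the device, for
# example egl0 or egl1; it falls back to plain "egl" when unset.
run_on_egl() {
    VGL_DISPLAY="${EGL_DEVICE:-egl}" vglrun "$@"
}

# Examples (not executed here):
#   run_on_egl glxgears
#   EGL_DEVICE=egl1 run_on_egl glxinfo -B
```

On a multi-GPU host such as the one in this issue, choosing a different egl[n] value per container is one way to spread desktops across GPUs.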

selkies-project / docker-nvidia-egl-desktop

No --device=/dev/dri, glxgears is not running on nvidia gpu #8