j3soon / ros2-essentials

A repo containing essential ROS2 Humble features for controlling Autonomous Mobile Robots (AMRs) and robotic arm manipulators.
https://j3soon.github.io/ros2-essentials/
Apache License 2.0

Unable to access GPU when VirtualGL is installed #45

Closed j3soon closed 3 months ago

j3soon commented 3 months ago

The following error occurs when trying to access the GPU inside a Docker container as a non-root user on a system with VirtualGL installed:

$ nvidia-smi
Failed to initialize NVML: Insufficient Permissions

This was reported by @YuZhong-Chen on August 9th and reproduced by @ClassLongJoe1112 and @j3soon.

j3soon commented 3 months ago

This is because VirtualGL, by default, changes the group ownership of the GPU devices to vglusers. For example:

$ ls -l /dev | grep nvidia
drwxr-xr-x   2 root root           80 Aug  4 19:26 nvidia-caps
crw-rw----   1 root vglusers 195, 254 Aug  4 19:26 nvidia-modeset
crw-rw-rw-   1 root root     510,   0 Aug  4 19:26 nvidia-uvm
crw-rw-rw-   1 root root     510,   1 Aug  4 19:26 nvidia-uvm-tools
crw-rw----   1 root vglusers 195,   0 Aug  4 19:26 nvidia0
crw-rw----   1 root vglusers 195,   1 Aug  4 19:26 nvidia1
crw-rw----   1 root vglusers 195, 255 Aug  4 19:26 nvidiactl
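To check whether this ownership is what blocks the current user, the listing above can be turned into a small diagnostic. This is a minimal sketch: check_device_group is a hypothetical helper name, it assumes GNU stat, and it compares numeric GIDs because group names inside a container may not match those on the host. It only looks at group membership, so it does not account for world-accessible devices such as nvidia-uvm, or for root.

```shell
# Hypothetical helper: report whether the current user belongs to the
# group that owns the given device node (compared by numeric GID).
check_device_group() {
  local dev="$1" dev_gid g in_group
  # GNU stat: %g prints the numeric group ID of the file's owner group.
  dev_gid="$(stat -c '%g' "$dev")" || return 2
  in_group=no
  # id -G lists every numeric GID of the current user, primary group included.
  for g in $(id -G); do
    if [ "$g" = "$dev_gid" ]; then
      in_group=yes
    fi
  done
  if [ "$in_group" = yes ]; then
    echo "ok: current user is in GID $dev_gid, which owns $dev"
  else
    echo "missing: current user is not in GID $dev_gid, which owns $dev"
    return 1
  fi
}
```

Running check_device_group /dev/nvidiactl inside the container should print a "missing" line on an affected machine.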

To resolve this, there are two potential solutions.

  1. Ask the server admin/IT to re-configure VirtualGL to not change the group ownership.
    • Run /opt/VirtualGL/bin/vglserver_config and unconfigure VirtualGL (ref)
    • Run /opt/VirtualGL/bin/vglserver_config and re-configure VirtualGL without changing the group ownership by setting No (n) for the following two options (ref):
      Restrict 3D X server access to vglusers group (recommended)?
      Restrict framebuffer device access to vglusers group (recommended)?
  2. Modify the Dockerfile to add the default user to the vglusers group. However, since the vglusers group may have a different group ID (GID) on different machines, this approach cannot be made portable.

I believe the first solution is the best option, since the second solution is not portable across different machines.

However, it is worth noting that the first solution requires all users on the server to be trusted. If there are untrusted users, the first solution may introduce security risks, and the second solution may be preferable as a workaround.

j3soon commented 3 months ago

Some related references:

KuanYuChang commented 3 months ago

Hi @j3soon,

Please append the following argument to the docker run command when running a container on a system with VirtualGL installed.

--group-add $(getent group vglusers | cut -d: -f3)
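As a hedged sketch of how this argument might be used in practice, the helper below emits --group-add only when the group actually exists, so the same command also works on machines without VirtualGL. The function name group_add_args and the image name my-image are placeholders for illustration, not part of this repo.

```shell
# Hypothetical helper: print "--group-add <GID>" (one word per line) for the
# given group, or nothing at all when the group does not exist on this host.
group_add_args() {
  local group="${1:-vglusers}" gid
  # getent prints "name:passwd:GID:members"; field 3 is the numeric GID.
  gid="$(getent group "$group" | cut -d: -f3 || true)"
  if [ -n "$gid" ]; then
    printf '%s\n' "--group-add" "$gid"
  fi
}

# Example usage (my-image is a placeholder image name):
#   docker run --rm --gpus all $(group_add_args vglusers) my-image nvidia-smi
```

The command substitution is left unquoted on purpose so that an empty result adds no argument to docker run.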

Regards, Kuan-Yu

j3soon commented 3 months ago

Hi @KuanYuChang,

Thanks for sharing this. I hadn’t thought of adding a group when launching the container, which allows access to the GPU without modifying the Dockerfile.


Hi @YuZhong-Chen,

I think we can keep the Dockerfile intact and use a hardcoded group ID in the compose.yaml on that specific machine for now. See the docs for adding the vglusers GID.

I'm thinking of a portable way to support this, which may be achieved through the following shell command:

(getent group vglusers || echo user:x:1000) | cut -d: -f3

which outputs the vglusers GID if it exists, and outputs 1000 otherwise.

Ref: https://stackoverflow.com/a/69987399

However, docker compose files don't seem to allow shell command expansion. We may need a wrapper around docker compose up to achieve this, which may be overkill since VirtualGL is probably not installed on most systems.

Ref: https://github.com/docker/compose/issues/4081
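For reference, such a wrapper could look roughly like the sketch below. It relies on Compose interpolating environment variables in compose.yaml. The names resolve_gid, compose_up_with_vgl, VGLUSERS_GID, and my-service are assumptions made for illustration and are not part of this repo.

```shell
# Assumed compose.yaml fragment (my-service is a placeholder service name):
#   services:
#     my-service:
#       group_add:
#         - ${VGLUSERS_GID}

# Print the GID of group $1, or the fallback GID $2 when the group is absent.
resolve_gid() {
  local entry
  entry="$(getent group "$1" || true)"
  # Fabricate a group-style line from the fallback when getent found nothing.
  printf '%s\n' "${entry:-fallback:x:$2}" | cut -d: -f3
}

# Hypothetical wrapper: export the resolved GID so Compose can interpolate it,
# then forward all arguments to docker compose up.
compose_up_with_vgl() {
  VGLUSERS_GID="$(resolve_gid vglusers 1000)"
  export VGLUSERS_GID
  docker compose up "$@"
}
```

Since Compose also accepts the ${VGLUSERS_GID:-1000} default syntax in compose.yaml, the wrapper would only be needed on machines where VirtualGL is installed.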

I think we can leave this to the users when they're using machines with VirtualGL, and ask them to add the hardcoded GID in the docker compose files.

We can come back to this issue later if someone comes up with a portable solution for docker compose. Thanks!

YuZhong-Chen commented 3 months ago

Self note:

The devcontainer updates the container user's UID and GID to match the local user, which avoids permission problems with bind mounts. After hardcoding the group ID in the compose.yaml file, if you want to launch the container with devcontainer, remember to add "updateRemoteUserUID": false, to the devcontainer.json file to prevent devcontainer from updating your UID and GID. Ref