mviereck / x11docker

Run GUI applications and desktops in docker and podman containers. Focus on security.
MIT License
5.66k stars, 377 forks

GPU-related device nodes appear in container invocation even when --gpu is not specified #455

Open Intensity opened 2 years ago

Intensity commented 2 years ago

When I don't specifically add --gpu (or even when I add --gpu=no) I find various GPU/DRM devices appear in the container command line.

I'm wondering if this is intentional. In some cases I may prefer not to pass in devices if I won't be making use of them.

That said, I am not completely clear on how to validate what's actually passed in. I'm not sure if there are other transforms on the container command line prior to running it, but if I don't specifically add --gpu I might prefer that those devices not get passed in altogether. I'm wondering if there is meant to be some conditional logic on their inclusion and the conditions weren't added in certain sections of x11docker.

In my case I'm running podman user mode, although I'm guessing this sort of thing would be easy to reproduce as it seems to be logic-related.

mviereck commented 2 years ago

When I don't specifically add --gpu (or even when I add --gpu=no) I find various GPU/DRM devices appear in the container command line.

Could you please give me an example? A command and the output of --debug that shows the generated docker/podman command.

If you have image x11docker/xserver, you might be confused by the command that runs the X container. It contains GPU devices, while the container for the desired command does not.

Example:

$ x11docker --gpu=no --debug --desktop x11docker/xfce

[...]

DEBUGNOTE[09:47:44,138]: X container command (rootless no):
  docker run --pull=never \
  --rm \
  --detach \
  --name x11docker_X128_xserver_39260387129 \
  --mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/share,target=/home/lauscher/.cache/x11docker/39260387129-xfce/share \
  --mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/etcpasswd.xcontainer,target=/etc/passwd,readonly \
  --mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/etcgroup.xcontainer,target=/etc/group,readonly \
  --mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/xcontainerrc,target=/xcontainerrc,readonly \
  --security-opt label=type:container_runtime_t \
  --ipc=shareable \
  --runtime runc \
  --cap-drop ALL \
  --security-opt=no-new-privileges \
  --user 1000:1000 \
  --mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/tmp,target=/tmp \
  --mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/Xauthority.server,target=/home/lauscher/.cache/x11docker/39260387129-xfce/Xauthority.server \
  --mount type=bind,source=/home/lauscher/.cache/x11docker/modelines,target=/home/lauscher/.cache/x11docker/modelines,readonly \
  --env DISPLAY=:0.0 \
  --mount type=bind,source=/tmp/.X11-unix/X0,target=/X0,readonly \
  --env XAUTHORITY=/home/lauscher/.cache/x11docker/39260387129-xfce/Xauthority.host.0-0 \
  --mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/Xauthority.host.0-0,target=/home/lauscher/.cache/x11docker/39260387129-xfce/Xauthority.host.0-0 \
  --env LD_PRELOAD=/lib/x86_64-linux-gnu/libdl.so.2:/home/lauscher/.cache/x11docker/39260387129-xfce/share/XlibNoSHM.so  \
  --device /dev/dri/card0:/dev/dri/card0 \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  --device /dev/vga_arbiter:/dev/vga_arbiter \
  --group-add 44 \
  --group-add 133 \
  x11docker/xserver bash /xcontainerrc

[...]

DEBUGNOTE[09:47:48,118]: docker command (rootless no):
  /usr/bin/docker run \
  --pull never \
  --rm \
  --tty \
  --name x11docker_X128_x11docker-xfce_39260387129 \
  --user 1000:1000 \
  --userns=host \
  --group-add 1000 \
  --runtime='runc' \
  --network none \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  --security-opt label=type:container_runtime_t \
  --mount type=bind,source='/usr/bin/tini-static',target='/usr/local/bin/init',readonly \
  --tmpfs /run:exec \
  --tmpfs /run/lock \
  --tmpfs /tmp \
  --mount type=bind,source='/home/lauscher/.cache/x11docker/39260387129-xfce/share',target='/x11docker' \
  --mount type=bind,source='/home/lauscher/.cache/x11docker/39260387129-xfce/tmp/.X11-unix/X128',target='/tmp/.X11-unix/X128',readonly \
  --ipc=container:x11docker_X128_xserver_39260387129 \
  --workdir '/tmp' \
  --entrypoint env \
  --env 'container=docker' \
  --env 'XAUTHORITY=/x11docker/Xauthority.client' \
  --env 'DISPLAY=:128' \
  --env 'USER=lauscher' \
  -- x11docker/xfce /usr/local/bin/init -- /bin/sh - /x11docker/containerrc

I'm not sure if there are other transforms on the container command line prior to running it

The command shown in the --debug output is the one that is finally executed.
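
Since the debug output is the final command, one way to check which devices each container receives is to save that output to a file and filter it. The following is only a minimal sketch: it assumes the log layout shown in the example above (a DEBUGNOTE line containing the word "command" before each block), and the function name devices_per_container is made up for illustration.

```shell
# Hypothetical helper: print each DEBUGNOTE header that introduces a container
# command, followed by the --device flags found in that block.
# Assumes the log layout of the example above; $1 is the saved debug log.
devices_per_container() {
  awk '
    /DEBUGNOTE.*command/ { header = $0; next }   # remember the current command block
    /^[ \t]*--device[ =]/ {
      if (header != "") { print header; header = "" }  # print the header once per block
      sub(/^[ \t]+/, "")                         # drop leading indentation
      sub(/[ \t]*\\$/, "")                       # drop the trailing line continuation
      print "  " $0
    }
  ' "$1"
}
```

A block that passes no devices, such as the command container in the example above, produces no output at all, which makes it easy to see that only the X container receives the /dev/dri nodes.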

if I don't specifically add --gpu I might prefer that those devices not get passed in altogether.

That is entirely right. And it is part of the concept of x11docker to only pass what is needed. Just a note on something I have only known for a short time myself: even if device files such as those in /dev/dri are not shared, they are still accessible somewhere below /sys in the container. This affects docker as well as podman. So far, I have only found glxinfo and glxgears accessing the device files this way (found with strace). For example, try:

$ x11docker --weston-xwayland x11docker/xfce glxgears

You'll find it to be GPU accelerated although /dev/dri is not shared. It doesn't even matter whether the user is in the groups video and render. This is something I find disturbing. I wonder if this should be reported as a bug somewhere. But where? Does it even go down to the kernel level?

Intensity commented 2 years ago

Hi. The output you pasted above (with the /dev/dri devices passed in) shows what I've also been seeing. Although in a quick test, I commented out the setup_gpu_devicelist function contents entirely (so that those devices aren't named in the X container) and x11docker still created a container (granted, I didn't test very much in it). Anyhow perhaps those devices aren't needed then if I'm able to comment out their inclusion in the X container.

I overlooked or forgot the fact that there may have been two containers running (the X container, and the command container). Although I also wasn't sure why ps output didn't show most of those parameters still living on the command line. Anyhow while it's good to note that the devices aren't passed into the command container, I'm wondering if they need to be passed into the X container, especially when the user doesn't actively request the GPU inclusion.

That's a good point about potentially unnecessary access being exposed through /sys, which I believe is a pretty powerful interface. When you note that some DRI access is exposed in /sys, where in the path is that? I'm asking because I'm less familiar with what to look for. And in your case, does it show up on a /sys that's mounted read-only or one that's mounted read-write? While I'd prefer no access beyond what's needed, read-only is still better than read-write. In my instantiated container, mount reports that /sys is read-only while various cgroup-related submounts are read-write.
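
The read-only versus read-write question can be checked from inside a container by reading the mount table. This is a minimal sketch, assuming the usual /proc/mounts layout (device, mount point, filesystem type, options); the function name mount_mode is hypothetical.

```shell
# Hypothetical helper: report whether a mount point is mounted ro or rw.
# $1: mount point to look up
# $2: mount table to read (defaults to /proc/mounts; overridable for testing)
mount_mode() {
  awk -v target="$1" '
    $2 == target {
      # The 4th field is the comma-separated option list; the last matching
      # entry wins, since later mounts sit on top of earlier ones.
      n = split($4, opts, ",")
      mode = "unknown"
      for (i = 1; i <= n; i++)
        if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
    }
    END { print (mode == "" ? "not mounted" : mode) }
  ' "${2:-/proc/mounts}"
}
```

Running `mount_mode /sys` and `mount_mode /sys/fs/cgroup` in the container would show the two cases separately. As for where to look, /sys/class/drm is the usual sysfs location for DRM devices and may be a starting point, though this thread does not say which path glxinfo or glxgears actually used.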

The possible extension here to this ticket is that maybe podman and the like are mounting a whole lot more of /sys (and /proc for that matter) than is strictly needed for effective operation. I wonder then if it's possible to override that by a bind mount that exposes nothing for that particular file or directory. It remains to be seen whether podman will fully cooperate with such overrides. I don't know if podman needs a recompile in order to not fully inject all of /sys into the container altogether.

Although, I ran a very quick test to mount some empty directory into /proc/irq in the container, and that appeared to work (in that the container showed it as empty), although the system also reported a double mount of that destination. Potentially the container may have a way to access the underlying mount point; I'm not sure, but that may be a risk. Although just like before, if I don't need to inject it in, I wonder, why do so at all? If I knew exactly what I didn't need to include for /proc and /sys (or conversely, what I did need to include) I would do that. Hopefully what's needed ultimately is pretty minimal.
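
The empty-directory override described above can be generated for several paths at once. This is only a sketch of that experiment: the function name mask_mount_flags is made up, the flag syntax mirrors the --mount options visible in the debug output earlier in this thread, and whether podman honours such an override for every path under /proc or /sys is an open question here.

```shell
# Hypothetical helper: print one read-only bind-mount flag per container path,
# so that an empty host directory is mounted over each of them.
# $1: an empty directory on the host; remaining arguments: container paths.
mask_mount_flags() {
  empty_dir=$1
  shift
  for target in "$@"; do
    printf '%s\n' "--mount type=bind,source=${empty_dir},target=${target},readonly"
  done
}
```

For example, `mask_mount_flags /tmp/empty /proc/irq` prints `--mount type=bind,source=/tmp/empty,target=/proc/irq,readonly`. As noted above, the original mount still exists underneath, so this hides the path rather than guaranteeing it is unreachable.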

Perhaps some things can be reported upstream, like the possibility of tightening what's injected into a container by podman and others. Or, as you are saying, DRI functionality that wasn't requested, and for which the user may not even hold the relevant UNIX group permissions on the host, could represent more access than a container should get without a specific request.

mviereck commented 2 years ago

Anyhow perhaps those devices aren't needed then if I'm able to comment out their inclusion in the X container. Anyhow while it's good to note that the devices aren't passed into the command container, I'm wondering if they need to be passed into the X container, especially when the user doesn't actively request the GPU inclusion.

That is right, there is mostly no hard need to include them in the X container. At least nxagent and Xephyr run fine without them, maybe not using them at all. However, some nested X servers might make use of them. At least the xpra client uses them for better performance, and at least Xorg and Xwayland need them. Because the command container does not have access to the devices this way, I do not hesitate to add them even in cases where it could be avoided.

In case one uses --gpu=virgl the device files are used even with X servers that would not support GPU acceleration otherwise, like Xephyr and nxagent.
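
The conditions described in the two paragraphs above could be expressed as a small decision function. This is purely an illustrative sketch and is NOT taken from x11docker's source: the function name xcontainer_gpu_need, its arguments, and the three result words are assumptions, and X servers not named above are treated as "optional" only because some nested X servers might use the devices.

```shell
# Hypothetical decision helper, not x11docker's actual logic.
# $1: X server name, $2: value given to --gpu (for example "no" or "virgl")
# Prints "required", "optional", or "unneeded" for the X container.
xcontainer_gpu_need() {
  case "$1" in
    Xorg|Xwayland)
      echo required ;;          # stated above to need the device files
    xpra)
      echo optional ;;          # the client uses them for better performance
    Xephyr|nxagent)
      # Run fine without the devices unless --gpu=virgl is in use.
      if [ "$2" = "virgl" ]; then echo required; else echo unneeded; fi ;;
    *)
      echo optional ;;          # unknown servers might make use of them
  esac
}
```

With such a split, the devices could be skipped for the "unneeded" case while keeping the current behaviour everywhere else.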


I'll write some more about the irregular device access through /sys, I'll check a conversation I had some time ago for some details.

Intensity commented 2 years ago

Because the command container does not have access to the devices this way, I do not hesitate to add them even in cases where it could be avoided.

I meant to reply to your overall last response; I didn't see a way in GitHub to abandon a comment draft.

Anyhow, it's definitely reassuring that the command container doesn't have direct access to those devices. After all, I would imagine x11docker is there to protect against a variety of distribution and external packages that may be less vetted than a long-standing Linux distribution's core X11 packages. I may be inclined not to include the devices in the X container either, for the very reasons of minimality. Although it's probably unlikely (and I don't know all the details), there could still be some vector of exposure, and every year or two there may be a significant enough bug in core X11 functionality that's raised.

I look forward to your thoughts or writeup regarding /sys exposure. I may add my own overrides in the meantime to not include certain directories in the container. Even if it's overkill, it's still an experiment in minimality.