Open Intensity opened 2 years ago
When I don't specifically add --gpu (or even when I add --gpu=no) I find various GPU/DRM devices appear in the container command line.
Could you please give me an example? A command and the output of --debug that shows the generated docker/podman command.
If you have the image x11docker/xserver, you might be confused by the command that runs the X container. It contains GPU devices, while the container for the desired command does not.
Example:
$ x11docker --gpu=no --debug --desktop x11docker/xfce
[...]
DEBUGNOTE[09:47:44,138]: X container command (rootless no):
docker run --pull=never \
--rm \
--detach \
--name x11docker_X128_xserver_39260387129 \
--mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/share,target=/home/lauscher/.cache/x11docker/39260387129-xfce/share \
--mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/etcpasswd.xcontainer,target=/etc/passwd,readonly \
--mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/etcgroup.xcontainer,target=/etc/group,readonly \
--mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/xcontainerrc,target=/xcontainerrc,readonly \
--security-opt label=type:container_runtime_t \
--ipc=shareable \
--runtime runc \
--cap-drop ALL \
--security-opt=no-new-privileges \
--user 1000:1000 \
--mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/tmp,target=/tmp \
--mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/Xauthority.server,target=/home/lauscher/.cache/x11docker/39260387129-xfce/Xauthority.server \
--mount type=bind,source=/home/lauscher/.cache/x11docker/modelines,target=/home/lauscher/.cache/x11docker/modelines,readonly \
--env DISPLAY=:0.0 \
--mount type=bind,source=/tmp/.X11-unix/X0,target=/X0,readonly \
--env XAUTHORITY=/home/lauscher/.cache/x11docker/39260387129-xfce/Xauthority.host.0-0 \
--mount type=bind,source=/home/lauscher/.cache/x11docker/39260387129-xfce/Xauthority.host.0-0,target=/home/lauscher/.cache/x11docker/39260387129-xfce/Xauthority.host.0-0 \
--env LD_PRELOAD=/lib/x86_64-linux-gnu/libdl.so.2:/home/lauscher/.cache/x11docker/39260387129-xfce/share/XlibNoSHM.so \
--device /dev/dri/card0:/dev/dri/card0 \
--device /dev/dri/renderD128:/dev/dri/renderD128 \
--device /dev/vga_arbiter:/dev/vga_arbiter \
--group-add 44 \
--group-add 133 \
x11docker/xserver bash /xcontainerrc
[...]
DEBUGNOTE[09:47:48,118]: docker command (rootless no):
/usr/bin/docker run \
--pull never \
--rm \
--tty \
--name x11docker_X128_x11docker-xfce_39260387129 \
--user 1000:1000 \
--userns=host \
--group-add 1000 \
--runtime='runc' \
--network none \
--cap-drop ALL \
--security-opt no-new-privileges \
--security-opt label=type:container_runtime_t \
--mount type=bind,source='/usr/bin/tini-static',target='/usr/local/bin/init',readonly \
--tmpfs /run:exec \
--tmpfs /run/lock \
--tmpfs /tmp \
--mount type=bind,source='/home/lauscher/.cache/x11docker/39260387129-xfce/share',target='/x11docker' \
--mount type=bind,source='/home/lauscher/.cache/x11docker/39260387129-xfce/tmp/.X11-unix/X128',target='/tmp/.X11-unix/X128',readonly \
--ipc=container:x11docker_X128_xserver_39260387129 \
--workdir '/tmp' \
--entrypoint env \
--env 'container=docker' \
--env 'XAUTHORITY=/x11docker/Xauthority.client' \
--env 'DISPLAY=:128' \
--env 'USER=lauscher' \
-- x11docker/xfce /usr/local/bin/init -- /bin/sh - /x11docker/containerrc
I'm not sure if there are other transforms on the container command line prior to running it
The command shown in the --debug output is the one that is finally executed.
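One way to check which container receives device flags is to filter a saved copy of the debug output. The following is only a sketch: the function name list_devices and the file name debug.log are hypothetical, and the patterns assume the output format shown in the example above (a DEBUGNOTE header line per command, one flag per line).

```shell
#!/bin/sh
# list_devices FILE
# Print each "DEBUGNOTE ... command" header found in FILE, followed by the
# host:container pair of every --device flag that appears after it.
# Assumption: FILE holds x11docker --debug output in the format shown above.
list_devices() {
    awk '
        /^DEBUGNOTE.*command/ { print; next }      # start of a generated command
        $1 == "--device"      { print "  " $2 }    # device passed to that command
    ' "$1"
}
```

For example, after saving the debug output to debug.log, running list_devices debug.log would show the /dev/dri entries under the X container header and nothing under the header of the command container.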
if I don't specifically add --gpu I might prefer that those devices not get passed in altogether.
That is entirely right. And it is part of the concept of x11docker to only pass what is needed.
Just a note, which I have only known for a short time myself: even if device files such as those in /dev/dri are not shared, they are still accessible somewhere below /sys in the container. This affects docker as well as podman.
So far, I have only found glxinfo and glxgears accessing the device files this way (found with strace).
For example, try:
$ x11docker --weston-xwayland x11docker/xfce glxgears
You'll find it to be GPU accelerated although /dev/dri is not shared. It does not even matter whether the user is in the groups video and render.
This is something I find disturbing. I wonder if this should be reported as a bug somewhere. But where? Does it even go down to the kernel level?
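A rough way to compare what a container sees under /dev/dri with what it sees under /sys is sketched below. The thread does not say which path below /sys is involved, so the use of /sys/class/drm (where Linux normally lists DRM devices) is an assumption, and the function name drm_exposure is hypothetical.

```shell
#!/bin/sh
# drm_exposure [ROOT]
# Report whether ROOT/dev/dri exists and list the entries of ROOT/sys/class/drm.
# ROOT defaults to the empty string, i.e. the real filesystem root.
# Assumption: /sys/class/drm is the relevant sysfs location; the original
# discussion only says "somewhere below /sys".
drm_exposure() {
    root=${1:-}
    if [ -d "$root/dev/dri" ]; then
        echo "dev/dri: present"
    else
        echo "dev/dri: absent"
    fi
    if [ -d "$root/sys/class/drm" ]; then
        for entry in "$root"/sys/class/drm/*; do
            # Skip the unexpanded glob when the directory is empty.
            if [ -e "$entry" ]; then
                echo "sysfs: ${entry#"$root"}"
            fi
        done
    fi
    return 0
}
```

Run inside the command container with no argument, a result such as "dev/dri: absent" followed by sysfs lines would match the situation described here: no shared device files, yet GPU entries still visible through /sys.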
Hi. The output you pasted above (with the /dev/dri devices passed in) shows what I've also been seeing. In a quick test, though, I commented out the contents of the setup_gpu_devicelist function entirely (so that those devices aren't named in the X container), and x11docker still created a container (granted, I didn't test very much in it). Anyhow, perhaps those devices aren't needed then if I'm able to comment out their inclusion in the X container.
I overlooked or forgot the fact that there may have been two containers running (the X container and the command container). I also wasn't sure why the ps output didn't show most of those parameters still living on the command line. Anyhow, while it's good to note that the devices aren't passed into the command container, I'm wondering if they need to be passed into the X container, especially when the user doesn't actively request the GPU inclusion.
That's a good point about potentially unnecessary access being exposed through /sys, which I believe is a pretty powerful interface. When you note that some DRI access is exposed in /sys, where in the path is that? I'm asking because I'm less familiar with what to look for. And in your case, is it showing on a /sys that's mounted read-only or one that's mounted read-write? While I'd prefer no access beyond what's needed, read-only is still better than read-write, and in my instantiated container, mount reports that /sys is read-only while various cgroup-related submounts are read-write.
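The read-only versus read-write question can be checked from the mount table rather than by reading the mount output by eye. This is a minimal sketch; the function name mount_mode is hypothetical, and it assumes the usual /proc/mounts layout (device, mount point, type, comma-separated options).

```shell
#!/bin/sh
# mount_mode MOUNTPOINT [TABLE]
# Print "ro" or "rw" for MOUNTPOINT according to TABLE (default: /proc/mounts).
# If the mount point is listed more than once, the last entry wins, since
# later mounts cover earlier ones. Exits non-zero if nothing matches.
mount_mode() {
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro" || opts[i] == "rw")
                    mode = opts[i]
        }
        END { if (mode != "") print mode; else exit 1 }
    ' "${2:-/proc/mounts}"
}
```

Inside a container, mount_mode /sys and mount_mode /sys/fs/cgroup would then report the two cases mentioned above separately.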
A possible extension to this ticket is that podman and the like may be mounting a whole lot more of /sys (and /proc, for that matter) than is strictly needed for effective operation. I wonder then if it's possible to override that with a bind mount that exposes nothing for that particular file or directory. It remains to be seen whether podman will fully cooperate with such overrides. I don't know if podman needs a recompile in order to avoid injecting all of /sys into the container.
That said, I ran a very quick test that mounted an empty directory onto /proc/irq in the container, and that appeared to work (in that the container showed it as empty), although the system also reported a double mount of that destination. Potentially the container may have a way to access the underlying mount point; I'm not sure, but that may be a risk. Just like before, if I don't need to inject something, I wonder why do so at all. If I knew exactly what I didn't need to include for /proc and /sys (or conversely, what I did need to include), I would do that. Hopefully what's needed ultimately is pretty minimal.
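The empty-directory override described above can be generated for several paths at once. The sketch below only builds the flags and does not run a container; the function name mask_args and the example paths are hypothetical, and whether a given runtime accepts a mount over a particular /proc or /sys path has only been observed here for /proc/irq.

```shell
#!/bin/sh
# mask_args EMPTY_DIR PATH...
# Print one read-only bind-mount flag per PATH, each mapping the (empty)
# host directory EMPTY_DIR onto PATH inside the container.
# Assumption: paths contain no whitespace, so the output can be used
# unquoted in a command substitution.
mask_args() {
    empty_dir=$1
    shift
    for path in "$@"; do
        printf '%s\n' "--volume=$empty_dir:$path:ro"
    done
}
```

A possible use would be podman run $(mask_args /path/to/empty /proc/irq) followed by the usual image and command. As noted above, the original mount may still exist underneath, so this hides the path rather than removing it.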
Perhaps some things can be reported upstream, such as the possibility of tightening what podman and others inject into a container. Or, as you are saying, DRI functionality that wasn't requested, and that may not even have been granted on the host through UNIX group permissions, could represent too broad a scope of access for a container.
Anyhow perhaps those devices aren't needed then if I'm able to comment out their inclusion in the X container. Anyhow while it's good to note that the devices aren't passed into the command container, I'm wondering if they need to be passed into the X container, especially when the user doesn't actively request the GPU inclusion.
That is right, there is mostly no hard need to include them in the X container. At least nxagent and Xephyr run fine without them, and may not use them at all.
However, some nested X servers might make use of them. At least the xpra client uses them for better performance, and at least Xorg and Xwayland need them.
Because the command container does not have access to the devices this way, I do not hesitate to add them even in cases where it could be avoided.
If one uses --gpu=virgl, the device files are used even with X servers that would not support GPU acceleration otherwise, like Xephyr and nxagent.
I'll write some more about the irregular device access through /sys; I'll check a conversation I had some time ago for some details.
Because the command container does not have access to the devices this way, I do not hesitate to add them even in cases where it could be avoided.
I meant to reply to your last response as a whole; I didn't see a way in GitHub to abandon a comment draft.
Anyhow, it's definitely reassuring that the command container doesn't have direct access to those devices. After all, I would imagine x11docker is there to protect against a variety of distribution and external packages that may be less vetted than a long-standing Linux distribution's core X11 packages. I may be inclined not to include the devices in the X container either, for the sake of minimality. Although it's probably unlikely (and I don't know all the details), there could still be some vector of exposure, and every year or two a significant enough bug in core X11 functionality may be raised.
I look forward to your thoughts or writeup regarding /sys exposure. I may add my own overrides in the meantime to not include certain directories in the container. Even if it's overkill, it's still an experiment in minimality.
When I don't specifically add --gpu (or even when I add --gpu=no) I find various GPU/DRM devices appear in the container command line.
I'm wondering if this is intentional. In some cases I may prefer not to pass in devices if I won't be making use of them.
I am also not completely clear on how to validate what's actually passed in. I'm not sure if there are other transforms on the container command line prior to running it, but if I don't specifically add --gpu I might prefer that those devices not get passed in altogether. I'm wondering if there is meant to be some conditional logic on their inclusion and the conditions weren't added in certain sections of x11docker.
In my case I'm running podman in user mode, although I'm guessing this sort of thing would be easy to reproduce as it seems to be logic-related.