Closed dongsupark closed 5 years ago
It turns out this issue is actually caused by a behavior change in systemd-nspawn as of v239, specifically this commit.
Before that commit, subcgroups under `machine.slice` were not explicitly created in the cgroup v1 case. Therefore `/sys/fs/cgroup/devices/machine.slice/machine-MACHINENAME/devices.list` simply showed `a *:* rwm`, which means creation of every device node is allowed.
After that commit, systemd-nspawn creates subcgroups for both cgroup v1 and v2, including device cgroups (v1 only). As a result, `devices.list` shows an explicit whitelist like this:
```
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 0:0 rwm
b 0:0 rwm
c 136:* rw
c 10:200 rwm
```
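To make the whitelist semantics concrete, here is a rough sketch in shell of how such a `devices.list` decides whether creating a device node is permitted. This is a toy matcher, not systemd code: it assumes the entry format `<type> <major>:<minor> <access>` with `*` as a wildcard and type `a` matching any device, and it ignores the access-mode (`r`/`w`/`m`) part for brevity.

```shell
#!/bin/sh
# Toy sketch: check whether a device is covered by a cgroup v1 devices.list
# whitelist. Entries look like "<type> <major>:<minor> <access>"; '*' is a
# wildcard and type 'a' matches both char and block devices. Access-mode
# matching (r/w/m) is left out for brevity.
devices_allowed() {
    # $1=whitelist $2=type(c|b) $3=major $4=minor
    printf '%s\n' "$1" | while read -r t maj_min _acc; do
        maj=${maj_min%%:*}
        min=${maj_min##*:}
        [ "$t" = a ] || [ "$t" = "$2" ] || continue
        [ "$maj" = '*' ] || [ "$maj" = "$3" ] || continue
        [ "$min" = '*' ] || [ "$min" = "$4" ] || continue
        echo allowed
    done | grep -q allowed
}

# Subset of the whitelist shown above
list='c 1:3 rwm
c 1:5 rwm
c 136:* rw
c 10:200 rwm'

devices_allowed "$list" c 1 3 && echo "c 1:3 allowed"
devices_allowed "$list" c 1 4 || echo "c 1:4 denied, so creating /dev/port (char 1:4) fails"
```

With a closed whitelist like this, any major:minor pair not listed is rejected, which is exactly why device nodes absent from the list can no longer be created inside the container.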
That's why it's not possible to create devices like `/dev/agpgart` inside the container. But the systemd commit actually seems to do the right thing from a hardening perspective.
Then how do we proceed?
One approach is to use a unit file for running systemd-nspawn. For example, I created a unit file like this:
```ini
[Unit]
Description=Container %i
Documentation=man:systemd-nspawn(1)
PartOf=machines.target
Before=machines.target
After=network.target systemd-resolved.service
RequiresMountsFor=/var/lib/machines

[Service]
ExecStart=/home/dpark/Dev/systemd/build/systemd-nspawn --capability=cap_audit_control,cap_audit_read,cap_audit_write,cap_audit_control,cap_block_suspend,cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_ipc_lock,cap_ipc_owner,cap_kill,cap_lease,cap_linux_immutable,cap_mac_admin,cap_mac_override,cap_mknod,cap_net_admin,cap_net_bind_service,cap_net_broadcast,cap_net_raw,cap_setgid,cap_setfcap,cap_setpcap,cap_setuid,cap_sys_admin,cap_sys_boot,cap_sys_chroot,cap_sys_module,cap_sys_nice,cap_sys_pacct,cap_sys_ptrace,cap_sys_rawio,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_syslog,cap_wake_alarm --bind=/sys/fs/cgroup --bind-ro=/boot --bind-ro=/lib/modules --boot --notify-ready=yes --keep-unit --machine=mknodtest --directory=/srv/fc29
KillMode=mixed
Type=notify
RestartForceExitStatus=133
SuccessExitStatus=133
WatchdogSec=3min
Slice=machine.slice
Delegate=yes
TasksMax=16384
Environment=SYSTEMD_NSPAWN_API_VFS_WRITABLE=1 SYSTEMD_NSPAWN_USE_CGNS=0
# Enforce a strict device policy, similar to the one nspawn configures when it
# allocates its own scope unit. Make sure to keep these policies in sync if you
# change them!
DevicePolicy=closed
DeviceAllow=/dev/net/tun rwm
DeviceAllow=char-pts rw
# nspawn itself needs access to /dev/loop-control and /dev/loop, to implement
# the --image= option. Add these here, too.
DeviceAllow=/dev/loop-control rw
DeviceAllow=block-loop rw
DeviceAllow=block-blkext rw
# nspawn can set up LUKS encrypted loopback files, in which case it needs
# access to /dev/mapper/control and the block devices /dev/mapper/*.
DeviceAllow=/dev/mapper/control rw
DeviceAllow=block-device-mapper rw
## mknod test inside systemd-nspawn
DeviceAllow=/dev/port rwm

[Install]
WantedBy=machines.target
```
Run it with:

```
$ sudo systemctl start systemd-nspawn@mknodtest.service
$ sudo machinectl shell mknodtest
```
and then inside the container, I can do:

```
mknod /dev/port c 1 4
```
To make this work, we need to pass the `--keep-unit` option to systemd-nspawn and keep a unit file like the one above for the host's systemd-nspawn service.
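To double-check on the host that the device policy from the unit file actually took effect, one can query the running service (assuming the service name used above):

```shell
# Show the device policy systemd applied to the running container service.
# Expect DevicePolicy=closed plus the DeviceAllow= entries from the unit file.
systemctl show -p DevicePolicy -p DeviceAllow systemd-nspawn@mknodtest.service
```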
Still not sure how to update kube-spawn to integrate that unit file.
Dealing with additional unit files is too complicated. I've come up with an approach with systemd-run. Will create a PR.
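For reference, the systemd-run variant might look roughly like this. This is only a sketch of the idea, not the actual PR: the transient unit name and the nspawn arguments are placeholders, and the property list is abridged from the unit file above.

```shell
# Sketch: let systemd-run create a transient service with the same device
# properties as the unit file above, so no unit file has to be installed.
# Unit name, machine name and directory are placeholders.
sudo systemd-run \
  --unit=kube-spawn-mknodtest \
  --slice=machine.slice \
  --property=Delegate=yes \
  --property=DevicePolicy=closed \
  --property=DeviceAllow='char-pts rw' \
  --property=DeviceAllow='/dev/net/tun rwm' \
  --property=DeviceAllow='/dev/port rwm' \
  systemd-nspawn --keep-unit --boot --machine=mknodtest --directory=/srv/fc29
```

Repeating `--property=DeviceAllow=` accumulates whitelist entries on the transient unit, the same way repeated `DeviceAllow=` lines do in a unit file.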
This issue is a summary of a bug report in the slack channel and its conversation.
When the host systemd version is v239 or newer, it's not possible to create a device node inside the kube-spawn cluster. As a result, `docker pull` fails with `Permission denied` (an example strace output).
It does not happen when the host systemd-nspawn is v238 or older. So far I have not been able to pinpoint which commit between v238 and v239 has changed the behavior.
I suspect it has something to do with the default policy change of the system call filter (blacklist -> whitelist), though I'm not sure. If that's the case, maybe the correct fix would be to append another option, `--system-call-filter=...`, to systemd-nspawn on the kube-spawn side.
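If the system call filter really were the cause, the workaround might look like the following. This is hypothetical: the exact syscall names to re-allow would need to be confirmed, and the other nspawn arguments are placeholders.

```shell
# Hypothetical: re-allow mknod-related syscalls inside the container.
# --system-call-filter= takes a space-separated list of syscall names.
systemd-nspawn --system-call-filter='mknod mknodat' --boot --directory=/srv/fc29
```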