kinvolk / kube-spawn

A tool for creating multi-node Kubernetes clusters on a Linux machine using kubeadm & systemd-nspawn. Brought to you by the Kinvolk team.
https://kinvolk.io
Apache License 2.0

cannot create a device node inside systemd-nspawn containers #324

Closed dongsupark closed 5 years ago

dongsupark commented 6 years ago

This issue summarizes a bug report from the Slack channel and the discussion that followed.

When the host systemd version is v239 or newer, it's not possible to create a device node inside the kube-spawn cluster. As a result, docker pull fails with Permission Denied.

(an example strace output)

[pid 30990] mknodat(AT_FDCWD, "/dev/agpgart", S_IFCHR|0660, makedev(10, 175) <unfinished ...>
[pid 30990] <... mknodat resumed> )     = -1 EPERM (Operation not permitted)

It does not happen when the host systemd-nspawn is v238 or older. So far I have not been able to pinpoint which commit between v238 and v239 changed the behavior.

I suspect it has something to do with the change in the default system call filter policy (blacklist -> whitelist), though I'm not sure. If that's the case, a correct fix might be to pass an additional --system-call-filter=... option to systemd-nspawn from kube-spawn's side.
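
To tell quickly whether a host falls in the affected range, the version reported by systemd-nspawn can be compared against 239. This is a minimal sketch, not something kube-spawn ships: the function name nspawn_is_affected is hypothetical, and it assumes the first line of `systemd-nspawn --version` looks like "systemd 239 (...)".

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the given systemd version is in
# the range where this issue was reported (v239 or newer).
nspawn_is_affected() {
    # Accept a bare number ("239") or the first line of `--version` output
    # ("systemd 239 (239-1)"): extract the first run of digits.
    ver=$(printf '%s\n' "$1" \
        | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*/\1/p' \
        | head -n 1)
    [ -n "$ver" ] && [ "$ver" -ge 239 ]
}

# Usage on a real host (assumes systemd-nspawn is in PATH):
#   nspawn_is_affected "$(systemd-nspawn --version | head -n 1)" && echo affected
```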

dongsupark commented 5 years ago

It turns out this issue is caused by a behavior change in systemd-nspawn as of v239, specifically this commit.

Before that commit, subcgroups under machine.slice were not explicitly created on cgroup v1. Therefore /sys/fs/cgroup/devices/machine.slice/machine-MACHINENAME/devices.list contained only the single entry "a *:* rwm", which allows creating any device node.

After that commit, systemd-nspawn creates subcgroups for both cgroup v1 and v2, including device cgroups (v1 only). As a result, devices.list shows an explicit whitelist like this:

c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 0:0 rwm
b 0:0 rwm
c 136:* rw
c 10:200 rwm

That's why it's not possible to create devices like /dev/agpgart (char 10:175, which is not in the list) inside the container. The systemd commit does seem to do the right thing from a hardening perspective, though.
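
To check a specific device against such a list without trial and error, the entries can be matched mechanically. The helper below is a simplified sketch of the matching rule (type, major:minor with "*" as a wildcard, then access letters), not the kernel's actual implementation; the name device_allowed is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: check whether a cgroup v1 devices.list permits an
# access on a device. The list is read from stdin.
#   $1 = type (c or b), $2 = major, $3 = minor, $4 = access letters (e.g. m, rw)
device_allowed() {
    awk -v t="$1" -v maj="$2" -v min="$3" -v acc="$4" '
        {
            split($2, mm, ":")
            # Type "a" matches both char and block devices; "*" is a wildcard.
            if (($1 == t || $1 == "a") &&
                (mm[1] == "*" || mm[1] == maj) &&
                (mm[2] == "*" || mm[2] == min)) {
                ok = 1
                # Every requested access letter must appear in the entry.
                for (i = 1; i <= length(acc); i++)
                    if (index($3, substr(acc, i, 1)) == 0) ok = 0
                if (ok) found = 1
            }
        }
        END { exit !found }
    '
}
```

For example, on the host one could run `device_allowed c 10 175 m < /sys/fs/cgroup/devices/machine.slice/machine-MACHINENAME/devices.list` to see whether mknod for /dev/agpgart would be permitted.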

Then how do we proceed?

One approach is to run systemd-nspawn from a unit file. For example, I created a unit file like this:

[Unit]
Description=Container %i
Documentation=man:systemd-nspawn(1)
PartOf=machines.target
Before=machines.target
After=network.target systemd-resolved.service
RequiresMountsFor=/var/lib/machines

[Service]
ExecStart=/home/dpark/Dev/systemd/build/systemd-nspawn --capability=cap_audit_control,cap_audit_read,cap_audit_write,cap_block_suspend,cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_ipc_lock,cap_ipc_owner,cap_kill,cap_lease,cap_linux_immutable,cap_mac_admin,cap_mac_override,cap_mknod,cap_net_admin,cap_net_bind_service,cap_net_broadcast,cap_net_raw,cap_setgid,cap_setfcap,cap_setpcap,cap_setuid,cap_sys_admin,cap_sys_boot,cap_sys_chroot,cap_sys_module,cap_sys_nice,cap_sys_pacct,cap_sys_ptrace,cap_sys_rawio,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_syslog,cap_wake_alarm --bind=/sys/fs/cgroup --bind-ro=/boot --bind-ro=/lib/modules --boot --notify-ready=yes --keep-unit --machine=mknodtest --directory=/srv/fc29
KillMode=mixed
Type=notify
RestartForceExitStatus=133
SuccessExitStatus=133
WatchdogSec=3min
Slice=machine.slice
Delegate=yes
TasksMax=16384
Environment=SYSTEMD_NSPAWN_API_VFS_WRITABLE=1 SYSTEMD_NSPAWN_USE_CGNS=0

# Enforce a strict device policy, similar to the one nspawn configures when it
# allocates its own scope unit. Make sure to keep these policies in sync if you
# change them!
DevicePolicy=closed
DeviceAllow=/dev/net/tun rwm
DeviceAllow=char-pts rw

# nspawn itself needs access to /dev/loop-control and /dev/loop, to implement
# the --image= option. Add these here, too.
DeviceAllow=/dev/loop-control rw
DeviceAllow=block-loop rw
DeviceAllow=block-blkext rw

# nspawn can set up LUKS encrypted loopback files, in which case it needs
# access to /dev/mapper/control and the block devices /dev/mapper/*.
DeviceAllow=/dev/mapper/control rw
DeviceAllow=block-device-mapper rw

## mknod test inside systemd-nspawn
DeviceAllow=/dev/port rwm

[Install]
WantedBy=machines.target

Run it with:

$ sudo systemctl start systemd-nspawn@mknodtest.service
$ sudo machinectl shell mknodtest

and then inside the container, I can do:

mknod /dev/port c 1 4

For this to work, we need to pass the --keep-unit option to systemd-nspawn and maintain a unit file like the one above for the systemd-nspawn service on the host.

I'm still not sure how to update kube-spawn to integrate that unit file.

dongsupark commented 5 years ago

Dealing with additional unit files is too complicated. I've come up with an approach that uses systemd-run instead. I will create a PR.
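
For illustration only, a systemd-run invocation along these lines could carry the same device properties as the unit file above without keeping a file on disk. This is a sketch under assumptions and not necessarily what the PR does: the function name run_nspawn, the unit name prefix "nspawn-", and the DRY_RUN switch are all hypothetical. The systemd-run options used (--unit=, --slice=, --property=) and the nspawn options come from the respective man pages and the unit file above.

```shell
#!/bin/sh
# Hypothetical sketch: start systemd-nspawn in a transient service unit with
# an extra DeviceAllow= entry, instead of maintaining a unit file.
#   $1 = machine name, $2 = container root directory, $3 = device node to allow
# With DRY_RUN set, the arguments are printed one per line and nothing is run.
run_nspawn() {
    machine=$1
    rootdir=$2
    device=$3
    set -- systemd-run \
        --unit="nspawn-$machine" \
        --slice=machine.slice \
        --property=Delegate=yes \
        --property=DevicePolicy=closed \
        --property="DeviceAllow=$device rwm" \
        -- \
        systemd-nspawn --keep-unit --boot \
        --machine="$machine" --directory="$rootdir"
    if [ -n "$DRY_RUN" ]; then
        printf '%s\n' "$@"
    else
        "$@"    # must be run as root
    fi
}

# Usage (dry run, mirrors the mknod test above):
#   DRY_RUN=1 run_nspawn mknodtest /srv/fc29 /dev/port
```

As in the unit file approach, --keep-unit matters: it makes nspawn run inside the transient service rather than allocating its own scope unit, so the device properties set on the service apply to the container.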