hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Rootless Nomad #13669

Open tgross opened 2 years ago

tgross commented 2 years ago

Nomad client agents must be run as root. The notion of "rootless" containers has worked its way through the container ecosystem. This issue is a bit of a brain-dump to assemble some thoughts and discussion around running Nomad "rootless". Please note this isn't yet a roadmap item or even a promise that Nomad will ever support rootless operation. If we decide to pursue this direction, we'd then engage in a design process (RFC) before we could start work on this.

What is Rootless?

Rootless operation has several criteria:

  1. The container orchestrator (ex. Nomad client agent or k8s kubelet) is not running as root.
  2. The container runtime (ex. dockerd, podman) is not running as root.
  3. The root user inside the container cannot be mapped to the root user on the host.

User-namespace mapping (criterion 3) alone can already be done by Nomad for some task drivers, so this issue is primarily focused on running Nomad itself as an unprivileged user.

Why Rootless?

Container runtimes and orchestrators need to perform privileged operations normally reserved to the root user (or to a user that can escalate via sudo or doas):

Therefore running rootless containers has two primary use cases:

Requirements for Rootless

Given the set of privileged operations described above, there are some specific requirements for rootless containers:

Nomad-specific quirks

Nomad supports a wide variety of task drivers, which may have their own "runtimes" that may not even be containers (ex. QEMU).

Because Nomad task groups can have mixed task drivers, Nomad has to split duties of setting up the task environment between the task driver and the rest of the client agent. For example, Nomad clients set up network namespaces, perform cpuset cgroup accounting, etc., but delegate bind-mounts to the task driver.

Nomad supports Windows and Mac! (Natively and not by running in a VM!) We definitely want to provide some exec-like isolation for Windows tasks in the future, so whatever we do here should not block off a path to doing so.

But Everyone Else is Doing It!

So how does everyone else do this? All the implementations I've been able to find combine required kernel and OS configuration, user namespaces, and either setuid binaries for networking or user mode networking.

User namespaces are unfortunately a bit half-baked. Even a cursory glance at recent CVEs (ex. CVE-2022-32250, CVE-2022-1055, CVE-2022-24122, CVE-2021-4197, CVE-2022-0185) illustrates the primary problem. Any vulnerability in user namespaces allows an attacker to escalate to full root. While administrators should decide for themselves whether user namespaces are appropriate for their threat model, we should approach this with caution from Nomad so that we're not encouraging their use by folks that assume they're perfectly safe.

Likewise, setuid binaries allow an unprivileged user access to root-privileged operations. This also means that if any unprivileged user on the machine is compromised, they can immediately escalate to some subset of root's privileges. And if the setuid binary itself is compromised, the attacker owns the entire host. Ideally a setuid binary is well-scoped and well-audited, but because it can be run by any unprivileged user it may be easier to attack than an application running as root, especially if that application is written in a memory-safe language. For single-user machines like developer laptops this may not be an unreasonable tradeoff, but it may not be acceptable for production servers.

Among the common set of setuid binaries for rootless containers are the LXC project's lxc-user-nic and newuidmap and newgidmap leaned on by RootlessKit. The RootlessKit project also uses user-mode networking (via slirp4netns) to bypass the requirement for a setuid binary for networking.

Options

Here are some options to anchor a discussion around. These are in rough order of complexity, but aren't necessarily mutually exclusive either.

Documentation: Some administrators may want to accept giving up some features in exchange for rootless Nomad. We can document all the known kernel and OS configuration values, and document all the feature gaps that administrators will face with rootless Nomad.

Graceful Degradation: It may be that there are features that break running tasks entirely (ex. cpuset management comes to mind) under rootless Nomad. Identifying these and allowing for graceful degradation would help administrators who are ok with losing those features. One tricky bit with this is ensuring that none of the features are security sensitive and end up degrading silently! Another is that we'd probably need additional client fingerprinting to ensure tasks don't get scheduled on clients that can't support rootless operation.

Setuid Networking: Currently Nomad uses CNI to implement networking on Linux. We could move the operations that require privileges into a setuid binary instead, such as lxc-user-nic. We probably should already document the CNI requirement and fail gracefully without it (this needs more fingerprinting on the client), and we'd need to do the same for a setuid binary. We'd almost certainly need to provide some sort of fallback for administrators who don't want it. And none of this works on Windows.

Multi-Process Nomad: The motivation for wanting rootless Nomad is to reduce privileges. Instead of providing "true" rootless operation, we could follow in the long-standing tradition of Unix applications and have Nomad fork itself into multiple processes, only one of which runs as root.

Nomad is already shipped as a single "multi-call" binary; it can run as a Nomad server agent, a Nomad client agent, as the Nomad CLI, as logmon, or as a docker log shim. The client agent can be further split into a process that runs as root and child processes that perform "riskier" tasks such as network IO with the server, downloading artifacts, rendering templates, etc.

Unlike setuid binaries, this approach would work equally well for Windows. We'd do something like call AdjustTokenPrivileges() with SE_PRIVILEGE_REMOVED set to drop privileges.

exFalso commented 11 months ago

Hi, we've been using rootless nomad for quite some time now and it's been working fine up until version 1.5 and 1.6 where the docker driver broke. The culprits seem to be

which for some reason do explicit checks on whether the client is running as root. Why is this the case? Having to run Nomad as root is currently forcing us to stay on 1.4. Is this root check necessary?

tgross commented 11 months ago

@exFalso a bug report isn't really on-topic for this issue. But the PRs are pretty clear as to why this is happening: the Nomad client can't manage Docker-created cgroups (with cgroups v2) unless it's running as root. If you have reason to believe otherwise, please open a new issue.

exFalso commented 11 months ago

I see, is this a new feature in 1.5?

Also, is there perhaps a way to put the cgroup under a parent owned by the docker group to solve this?

tgross commented 10 months ago

Linking in a couple other issues where rootless behavior is being discussed:

As noted above running as non-root is currently unsupported. Building support isn't on the near-term roadmap. Let's try to keep discussion around rootless deployments here in this issue so that we have a central place to define a body of future work.

Kamilcuk commented 6 months ago

Patching Nomad to support docker on nonroot account is trivially simple:

```diff
+++ b/drivers/docker/fingerprint.go
@@ -89,7 +89,7 @@ func (d *Driver) buildFingerprint() *drivers.Fingerprint {
        }

        // disable if non-root on linux systems
-       if runtime.GOOS == "linux" && !utils.IsUnixRoot() {
+       if false && runtime.GOOS == "linux" && !utils.IsUnixRoot() {
                fp.Health = drivers.HealthStateUndetected
                fp.HealthDescription = drivers.DriverRequiresRootMessage
                d.setFingerprintFailure()
```

After compiling with `make release ALL_TARGETS=linux_amd64`, I have a running Nomad 1.7.5 with the docker driver. The properties cpu.totalcompute, memory.totalbytes, cpu.numcores, and even numa.node0.cores look fine on a super-old Fedora 29 with cgroups v1.

tgross commented 6 months ago

@Kamilcuk It isn't simply a matter of removing the check, or it'd be done already. You'll probably find Nomad degrades semi-gracefully without those checks, but in a way that's markedly less secure, because it can't do things like create mounts, set up bridge networking, etc.

Kamilcuk commented 6 months ago

Hi, I understand; it would be nice to know if there is something in the code that might potentially break. Yes, a non-root account is not able to set up bridge networking or create mounts. That is fine and understandable: it is not root, and those features are not needed (for me). The Docker socket is available, so `config { mount volumes }` works.

tgross commented 3 weeks ago

Hi folks! Just wanted to let folks know that we haven't completely forgotten about this issue and we've started to make a small bit of movement. In https://github.com/hashicorp/nomad/pull/23804 and https://github.com/hashicorp/nomad/pull/23803 I've made a few small changes to fingerprinting that should unblock a few more uses once Nomad 1.8.4 ships.

I've been doing a more detailed breakdown of the work needed, using the script below as a starting point for exploration. This script must be run as root so that it can create the cgroup directories, the Nomad data directory, and the file ACLs that the unprivileged nomad user needs.

experimental setup script

```bash
#!/usr/bin/env bash
set -e

echo_ok() {
    echo "$(tput setaf 2)[✔] $(tput sgr0)$1"
}

id -u nomad || useradd --system --user-group -G docker --shell /bin/false nomad
awk -F':' '/docker/{print $4}' /etc/group | grep -q nomad || usermod -G docker nomad
echo_ok "ensured nomad user and group, with access to docker"

mkdir -p /sys/fs/cgroup/nomad.slice/reserve.slice
mkdir -p /sys/fs/cgroup/nomad.slice/share.slice
chown -R nomad:nomad /sys/fs/cgroup/nomad.slice
echo_ok "created reserve.slice and share.slice cgroup directories owned by nomad"

echo "+cpuset" | tee -a /sys/fs/cgroup/cgroup.subtree_control
echo "+cpu" | tee -a /sys/fs/cgroup/cgroup.subtree_control
echo "+io" | tee -a /sys/fs/cgroup/cgroup.subtree_control
echo "+memory" | tee -a /sys/fs/cgroup/cgroup.subtree_control
echo "+pids" | tee -a /sys/fs/cgroup/cgroup.subtree_control
echo_ok "verified cpuset cpu io memory pids controllers enabled"

datadir=$(awk -F' +' '/data_dir/ {print $3}' /etc/nomad.d/base.hcl)
mkdir -p "$datadir"
chown -R nomad:nomad "$datadir"
echo_ok "created $datadir owned by nomad"

# on ubuntu requires:
# apt install acl
touch /run/xtables.lock
chown nomad:nomad /run/xtables.lock
mkdir -p /var/run/docker/netns
chown -R nomad:nomad /var/run/docker/netns
chmod -R g+s /var/run/docker/netns
setfacl -Rdm g:nomad:rwx -m u:nomad:rwx /var/run/docker/netns
setfacl -Rm g:nomad:rwx -m u:nomad:rwx /var/run/docker/netns
setfacl -m g:nomad:rx /var/run/docker
echo_ok "modified file ACLs for network namespace configuration"

cat <
```

Here are some confirmed, specific issues we'll be looking into further:

bridge networking

Bridge networking won't work with Docker in non-root because the dockerd-owned network namespace files are written without group permissions. We can work around this by running the script above to give Nomad access to write to that file tree. Once that's done, or with non-Docker/Podman tasks, this leaves us with errors like:

```
failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="loopback" failed (add): error switching to ns /var/run/docker/netns/1d9c753e71f7: Error switching to ns /var/run/docker/netns/1d9c753e71f7: operation not permitted
```

We have to have CAP_SYS_ADMIN, not just CAP_NET_ADMIN, to enter the network namespace in order to run the CNI plugins that need to execute inside it. Even with CAP_SYS_ADMIN there are additional file system permissions I haven't sorted out yet, resulting in errors like:

```
failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): permission denied
```

But CAP_SYS_ADMIN is effectively root, and there's no such thing as unprivileged network namespaces. So as I noted in the original post, for networking we're almost certainly looking at setuid-root binaries or userland networking. There may be opportunity to revive task-level networking, but that would not work when tasks want to share network namespaces (especially tasks using different drivers in the same group).

docker, podman drivers

As far as we can tell, it's not possible to perform cpuset management when Docker/Podman are using the systemd cgroup manager, as runc asks systemd to create the cgroup which is then owned by root regardless of what we might do with the cgroup filesystem otherwise. It doesn't look like we can set a cgroup parent in a way that systemd will accept either, although I'd like to revisit that. This breaks resources.cores and NUMA-aware scheduling. We'd like to investigate further if it can be made to work with the non-default cgroupfs manager.

exec, java, and exec2 drivers

All three of these drivers currently have a hard-coded error if not running as root. But with that removed, we run into their requirement to create mount and PID namespaces. We can do this unprivileged (see unshare(1)), but we can't configure networking, as described above.

templates

Isolation for templates uses chroot. chroot will fail gracefully, but silently from the job author's perspective, leaving only a log message (`template-render sandbox %q not available: %v`) in the client logs.

allocation filesystem

We don't chown the contents of the allocdir when streaming them for migrations, so these files will all be owned by Nomad instead of the original owner, which likely breaks the alloc at the destination. We also don't create a tmpfs for /secrets when not running as root; we can change this to try to create the tmpfs and fail gracefully.

Kamilcuk commented 3 weeks ago

Hi @tgross. Yay! I took the master branch, compiled it, and ran it on one of our testing machines. It came up, and I was able to test some Docker containers: it works.

```
[sysavtbuild@weelxavt077d ~]$ nomad --version
Nomad v1.8.4-dev
BuildDate 2024-08-16T13:47:19Z
Revision d6be784e2d090861868d9603e6716df26b8e5f0d
```

It tells me it has cgroups:

```
Aug 16 10:30:10 nomad[1989975]:     2024-08-16T10:30:10.106-0400 [INFO]  client.proclib.cg2: initializing nomad cgroups: cores=0-31
```

The following message is printed every time a Docker task is started, which I guess is expected.

```
Aug 16 10:38:40 nomad[1989975]:     2024-08-16T10:38:40.127-0400 [WARN]  client.driver_mgr.docker: docker driver requires running as root: resources.cores and NUMA-aware scheduling will not function correctly on this node, including for non-docker tasks: driver=docker
```

Bottom line, that means I am now eagerly waiting for the next release. Great. Thanks!