Open tgross opened 2 years ago
Hi, we've been using rootless nomad for quite some time now and it's been working fine up until version 1.5 and 1.6 where the docker driver broke. The culprits seem to be
which for some reason do explicit checks on whether the client is run as root. Why is this the case? Having to run nomad as root is currently forcing us to keep using 1.4. Is this root check necessary?
@exFalso a bug report isn't really on-topic for this issue. But the PRs are pretty clear as to why this is happening: the Nomad client can't manage Docker-created cgroups (with cgroups v2) unless it's running as root. If you have reason to believe otherwise, please open a new issue.
I see, is this a new feature in 1.5?
Also, is there perhaps a way to put the cgroup under a parent owned by the docker group to solve this?
Linking in a couple other issues where rootless behavior is being discussed:
As noted above running as non-root is currently unsupported. Building support isn't on the near-term roadmap. Let's try to keep discussion around rootless deployments here in this issue so that we have a central place to define a body of future work.
Patching Nomad to support docker on nonroot account is trivially simple:
+++ b/drivers/docker/fingerprint.go
@@ -89,7 +89,7 @@ func (d *Driver) buildFingerprint() *drivers.Fingerprint {
}
// disable if non-root on linux systems
- if runtime.GOOS == "linux" && !utils.IsUnixRoot() {
+ if false && runtime.GOOS == "linux" && !utils.IsUnixRoot() {
fp.Health = drivers.HealthStateUndetected
fp.HealthDescription = drivers.DriverRequiresRootMessage
d.setFingerprintFailure()
After compiling with make release ALL_TARGETS=linux_amd64 I have a running nomad 1.7.5 with docker driver. The properties cpu.totalcompute, memory.totalbytes, cpu.numcores and even numa.node0.cores look fine on a super-old Fedora29 with cgroups1.
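For comparison with the patch above, here is a minimal sketch of a fingerprint that keeps the driver enabled for non-root users but records what is expected to be missing. The HealthState and Fingerprint types and the fingerprint function are hypothetical stand-ins for illustration, not Nomad's real driver plugin types.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// HealthState and Fingerprint are hypothetical stand-ins for the driver
// plugin types; they are not Nomad's real definitions.
type HealthState string

const HealthHealthy HealthState = "healthy"

type Fingerprint struct {
	Health      HealthState
	Description string
	Warnings    []string
}

// fingerprint keeps the driver usable for non-root users on Linux, but
// records a warning about the features that are expected to be unavailable
// instead of marking the driver as undetected.
func fingerprint(goos string, euid int) Fingerprint {
	fp := Fingerprint{Health: HealthHealthy, Description: "Healthy"}
	if goos == "linux" && euid != 0 {
		fp.Warnings = append(fp.Warnings,
			"not running as root: bridge networking and cpuset management are unavailable")
	}
	return fp
}

func main() {
	fp := fingerprint(runtime.GOOS, os.Geteuid())
	fmt.Printf("health=%s warnings=%d\n", fp.Health, len(fp.Warnings))
	for _, w := range fp.Warnings {
		fmt.Println("warning:", w)
	}
}
```

Passing the OS name and effective UID as arguments keeps the decision testable without actually running as root.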
@Kamilcuk It isn't simply a matter of removing the check or it'd be done already. You'll probably find Nomad degrades semi-gracefully without those checks. But you'll find that it degrades gracefully in a way that's markedly less secure because it can't do things like create mounts, set up bridge networking, etc.
Hi, I understand, it would be nice to know if there is something in code that might potentially break. Yes, non-root account is not able to set up bridge networking and create mounts. That is fine and understandable, he is not root, those features are not needed (for me). Docker socket is available, so config { mount volumes } work.
Hi folks! Just wanted to let folks know that we haven't completely forgotten about this issue and we've started to make a small bit of movement. In https://github.com/hashicorp/nomad/pull/23804 and https://github.com/hashicorp/nomad/pull/23803 I've made a few small changes to fingerprinting that should unblock a few more uses once Nomad 1.8.4 ships.
I've been doing some more detailed break down of the work needed, using the script below as a starting point for exploration. This script must be run as root so that
Here are some confirmed specific issues we'll be looking into further:
bridge networking
Bridge networking won't work with Docker in non-root because the dockerd-owned network namespace files are written without group permissions. We can work around this by running the script above to give Nomad access to write to that file tree. Once that's done, or with non-Docker/Podman tasks, this leaves us with errors like:
failed to setup alloc: pre-run hook \"network\" failed: failed to configure networking for alloc: failed to configure network: plugin type=\"loopback\" failed (add): error switching to ns /var/run/docker/netns/1d9c753e71f7: Error switching to ns /var/run/docker/netns/1d9c753e71f7: operation not permitted
We have to have CAP_SYS_ADMIN, not CAP_NET_ADMIN, to enter the network namespace in order to run CNI plugins that need to run inside it. Even with CAP_SYS_ADMIN there are additional file system permissions I haven't sorted out yet, resulting in errors like:
failed to setup alloc: pre-run hook \"network\" failed: failed to configure networking for alloc: failed to configure network: plugin type=\"bridge\" failed (add): permission denied
But CAP_SYS_ADMIN is effectively root, and there's no such thing as unprivileged network namespaces. So as I noted in the original post, for networking we're almost certainly looking at setuid-root binaries or userland networking. There may be opportunity to revive task-level networking, but that would not work when tasks want to share network namespaces (especially tasks using different drivers in the same group).
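For fingerprinting purposes, a client could check which of these capabilities it actually holds before attempting to enter a network namespace. A minimal sketch, assuming we read the CapEff mask from /proc/self/status; the helper names effectiveCaps and hasCap are hypothetical and not Nomad code. The bit numbers come from linux/capability.h.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

const (
	capNetAdmin = 12 // CAP_NET_ADMIN in linux/capability.h
	capSysAdmin = 21 // CAP_SYS_ADMIN in linux/capability.h
)

// effectiveCaps extracts the CapEff bitmask from the contents of a
// /proc/<pid>/status file.
func effectiveCaps(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "CapEff:" {
			return strconv.ParseUint(fields[1], 16, 64)
		}
	}
	return 0, fmt.Errorf("CapEff not found")
}

// hasCap reports whether the given capability bit is set in the mask.
func hasCap(mask uint64, bit uint) bool {
	return mask&(uint64(1)<<bit) != 0
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read capabilities:", err)
		return
	}
	mask, err := effectiveCaps(string(data))
	if err != nil {
		fmt.Println("cannot parse capabilities:", err)
		return
	}
	fmt.Println("CAP_NET_ADMIN:", hasCap(mask, capNetAdmin))
	fmt.Println("CAP_SYS_ADMIN:", hasCap(mask, capSysAdmin))
}
```

Holding CAP_SYS_ADMIN is necessary but, as noted above, not sufficient: the file system permissions on the namespace files still apply.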
docker, podman drivers
As far as we can tell, it's not possible to perform cpuset management when Docker/Podman are using the systemd cgroup manager, as runc asks systemd to create the cgroup which is then owned by root regardless of what we might do with the cgroup filesystem otherwise. It doesn't look like we can set a cgroup parent in a way that systemd will accept either, although I'd like to revisit that. This breaks resources.cores and NUMA-aware scheduling. We'd like to investigate further if it can be made to work with the non-default cgroupfs manager.
exec, java, and exec2 drivers
All of these drivers currently have a hard-coded error if not running as root. But with that removed, we run into the requirement for them to create mount and pid namespaces. We can do this unprivileged (see unshare) but can't configure networking as described above.
templates
Isolation for templates uses chroot. chroot will fail gracefully but silently to the job author, with a log message template-render sandbox %q not available: %v in the client logs.
allocation filesystem
We don't chown the contents of the allocdir when streaming them for migrations, so these will all be owned by Nomad instead of the original owner. This likely breaks the alloc at the destination. We don't create tmpfs for /secrets if not root; we can change this to try to create the tmpfs and fail gracefully.
Hi @tgross. Yay! I have taken the master branch, compiled it, and run it on one of our testing machines. It came up, and I was able to test some docker containers - it works.
[sysavtbuild@weelxavt077d ~]$ nomad --version
Nomad v1.8.4-dev
BuildDate 2024-08-16T13:47:19Z
Revision d6be784e2d090861868d9603e6716df26b8e5f0d
It tells me it has cgroups:
Aug 16 10:30:10 nomad[1989975]: 2024-08-16T10:30:10.106-0400 [INFO] client.proclib.cg2: initializing nomad cgroups: cores=0-31
The following message is printed every time docker is started, which I guess is expected.
Aug 16 10:38:40 nomad[1989975]: 2024-08-16T10:38:40.127-0400 [WARN] client.driver_mgr.docker: docker driver requires running as root: resources.cores and NUMA-aware scheduling will not function correctly on this node, including for non-docker tasks: driver=docker
Bottom line, that means I am now eagerly waiting for the next release. Great. Thanks!
Nomad client agents must be run as root. The notion of "rootless" containers has worked its way through the container ecosystem. This issue is a bit of a brain-dump to assemble some thoughts and discussion around running Nomad "rootless". Please note this isn't yet a roadmap item or even a promise that Nomad will ever support rootless operation. If we decide to pursue this direction, we'd then engage in a design process (RFC) before we could start work on this.
What is Rootless?
Rootless operation has several criteria:
1. Nomad itself is not running as root.
2. The task driver engine (ex. dockerd, podman) is not running as root.
3. The root user inside the container cannot be mapped to the root user on the host.
User-namespace mapping (criterion 3) alone can already be done by Nomad for some task drivers, so this issue is primarily focused on running Nomad itself as an unprivileged user.
Why Rootless?
Container runtimes and orchestrators need to perform privileged operations normally reserved to the root user (or to a user that can escalate via sudo or doas):
Therefore running rootless containers has two primary use cases:
Requirements for Rootless
Given the set of privileged operations needed described above, there are some specific requirements for rootless containers:
kernel.unprivileged_userns_clone=1
The task driver engine (dockerd, podman, containerd, etc) must be configured for rootless operation. This requires cgroups v2 + user namespaces + either a patched kernel or kernel module (overlay.ko) allowing unprivileged overlayFS or a fuse overlay FS. While this is all the responsibility of the task driver engine, we'd probably need to document anything we intend to support here.
Nomad-specific quirks
Nomad supports a wide variety of task drivers, which may have their own "runtimes" that may not even be containers (ex. QEMU).
Because Nomad task groups can have mixed task drivers, Nomad has to split duties of setting up the task environment between the task driver and the rest of the client agent. For example, Nomad clients set up network namespaces, perform cpuset cgroup accounting, etc., but delegate bind-mounts to the task driver.
Nomad supports Windows and Mac! (Natively and not by running in a VM!) We definitely want to provide some exec-like isolation for Windows tasks in the future, so whatever we do here should not block off a path to doing so.
But Everyone Else is Doing It!
So how does everyone else do this? All the implementations I've been able to find combine required kernel and OS configuration, user namespaces, and either setuid binaries for networking or user mode networking.
User namespaces are unfortunately a bit half-baked. Even a cursory glance at recent CVEs (ex. CVE-2022-32250, CVE-2022-1055, CVE-2022-24122, CVE-2021-4197, CVE-2022-0185) illustrates the primary problem. Any vulnerability in user namespaces allows an attacker to escalate to full root. While administrators should decide for themselves whether user namespaces are appropriate for their threat model, we should approach this with caution from Nomad so that we're not encouraging their use by folks that assume they're perfectly safe.
Likewise, setuid binaries allow an unprivileged user access to root-privileged operations. This also means that if any unprivileged user on the machine is compromised, they can immediately escalate to some set of root privileges. And if the setuid binary itself is compromised, the attacker owns the entire host. Ideally a setuid binary is well-scoped and well-audited, but because it can be run by an unprivileged user it may be easier to attack than an application running as root, especially if that application is in a memory-safe language. For single-user machines like developer laptops, this may not be an unreasonable tradeoff, but this may not be acceptable for production servers.
Among the common set of setuid binaries for rootless containers are the LXC project's lxc-user-nic, and the newuidmap and newgidmap binaries leaned on by RootlessKit. The RootlessKit project also uses user-mode networking (via slirp4netns) to bypass the requirement for a setuid binary for networking.
Options
Here are some options to anchor a discussion around. These are in rough order of complexity, but aren't necessarily mutually exclusive either.
Documentation: Some administrators may want to accept giving up some features in exchange for rootless Nomad. We can document all the known kernel and OS configuration values, and document all the feature gaps that administrators will face with rootless Nomad.
Graceful Degradation: It may be that there are features that break running tasks entirely (ex. cpuset management comes to mind) under rootless Nomad. Identifying these and allowing for graceful degradation would help administrators who are ok with losing those features. One tricky bit with this is ensuring that none of the features are security sensitive and end up degrading silently! Another is that we'd probably need additional client fingerprinting to ensure tasks don't get scheduled on clients that can't support rootless operation.
Setuid Networking: Currently Nomad uses CNI to implement networking on Linux. We could move operations that require privileges to a setuid binary instead, such as lxc-user-nic. We should probably already document the CNI requirement and fail gracefully without it (this needs more fingerprinting on the client), but we'd need to do the same for a setuid binary. We'd almost certainly need to provide some sort of fallback for administrators who don't want it. And none of this works on Windows.
Multi-Process Nomad: The motivation for wanting rootless Nomad is to reduce privileges. Instead of providing "true" rootless operation, we could follow in the long-standing tradition of Unix applications and have Nomad fork itself into multiple processes, only one of which runs as root.
Nomad is already shipped as a single "multi-call" binary; it can run as a Nomad server agent, a Nomad client agent, as the Nomad CLI, as logmon, or as a docker log shim. The client agent can be further split into a process that runs as root and child processes that perform "riskier" tasks such as network IO with the server, downloading artifacts, rendering templates, etc.
Unlike setuid binaries, this approach would work equally well for Windows. We'd do something like call AdjustTokenPrivileges() with SE_PRIVILEGE_REMOVED set to drop privileges.
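On Unix, the multi-process split could be sketched roughly as below: a root parent prepares a re-exec of the same binary under an unprivileged uid/gid for the "riskier" work. This is only an illustration under stated assumptions. The unprivilegedChild helper, the template-renderer subcommand, and the use of uid/gid 65534 (commonly "nobody") are hypothetical, and the sketch only builds the command rather than starting it.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// unprivilegedChild prepares a re-exec of the given binary that will run
// with the supplied uid and gid instead of the parent's credentials.
func unprivilegedChild(self string, uid, gid uint32, args ...string) *exec.Cmd {
	cmd := exec.Command(self, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Credential: &syscall.Credential{Uid: uid, Gid: gid},
	}
	return cmd
}

func main() {
	if os.Geteuid() != 0 {
		fmt.Println("not root: nothing to drop")
		return
	}
	self, err := os.Executable()
	if err != nil {
		fmt.Println("cannot locate own binary:", err)
		return
	}
	// Hypothetical subcommand and uid/gid; a real agent would start the
	// child and communicate with it over a pipe or socket.
	cmd := unprivilegedChild(self, 65534, 65534, "template-renderer")
	fmt.Println("would start", cmd.Args, "as uid", cmd.SysProcAttr.Credential.Uid)
}
```

Setting the credential on the child at exec time means the unprivileged process never holds root, as opposed to starting as root and dropping privileges afterwards.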