check capabilities instead of UID=0 when mounting tmpfs for secrets

dposton80 commented 3 years ago

Nomad version

0.12 (but checking code appears to be still present in latest versions)

Operating system and Environment details

docker, linux RHEL7

Issue

When running from a container that is built without a USER command, the container can run with uid 0 but without root privileges. This causes the raw_exec driver to fail with e.g.

[ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=eba12eb6-7b3a-670a-62a0-aeb559c27eea task=temp-chunkserver-task error="prestart hook "task_dir" failed: mount: operation not permitted"

The problem seems to be when it tries to create the secrets dir in fs_linux.go. It assumes that uid 0 will have the ability to perform the mount operation, but this is not true in my case

Could that code do a better test for root perms? Or just try the mount and fail back to creating the dir? Perhaps it should always just create the dir if fs isolation is none?

Reproduction steps

Due to my locked-down firm environment, I'm not able to freely run docker myself, so not 100% sure on what exact steps are to get a container to run as uid 0 but without root perms.

build container without USER command in Dockerfile
run container in non-privileged mode
run nomad client from inside the container
observe that client can't launch any tasks with raw_exec

I'm not quite sure what causes the container to be run with uid 0 - it doesn't always seem to be the case (it worked OK in a kubernetes cluster, but not from a gitlab runner).

I realise that not having a USER command in a dockerfile is bad practice. But might be nice if nomad just worked anyway.

tgross commented 3 years ago

@dposton80 running Nomad clients in a container isn't really a supported configuration. Nomad clients must be run as root because they require CAP_SYSADMIN and (usually) CAP_NET_ADMIN. So if you're running Nomad in Docker it'll need to run with --privileged at a minimum, and probably without userns isolation on dockerd (although I haven't verified that).

dposton80 commented 3 years ago

Nomad clients don't require root to run in general - nomad would not be useful to us if they did (at least for raw_exec) as we don't have root access to our machines. There seems to be some intention towards this - see #2178 and #2070

It works fine if I set the USER correctly in the container (to a non-root, unprivileged user, that has uid != 0), and run the container without --privileged. So why not make it work for the uid = 0 case as well... looks like a small fix to line 47 of fs_linux.go?

tgross commented 3 years ago

Nomad documents pretty clearly that we expect clients to be running as root. Many features of Nomad clients won't work if the client doesn't have the appropriate capabilities, and the matrix of each feature being assessed as "impossible to support" vs "can fallback to reduced functionality" becomes a combinatorial explosion.

So why not make it work for the uid = 0 case as well... looks like a small fix to line 47 of fs_linux.go?

Hm, what would you propose? While you can do getpcaps to get the processes own capabilities and check for cap_sys_admin in addition to uid=0. But now we've introduced a fallback behavior that silently changes the security characteristics of the secrets directory.

dposton80 commented 3 years ago

Nomad with raw_exec is useful as a scheduler for regular processes from non-root accounts. Are you saying we should not continue to rely on this behaviour? I don't think I'm alone here e.g. https://serverascode.com/2016/10/18/distributed-cron-with-nomad.html

I would defer to your judgement on a possible change, it seems like it is already doing a silent fallback behaviour based on root/non-root so I wouldn't have thought getpcaps would be much different? Could the test be changed to be just getpcaps rather than uid == 0? Is there currently a risk of running as a privileged non-root user and secrets unexpectedly not going into a tmpfs?

tgross commented 3 years ago

Are you saying we should not continue to rely on this behaviour? I don't think I'm alone here

I don't think we'd intentionally break that behavior. But the use case for raw_exec with no network configuration is a fairly narrow range of features and the moment someone steps outside of that range Nomad won't work.

I would defer to your judgement on a possible change, it seems like it is already doing a silent fallback behaviour based on root/non-root so I wouldn't have thought getpcaps would be much different? Could the test be changed to be just getpcaps rather than uid == 0? Is there currently a risk of running as a privileged non-root user and secrets unexpectedly not going into a tmpfs?

Apologies, I could have explained that more carefully -- I mean that we'd be changing the current behavior without it being obvious to the operator: previously we'd fail (as you've seen here) whereas now the job works but maybe not as expected.

I'm not in theory opposed to the notion of swapping a caps check for the UID check, but at a quick glance it looks like there are a dozen or so other places in the codebase where we have a check for UID=0 so doing it as a one-off just for fs_linux seems like it would just push the problem to somewhere else. That being said, I also realize it would probably fix your immediate use case!

So I brought this issue up in an internal discussion and came away a general consensus with my colleagues about a path forward:

We should document what specific capabilities Nomad needs on the Requirements docs. (ref https://github.com/hashicorp/nomad/issues/10050)
We should also document there why we don't recommend running Nomad clients in containers, but also point back to the requirements to tell folks what they'll need to take on.
We're going to open an issue to break down the getpcaps vs UID=0 checks, so that we can take each of those and we can incrementally improve the situation as time permits. ref https://github.com/hashicorp/nomad/issues/10051

I'll re-open this issue and rename it a bit to serve as a pointer for that umbrella issue. Thanks for your patience here and I appreciate you walking me through the use case so carefully.

dposton80 commented 3 years ago

Thank you for looking into this. Also, thank you for an excellent product and for making your codebase a joy to debug. It's a great example of good software engineering.

hashicorp / nomad