systemd / systemd

The systemd System and Service Manager
https://systemd.io
GNU General Public License v2.0

Allow PrivateUsers to pick UID/GID ranges #35168

Open ryantimwilson opened 1 day ago

ryantimwilson commented 1 day ago

Component

No response

Is your feature request related to a problem? Please describe

Meta is migrating to using transient systemd units to start containers. Recently, PrivateUsers=identity was added to support 1:1 mapping of UID/GID in the root namespace: https://github.com/systemd/systemd/pull/34321.

That mode maps only the first 65536 UIDs/GIDs 1:1; every ID at or above 65536 is mapped to nobody. This makes the behavior identical to nspawn.
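As a rough illustration of the behavior described above, this Python sketch models how an outer UID/GID would appear inside a namespace whose identity map covers only the first 65536 IDs. The function name is hypothetical, and 65534 is assumed as the overflow ("nobody") ID, which is the usual default in /proc/sys/kernel/overflowuid.

```python
IDENTITY_RANGE = 65536  # number of IDs mapped 1:1, i.e. 0..65535
OVERFLOW_ID = 65534     # assumption: default overflow ("nobody") ID


def view_inside_namespace(outer_id: int) -> int:
    """Return how an outer UID/GID would appear inside the namespace.

    IDs within the identity range keep their value; everything else
    is unmapped and shows up as the overflow ID.
    """
    if 0 <= outer_id < IDENTITY_RANGE:
        return outer_id
    return OVERFLOW_ID
```

With this model, UID 65535 is still visible as itself, while UID 65536 and anything larger collapses to the overflow ID, which is the limitation this request is about.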

However, this does not work for Meta because we have lots of UID/GIDs and need to map all UIDs 1:1. So it would be useful to have a uid_map like 0 0 4294967295.

But the kernel in the init namespace uses a default uid_map of 0 0 4294967295: https://man7.org/linux/man-pages/man7/user_namespaces.7.html. And systemd detects whether it's in a non-init user namespace by checking whether uid_map differs from 0 0 4294967295: https://github.com/systemd/systemd/blob/893aa45886ef84b1827445dc438e410ad89fbbbf/src/basic/virt.c#L851
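To make the detection problem concrete, here is a minimal Python sketch of the comparison described above. It is not systemd's actual code (that lives in src/basic/virt.c and is written in C); the function names are hypothetical and the sketch only covers the uid_map comparison mentioned in this issue.

```python
INIT_USERNS_MAP = (0, 0, 4294967295)  # default map in the init user namespace


def parse_id_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map/gid_map content into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()  # the kernel pads columns with spaces
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"malformed map line: {line!r}")
        entries.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return entries


def looks_like_non_init_userns(uid_map_text: str) -> bool:
    """Return True if the map differs from the init namespace default."""
    return parse_id_map(uid_map_text) != [INIT_USERNS_MAP]
```

On Linux, the input would be the contents of /proc/self/uid_map. A namespace created with the single line 0 0 4294967295 is indistinguishable from the init namespace under this check.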

Thus, Meta actually uses a UID file like:

0 0 1
1 1 4294967294

This ends up mapping all UIDs 1:1, from 0 through 4294967294 (2^32 - 2), which is the same set the default map covers, but also ensures systemd's running_in_userns() returns true.
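The arithmetic behind the split map can be checked with a short Python sketch. The helper name is hypothetical, and it assumes the map entries do not overlap (the kernel rejects overlapping ranges when the map is written).

```python
def summarize_id_map(text: str) -> tuple[bool, int, int]:
    """Return (is_identity, total_ids, highest_inside_id) for a map."""
    entries = [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]
    # Identity means every range starts at the same ID inside and outside.
    is_identity = all(inside == outside for inside, outside, _ in entries)
    total_ids = sum(count for _, _, count in entries)
    highest_inside_id = max(inside + count - 1 for inside, _, count in entries)
    return is_identity, total_ids, highest_inside_id


single_line = "0 0 4294967295"
split_map = "0 0 1\n1 1 4294967294"

# Both maps cover the same IDs; only their textual form differs.
assert summarize_id_map(single_line) == summarize_id_map(split_map)
```

Both maps are identity maps over 4294967295 IDs, so the split form changes only what a textual comparison against the default map sees.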

Describe the solution you'd like

I see a few possible approaches:

  1. Allow PrivateUsers to use comma-separated UID ranges like nspawn e.g. PrivateUsers=0,1:4294967295. Unlike nspawn, we would have to understand multiple ranges or systemd needs another way to detect we're in a non-init user namespace.
  2. Add PrivateUsers=identity-all to map all UID/GIDs
  3. Change behavior of PrivateUsers=identity to map all UIDs/GIDs

I mildly prefer option 2 with option 1 as a close second.

1 will certainly take the most implementation work re: parsing but is more extensible and consistent with nspawn.

2 is simpler to implement and has the advantage of hiding the nasty uid_map workaround.

3 is inconsistent with nspawn, don't like it.

Describe alternatives you've considered

For testing the new container runtime, one of our developers worked around this by hot patching systemd to do option 3 above. But we'd prefer not doing this.

The systemd version you checked that didn't have the feature you are asking for

257

poettering commented 1 day ago

I really dislike static assignments, it just creates headaches. We should try to get rid of that concept, and not add it to new places.

if you want a full uid mapping then that might be ok, but i'd call it "full". i.e. PrivateUsers=full.

ryantimwilson commented 1 day ago

@poettering ack no static assignments definitely would be nicer.

I see 2 implementations:

  1. Separate field PrivateUsersRange=0-4294967295 that only takes effect if PrivateUsers=range
  2. Append the range to the end of the PrivateUsers property: PrivateUsers=range:0-4294967295

I prefer a separate field (1) as it is clearer IMHO and easier to extend to multiple ranges if we need that in the future. Naming could probably be better though... I don't love PrivateUsersRange.

What do you think?

poettering commented 1 day ago

Nah, I don't want static numeric assignments, hence I am voting for a more high-level PrivateUsers=full, I must say.

ryantimwilson commented 1 day ago

Oh sorry I misread your comment. PrivateUsers=full it is!