Open filbranden opened 5 years ago
The testing introduced in #2014, which runs on Travis-CI, which happens to use Ubuntu Xenial, which includes systemd v229 (much older than the one I was testing with) made it clear that it's hard to have libcontainer make a heavier use of systemd properties in StartTransientUnit while maintaining compatibility with older versions of systemd, which is an objective to be preserved.
So I'm changing my approach slightly here. Instead of refactoring the current code, I'll write a new implementation that works in unified cgroup hierarchy only and then use that code only when unified cgroup hierarchy is detected, falling back to the current code in case the hybrid or legacy hierarchy is in use.
The issue with backward compatibility of properties is still a relevant issue, so I opened issue systemd/systemd#12460 requesting a better solution for that. True, that will only be available in systemd v243 or v244, but since no mainstream Linux distributions have switched to using unified hierarchy by default yet and by the time they do they'll probably have a systemd version that includes that code, we should be OK (in any case, the libcontainer code will implicitly check that a systemd version that includes that feature will be running, so while "unified cgroup hierarchy" is the trigger, there are other components that need to be in place for this all to work correctly.)
This also helps answer the question of having all the features supported in the cgroup2 mode (such as cpuset and freezer cgroup), since we will only activate the new implementation once all the components are in place (kernel 5.2 for freezer support in cgroup2, systemd v244 or maybe v245 to include the D-Bus RPCs to expose cpuset and freezer, etc.) Again, using "unified cgroup hierarchy" as the trigger, but knowing there are more dependencies that need to be in place for that to work.
So I'll close #1991 for now, since the final approach will be different. I opened #2047 to push the constructor of the systemd cgroup driver into the systemd module itself, that way when there are two separate implementations of it (one for "legacy" and another for "unified"), that point can serve as detection to make the decision.
Thanks! Filipe
What's the next step?
@filbranden For unified mode, I think we only care really recent version of systemd. Maybe #1991 can be reopened?
I'm opening this issue to track improvements on systemd cgroup driver I and some others are working on.
The main motivations to improve the systemd cgroup driver are:
It's a path towards cgroupv2 support, since what the systemd API exports matches the feature set from cgroupv2. systemd also supports three hierarchies ("legacy", "hybrid" and "unified") to support Linux distributions and administrators in making a gradual migration to v2.
To support the "rootless" efforts, of running a container manager without root privileges. Projects such as usernetes are making strides in that direction. They rely on using systemd to manage cgroups, since writing to the cgroup tree directly would require root privileges.
To minimize future breakage with changes to kernel implementation and systemd versions. The systemd API encoding cgroupv2 attributes is expected to be stable and to be maintained even if future changes get introduced. New restrictions may start being imposed, to ensure the hierarchy works as expected (non-authoritative agents shouldn't be able to "hijack" resources from the machine), doing delegation correctly is a key step in preventing DoS attacks through the cgroup tree.
The steps I have in mind for this series of refactorings is:
Support working under an unified hierarchy. I have a branch which does most of that (assuming step 1 has been addressed, I'm taking some shortcuts there.) Setting properties goes through systemd, so paths don't matter there. Reading stats goes through the tree, but it's easy to detect which hierarchy is in use and to adjust paths to match the correct one.
Create sub-cgroups of a delegated systemd scope, to pass it to containers. This follows the recommendations from systemd, which are also based on the cgroupv2 implementation (particularly the workings of
cgroup.subtree_control
), so having libcontainer support this delegation protocol would definitely be advantageous.One large concern here is testing. I was talking to @mrunalp about this and he offered to help. Since running this setup on a container can be challenging, he suggested using an external test runner, possibly monitoring GitHub PRs and reporting changes back to GitHub. We'll be working on this testing and will report back here when we have more details. (We might open separate issues to track testing on systemd as well.)
I'm planning to join the OCI call on Wednesday March 13th (also to discuss related opencontainers/runtime-spec#1005 and opencontainers/runtime-spec#1002), so I'm happy to talk more about this libcontainer systemd cgroup driver improvement proposal.