RFE: use systemd user services to spawn subprocesses

keszybz commented 1 month ago

[A continuation of the discussion in https://discussion.fedoraproject.org/t/help-wanted-for-fedora-cosmic/106769.]

Hi, I'm an systemd developer and a recent discussion in Fedora prompted me to test COSMIC (precisely ublue-os/cosmic-base:40-amd64) on a spare laptop.

COSMIC currently just puts everything in one giant systemd service, which means that everything ends up in one giant cgroup. This is not good. It’s essentially a regression to where GNOME and KDE were 2 years ago. With a bit of pain, those DEs converted to a scheme where the services and applets that are started by the DE are not spawned directly, but instead through the systemd API which puts the program in a separate service under the user instance of systemd.

Segregating apps into separate systemd user services has the following advantages:

we can stop (gracefully or forcefully) the service, incl. if it has spawned subprocesses
we get accounting of resources and can set limits and priorities. (The kernel implements this for us, systemd just provides a convenient interface.)
the DE can start things asynchronously and systemd will take care of scheduling jobs relative to one another
systemd-oomd works properly
systemd will automatically respawn services (if configured)
we get some sandboxing (this is a moving target, and unprivileged user systemd can only implement a subset of what the system instance allows, but progress is being made).
the DE does not have to implement the code to spawn subprocesses, reap children, take care of hung processes, detect readiness.
274 can be handled by creating a symlink in the right directory

So… I think COSMIC looks very promising, and I like the flexibility, but I would also love to avoid a regression wrt. to how apps are started. I'm opening the ticket here with the hope of starting a discussion.

The cgroup tree under GNOME: ```console $ systemd-cgls /user.slice/user-1000.slice/user@1000.service ├─session.slice │ ├─gvfs-goa-volume-monitor.service │ │ └─9250 /usr/libexec/gvfs-goa-volume-monitor │ ├─org.gnome.SettingsDaemon.MediaKeys.service │ │ └─8914 /usr/libexec/gsd-media-keys │ ├─org.gnome.SettingsDaemon.Smartcard.service │ │ └─8974 /usr/libexec/gsd-smartcard │ ├─xdg-permission-store.service │ │ └─7749 /usr/libexec/xdg-permission-store │ ├─dbus-broker.service │ │ ├─7655 /usr/bin/dbus-broker-launch --scope user │ │ └─7659 dbus-broker --log 4 --controller 9 --machine-id 3a9d668b4db749398a4a5e78a03bffa5 --max-bytes 100000000000000 --max-fds 2> │ ├─org.gnome.SettingsDaemon.Datetime.service │ │ └─8904 /usr/libexec/gsd-datetime │ ├─xdg-document-portal.service │ │ ├─9173 /usr/libexec/xdg-document-portal │ │ └─9195 fusermount3 -o rw,nosuid,nodev,fsname=portal,auto_unmount,subtype=portal -- /run/user/1000/doc │ ├─org.gnome.SettingsDaemon.Housekeeping.service │ │ └─8907 /usr/libexec/gsd-housekeeping │ ├─xdg-desktop-portal.service │ │ └─9399 /usr/libexec/xdg-desktop-portal │ ├─org.freedesktop.IBus.session.GNOME.service │ │ ├─ 8891 /usr/bin/ibus-daemon --panel disable │ │ ├─ 9070 /usr/libexec/ibus-dconf │ │ ├─ 9071 /usr/libexec/ibus-extension-gtk3 │ │ ├─ 9272 /usr/libexec/ibus-engine-simple │ │ └─10257 /usr/bin/python3 /usr/share/ibus-typing-booster/engine/main.py --ibus ... ``` Under COSMIC: ```console CGroup /user.slice/user-1000.slice/session-2.scope: ├─3530 /usr/bin/greetd --session-worker 12 ├─3570 /usr/bin/cosmic-session ├─3637 cosmic-comp ├─3647 Xwayland :1 -verbose -rootless -terminate -wm 24 -displayfd 41 -listenfd… ├─3662 cosmic-settings-daemon ├─3665 /usr/libexec/geoclue-2.0/demos/agent ├─3667 cosmic-notifications ├─3678 cosmic-panel ├─3686 cosmic-app-library ├─3699 cosmic-launcher ├─3704 cosmic-workspaces ├─3711 cosmic-osd ├─3715 cosmic-bg ├─3726 cosmic-greeter ├─3728 /usr/libexec/xdg-desktop-portal-cosmic ├─3878 cosmic-applet-audio ├─3879 cosmic-applet-battery ├─3881 cosmic-applet-input-sources ├─3882 cosmic-applet-network ├─3883 cosmic-applet-power ├─3884 cosmic-applet-status-area ├─3885 cosmic-applet-tiling ├─3886 cosmic-applet-time ├─3887 cosmic-applet-workspaces ├─3888 cosmic-panel-button com.system76.CosmicAppLibrary ├─3965 cosmic-term ├─3990 /bin/bash ├─5560 pop-launcher ├─5567 /usr/lib/pop-launcher/plugins/cosmic_toplevel/cosmic-toplevel ├─5568 /usr/lib/pop-launcher/plugins/pop_shell/pop-shell └─5592 systemd-cgls /user.slice/user-1000.slice/session-2.scope ```

Drakulix commented 1 month ago

While this is indeed desirable, using systemd in this way would break our current privilege model, as we pass file descriptors around to not expose privileged services on the file system. This is the main reason everything is started by cosmic-session in the first place. Afaik this is not something supported through e.g. systemd-run.

keszybz commented 1 month ago

Clarification: systemd-run is generally not the right option for this kind of thing. It is a command-line tool that is a thin wrapper around the relavant APIs exposed by the manager. Those APIs are stable and documented and should be called directly.

Systemd supports storage and passing of sockets very much. I'm not sure which one would be the most relevant here, and I don't want to describe all the possibilities. Can you describe or point me to some overview of how (and why) COSMIC is passing those fds?

Drakulix commented 1 month ago

Clarification: systemd-run is generally not the right option for this kind of thing. It is a command-line tool that is a thin wrapper around the relavant APIs exposed by the manager. Those APIs are stable and documented and should be called directly.

Noted, however I was expecting the systemd-run tool to match the underlying api surface in capabilities and I found no way of passing existing file descriptors

Systemd supports storage and passing of sockets very much. I'm not sure which one would be the most relevant here, and I don't want to describe all the possibilities. Can you describe or point me to some overview of how (and why) COSMIC is passing those fds?

cosmic-session is sharing a pipe with cosmic-comp. It uses that for a bit of initial communication, e.g. to share the WAYLAND_DISPLAY and DISPLAY settings upon startup, which are then imported into the systemd user session (if available) and set on newly spawned processes.

Additionally a specific command from cosmic-session on that particular connection, will cause cosmic-comp to create a new pipe, attach one end as a wayland-client and pass the other back to cosmic-session over the previously established connection. Connections created this way have additional privileges in the form of additionally exposed protocols or altered behaviour existing protocols and are meant for shell-clients. cosmic-session will pass these to the various component it starts via forking and setting WAYLAND_SOCKET.

Additional pipe are passed via forking to relay notifications between a few components and cosmic-panel does a similar thing internally, where it passes wayland-sockets of it's internal compositor (as every applet is a wayland-client itself) via WAYLAND_SOCKET.

keszybz commented 1 month ago

cosmic-session is sharing a pipe with cosmic-comp.

If cosmic-session and cosmic-comp are very closely connected, maybe they should be parts of the same unit. In particular, can either be restarted without the other one going down?

a specific command from cosmic-session on that particular connection, will cause cosmic-comp to create a new pipe, attach one end as a wayland-client and pass the other back to cosmic-session over the previously established connection

That is an interesting problem.

One option is to use systemd "scopes". (In systemd parlance, a "service" is a unit that is spawned as a child of systemd, and a "scope" is a unit that that is spawned by something else and then the unit is created based on a list of PIDs and those processes are moved to a new cgroup and managed by systemd.) This would mean that cosmic-session would spawn the child as it does now, and then call org.freedesktop.systemd1.Manager.StartTransientUnit with PIDs property set. See https://systemd.io/CONTROL_GROUP_INTERFACE/#properties.

Another option would be to extend the systemd API to support this better. We have APIs that cover stdin/stdout/stderr, but not other fds. See https://github.com/systemd/systemd/issues/33061.

Drakulix commented 1 month ago

If cosmic-session and cosmic-comp are very closely connected, maybe they should be parts of the same unit. In particular, can either be restarted without the other one going down?

The compositor can indeed be restarted and cosmic-session will restart the shell-components in the right order to restore the session. This isn't meant as a way for arbitrary restart though and more of a way to handle crashes and/or aid development/debugging. (At least until something like KDE/Qt's proposed compositor-handoff becomes a more widely supported thing.)

One option is to use systemd "scopes". (In systemd parlance, a "service" is a unit that is spawned as a child of systemd, and a "scope" is a unit that that is spawned by something else and then the unit is created based on a list of PIDs and those processes are moved to a new cgroup and managed by systemd.) This would mean that cosmic-session would spawn the child as it does now, and then call org.freedesktop.systemd1.Manager.StartTransientUnit with PIDs property set. See https://systemd.io/CONTROL_GROUP_INTERFACE/#properties.

This would probably be the easiest option to support right now and also seems to make it easy to keep the systemd-integration optional. As far as I understand it, if systemd is available we should just add a call to create the scope?

This gains us proper cgroups and the perks that come with that, while still allowing cosmic-session to manage the child process, right? (Most notably watch it's exit status and start it again if necessary with a new socket.)

Another option would be to extend the systemd API to support this better. We have APIs that cover stdin/stdout/stderr, but not other fds. See systemd/systemd#33061.

I am not sure how exactly that would work, but it is definitely interesting to think about. I would like to move more responsibilities to an existing service manager opposed to essentially writing our own. This has only really developed out of the necessity to transfer file descriptors to our child processes.

So in that scenario we would pass file descriptors to the systemd-call, I assume? And systemd would make sure to pass them, when starting the process? I guess we would need to figure out how restarting would exactly work in that case.

keszybz commented 1 month ago

This would probably be the easiest option to support right now and also seems to make it easy to keep the systemd-integration optional. As far as I understand it, if systemd is available we should just add a call to create the scope?

This gains us proper cgroups and the perks that come with that, while still allowing cosmic-session to manage the child process, right? (Most notably watch it's exit status and start it again if necessary with a new socket.)

Yes.

So in that scenario we would pass file descriptors to the systemd-call, I assume? And systemd would make sure to pass them, when starting the process? I guess we would need to figure out how restarting would exactly work in that case.

Yes and yes.

For restarting, since a new fd needs to be passed, I think the only option is to stop the previous instance and start a new one from scratch. Normally we want restarts to pass along state, but in this case it sounds like any state becomes invalid during the restart.

Drakulix commented 1 month ago

Yes.

Great. Thanks for the links and bringing this to my attention. I will see to it as soon as as time permits (likely after alpha 1). In case anybody reading this or you personally see this as an opportunity to contribute, PRs are very welcome.

For restarting, since a new fd needs to be passed, I think the only option is to stop the previous instance and start a new one from scratch. Normally we want restarts to pass along state, but in this case it sounds like any state becomes invalid during the restart.

Makes sense, so in the end that doesn't give us much more than scopes?

I guess we could also rework the api to use some more long-lived anonymous sockets, than a pipe, so we could re-use the same fd?

I don't think such an api exists however, I guess the closest would be an abstract socket and some authentication by sharing a secret with the service..

Not sure if that is worth the effort, at least not in the short term.

keszybz commented 1 month ago

Makes sense, so in the end that doesn't give us much more than scopes?

A service gets a clean environment and the possibility to apply sandboxing or other setup. With a scope, this would need to be done by the caller.

I don't think such an api exists however, I guess the closest would be an abstract socket and some authentication by sharing a secret with the service..

Authentication is not useful. All those processes are executed under the same user, so normal linux privilege separation already gives appropriate security properties. There is no real difference between a pipe, a socket in the file system, or something more elaborate like a connection with some form of authentication. Since all the processes are running under the same user, a rogue processes that would be able to open the connection would in most circumstances be able to ptrace the session process and extract any secrets.

If the API used for communication between processes is changed, I guess a UNIX socket with SOCK_SEQPACKET could be a good choice.

Great. Thanks for the links and bringing this to my attention. I will see to it as soon as as time permits (likely after alpha 1). In case anybody reading this or you personally see this as an opportunity to contribute, PRs are very welcome.

Cool! I guess that the design part might be more complicated than the actual code changes…

From the systemd side we'll want to add the API to start services with fds. It's a generic problem and I'm a bit surprised that it hasn't been implemented yet.

skewballfox commented 1 month ago

I was looking into this a bit, sorry for any dumb questions.

I'm guessing the necessary changes (for optionally using Cgroups) would go at the tail end of start_component, and would involve either run_optional_command (like the other systemd related calls), or checking for some env variable to indicate Cgroups should be used?
Is the ProcessKey returned from ProcessManager::start a PID? If not, is there an existing function in launchpad to get a PID for a process?

keszybz commented 1 month ago

* I'm guessing the necessary changes (for optionally using Cgroups) would go at the tail end of [`start_component`](https://github.com/pop-os/cosmic-session/blob/5613bc660649c65b4a4c3fb41605491b9765729a/src/main.rs#L332), and would involve either run_optional_command (like the other systemd related calls), or checking for some env variable to indicate Cgroups should be used?

I can't comment on the side of the cosmic code, but just one clarification: the set up of the cgroup and moving of the procesess to the cgroup must be done by systemd, i.e. the code in cosmic must do a DBus call to systemd.

In the approach with scopes, the sequence would be that the process is spawned and then a call is made to move it to a scope, so yeah, somewhere at the end of start_component sounds right.

skewballfox commented 2 weeks ago

from new control group interfaces

CPUAccounting, CPUShares, BlockIOAccounting, BlockIOWeight, BlockIOReadBandwidth, BlockIOWriteBandwidth, BlockIODeviceWeight, MemoryAccounting, MemoryLimit, DevicePolicy, DeviceAllow for services/scopes/slices.

@keszybz I'm trying to figure out what, if any, of these should be exposed or set for spawned scopes, and I was hoping you might have more information.

CPUAccounting seems to do nothing if cgroup v2 (unified hierarchy) is used, which I think is the default at this point right?
CPUShares, MemoryLimit, and all BlockIO properties are deprecated (according to same link under history)
Would MemoryAccounting propagate to all of the spawned scopes if we are not currently (explicitly) grouping them under a slice?
The only ones left are DeviceAllow and DevicePolicy which I think make sense to expose but I'm not sure

Also, when is it a good Idea to group multiple processes under a single scope?

keszybz commented 2 weeks ago

CPUAccounting seems to do nothing if cgroup v2 (unified hierarchy) is used, which I think is the default at this point right?

Yes. Anything related to cgroups v1 can be ignored at this point. V1 is on its way out, and by the time this work here goes out to users, systemd will have dropped support for it.

CPUShares, MemoryLimit, and all BlockIO properties are deprecated (according to same link under history)

Would MemoryAccounting propagate to all of the spawned scopes if we are not currently (explicitly) grouping them under a slice?

Yes, because the spawned scoped would be siblings.

The only ones left are DeviceAllow and DevicePolicy which I think make sense to expose but I'm not sure

Taking a step back, I don't think that trying to allow-/deny-list which systemd settings can be used is useful. There's just too many settings, and new ones are added very often… Why not just allow any key=value pairs and pass them through? This will reduce you maintanance effort.

Also, when is it a good Idea to group multiple processes under a single scope?

When those processes are part of a single "thing". E.g. a controller process and 5 workers, or a user shell with 'ls' and 'vi' spawned from it…

Drakulix commented 6 days ago

Initial support for this has been merged: https://github.com/pop-os/cosmic-session/pull/54

I think we can close this issue and open new ones to track progress for making better use of scopes or potentially moving the systemd-services, once new interfaces allow for file descriptor passing.

@keszybz Thanks for starting this, do you mind getting pinged, once we look into further work on this front for some insights? Anything else you would like to see cosmic support in this regard, that I am missing?

pop-os / cosmic-epoch

RFE: use systemd user services to spawn subprocesses #356

274 can be handled by creating a symlink in the right directory