This PR addresses the issue of canary false positives caused by system suspend by adding new public methods Suspend() and Resume(), and by optionally connecting them to logind system-sleep handling.
The Bug
During a system suspend-resume (sleep) cycle, the canary thread often experiences a time jump, which causes a starvation false positive. rtkit takes action and demotes the realtime/high priority of all known threads.
Long-running realtime processes (PipeWire, PulseAudio) generally request realtime/high priority only once. If a system goes to sleep, the realtime/high priority scheduling is lost until these long-running processes are next started, after logout and login. As users generally suspend their machines more often than they log in, rtkit is effectively non-functional for these processes, arguably the most important processes to use rtkit.
Even non-long-running processes may have lifecycles which span system suspend-resume cycles, and so operate in a degraded way for users.
See
Why
Given that the primary bug this change seeks to address is the canary false positives, it would seem far simpler to only stop and restart the canary around suspend. However, doing so would degrade security for a controllable window. From a security perspective, one might as well just disable the canary altogether. To safely disable the canary, we need to first demote all threads.
Suspend/Resume Operation
Two new admin operations are added to rtkit.
org.freedesktop.RealtimeKit1.Suspend(), rtkitctl --suspend
org.freedesktop.RealtimeKit1.Resume(), rtkitctl --resume
These temporarily demote and restore managed thread priorities, as well as stop and start the canary.
On Suspend(), all managed threads are demoted and the canary is stopped.
While suspended, new realtime/high priority requests are rejected. Managed thread states are still garbage-collected if a thread exits, but are retained otherwise.
On Resume(), the canary is restarted and all managed threads are re-promoted. Current user burst limit timeouts are restarted, and the re-promotion of threads counts toward burst limiting, but the burst limit is not enforced on the re-promotion.
Calling ResetKnown() or ResetAll() while suspended removes all managed threads which lack realtime/high priority, leaving no threads to re-promote later.
Calling either Suspend() or Resume() multiple times in a row is fine, but only the first call has an effect.
Security Considerations
Suspend() and Resume() are only available to admin callers, preventing abuse. Even so, if a malicious user were able to call Suspend() and Resume() at will, they still could not circumvent the count or burst limits. No new thread promotions can be created while suspended. Further, while the user burst limit is not enforced on resume, it is still updated, and the burst timeout is restarted.
It may be safe to allow new realtime/high priority grants requested while suspended to take effect upon resume, but this is an unlikely case, so it is easier to just refuse them.
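The suspend/resume semantics described above can be sketched as a small state machine. This is only an illustrative model, not the code in this PR: the struct and function names (kit_state, kit_suspend, kit_resume, kit_request) are hypothetical, priorities are reduced to a boolean, and the burst timeout restart is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 8

/* Hypothetical model of one managed thread; not rtkit's real struct. */
struct managed_thread {
    bool in_use;
    bool elevated;          /* currently holds realtime/high priority */
};

/* Hypothetical model of the daemon state touched by Suspend()/Resume(). */
struct kit_state {
    bool suspended;
    bool canary_running;
    unsigned burst_count;   /* promotions counted in the current burst window */
    unsigned burst_max;     /* burst limit applied to new requests */
    struct managed_thread threads[MAX_THREADS];
};

/* Returns true only if this call changed state; repeated calls are no-ops. */
bool kit_suspend(struct kit_state *k)
{
    assert(k != NULL);
    if (k->suspended)
        return false;
    /* Demote first, then stop the canary, so nothing elevated runs unwatched. */
    for (size_t i = 0; i < MAX_THREADS; i++)
        if (k->threads[i].in_use)
            k->threads[i].elevated = false;
    k->canary_running = false;
    k->suspended = true;
    return true;
}

/* Returns true only if this call changed state; repeated calls are no-ops. */
bool kit_resume(struct kit_state *k)
{
    assert(k != NULL);
    if (!k->suspended)
        return false;
    /* Restart the canary before any thread regains priority. */
    k->canary_running = true;
    for (size_t i = 0; i < MAX_THREADS; i++) {
        if (k->threads[i].in_use) {
            k->threads[i].elevated = true;
            k->burst_count++;   /* counted toward the burst, but not enforced */
        }
    }
    k->suspended = false;
    return true;
}

/* New promotion request: returns a slot index, or -1 if refused. */
int kit_request(struct kit_state *k)
{
    assert(k != NULL);
    if (k->suspended)
        return -1;              /* no new grants while suspended */
    if (k->burst_count >= k->burst_max)
        return -1;              /* burst limit is enforced on new requests */
    for (size_t i = 0; i < MAX_THREADS; i++) {
        if (!k->threads[i].in_use) {
            k->threads[i].in_use = true;
            k->threads[i].elevated = true;
            k->burst_count++;
            return (int)i;
        }
    }
    return -1;
}
```

In this model a caller who alternates kit_suspend() and kit_resume() only ever increases burst_count, which mirrors the argument that suspend and resume cannot be used to get around the burst limit.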
logind Integration
This change also adds an optional runtime integration with logind's inhibitor locks for handling system suspend.
If the logind D-Bus service is running and accessible, rtkit will register a "delay" sleep inhibitor and listen for signals from logind about when the system is going to sleep or has just woken up. Because of the sleep inhibitor, logind will wait for rtkit to perform its Suspend() operation before letting the system suspend. On system resume, logind will again notify rtkit, which will perform Resume() and register a new inhibitor.
See https://www.freedesktop.org/wiki/Software/systemd/inhibit/
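The ordering of the inhibitor handshake can be modeled without any D-Bus code. The sketch below is an assumption-laden simulation, not the integration in this PR: sleep_hook and its functions are hypothetical names, the inhibitor is a boolean standing in for the file descriptor that logind's Inhibit() call returns, and the handler stands in for a callback on logind's PrepareForSleep(boolean) signal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; a real integration would talk to logind over D-Bus. */
struct sleep_hook {
    bool inhibitor_held;    /* models the fd of a "delay" sleep inhibitor */
    bool suspended;         /* models rtkit's suspended state */
};

/* Models registering a delay inhibitor at startup or after wake-up. */
void hook_take_inhibitor(struct sleep_hook *h)
{
    assert(h != NULL);
    h->inhibitor_held = true;
}

/* Models handling PrepareForSleep: true before sleep, false after wake-up. */
void hook_prepare_for_sleep(struct sleep_hook *h, bool going_to_sleep)
{
    assert(h != NULL);
    if (going_to_sleep) {
        h->suspended = true;        /* stands in for Suspend() */
        h->inhibitor_held = false;  /* releasing the lock lets logind proceed */
    } else {
        h->suspended = false;       /* stands in for Resume() */
        hook_take_inhibitor(h);     /* re-arm for the next sleep cycle */
    }
}
```

The point of the ordering is that the lock is released only after the suspend work is done, and a fresh lock is taken on every wake-up, since a released inhibitor cannot be reused.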
Alternate Integrations
No alternate automatic system-suspend integration is provided, but rtkitctl --suspend and rtkitctl --resume should make this task easy.
Other Changes
Rename priority (dynamic) to nice_level inside of process_set_high_priority(). Helps differentiate it from priority (static) as used by process_set_realtime(). Also, it's called nice_level everywhere else in the code.
Reduce log spam by not printing a message for every handled dbus message, as that includes dbus introspection and properties related messages. Some programs (Firefox in my case) get rtkit properties more frequently than I would think necessary.