Scheduling of restart jobs causes conflicts with already running stop jobs, failing the transaction

bl33pbl0p commented 5 years ago

systemd version the issue has been seen with

239

Used distribution

Debian

Problem Three units have requirement dependencies as follows.

% cat first_to_start.service 
[Unit]
Description=First to Start, Last to Stop
BindsTo=second_to_start.service third_to_start.service
Before=second_to_start.service third_to_start.service
[Service]
ExecStart=/bin/sleep infinity
ExecStop=/bin/sleep 1
Restart=always
% cat second_to_start.service 
[Unit]
Description=Second to Start, Second to Stop
BindsTo=first_to_start.service third_to_start.service
Before=third_to_start.service
After=first_to_start.service
[Service]
ExecStart=/bin/sleep infinity
ExecStop=/bin/sleep 1
Restart=always
% cat third_to_start.service 
[Unit]
Description=Last to Start, First to Stop
BindsTo=second_to_start.service first_to_start.service
After=second_to_start.service first_to_start.service
[Service]
ExecStart=/bin/sleep infinity
ExecStop=/bin/sleep 1
Restart=always
%

When any of these are running and the main pid dies due to failure (say I kill -9 it for third_to_start), the state changes to failed (on which the BindsTo= dep from the other two causes them to have a stob job enqueued). Meanwhile, before that stop job gets to complete, the 100ms RestartSec= for third_to_start.service is hit and a restart job with the 'fail' job mode is triggered. This transaction will try to start the unit again, pulling in start jobs for the dependents that already have a stop job queued (one waiting and one running as they happen in order). This causes the transaction to fail entirely deemed as destructive (since unmergeable due to the job mode), and all three units will then stop (instead of possibly waking up again).

This is racy, in fact, if RestartSec= is bumped to some higher value, the restart job completes successfully (as there are no more stob jobs at that point of time).

I think it would be appropriate to schedule the timer only after the stop jobs for its dependents are done, because bumping RestartSec= for cases where ExecStop= in either of the units takes an underterministic amount of time is not a practical solution, so is not bailing out on this. Another would be allowing setting the job mode for this transaction, so perhaps something like replace will make it work again (on the cost of abruptly cancelling stop jobs).

Logs of what happens:

% systemctl --user start third_to_start.service
Jan 01 17:03:26 jupiter systemd[972]: Started First to Start, Last to Stop.
Jan 01 17:03:26 jupiter systemd[972]: Started Second to Start, Second to Stop.
Jan 01 17:03:26 jupiter systemd[972]: Started Last to Start, First to Stop.

% systemd-analyze --user log-level debug; kill -9 4808
Jan 01 17:03:47 jupiter systemd[972]: Setting log level to debug.
Jan 01 17:03:47 jupiter systemd[972]: Received SIGCHLD from PID 4808 (sleep).
Jan 01 17:03:47 jupiter systemd[972]: Child 4808 (sleep) died (code=killed, status=9/KILL)
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Child 4808 belongs to third_to_start.service.
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Main process exited, code=killed, status=9/KILL
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Failed with result 'signal'.
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Changed running -> failed
Jan 01 17:03:47 jupiter systemd[972]: first_to_start.service: Trying to enqueue job first_to_start.service/stop/replace
Jan 01 17:03:47 jupiter systemd[972]: first_to_start.service: Installed new job first_to_start.service/stop as 497
Jan 01 17:03:47 jupiter systemd[972]: second_to_start.service: Installed new job second_to_start.service/stop as 498
Jan 01 17:03:47 jupiter systemd[972]: first_to_start.service: Enqueued job first_to_start.service/stop as 497
Jan 01 17:03:47 jupiter systemd[972]: second_to_start.service: Trying to enqueue job second_to_start.service/stop/replace
Jan 01 17:03:47 jupiter systemd[972]: first_to_start.service: Merged into installed job first_to_start.service/stop as 497
Jan 01 17:03:47 jupiter systemd[972]: second_to_start.service: Merged into installed job second_to_start.service/stop as 498
Jan 01 17:03:47 jupiter systemd[972]: second_to_start.service: Enqueued job second_to_start.service/stop as 498
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Unit entered failed state.
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Changed failed -> auto-restart
Jan 01 17:03:47 jupiter systemd[972]: second_to_start.service: About to execute: /bin/sleep 1
Jan 01 17:03:47 jupiter systemd[972]: second_to_start.service: Forked /bin/sleep as 4816
Jan 01 17:03:47 jupiter systemd[972]: second_to_start.service: Changed running -> stop
Jan 01 17:03:47 jupiter systemd[972]: Stopping Second to Start, Second to Stop...
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Failed to set memory.limit_in_bytes: No such file or directory
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Failed to set pids.max: No such file or directory
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Service RestartSec=100ms expired, scheduling restart.
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Trying to enqueue job third_to_start.service/restart/fail
Jan 01 17:03:47 jupiter systemd[972]: Requested transaction contradicts existing jobs: Transaction is destructive.
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Failed to schedule restart job: Transaction is destructive.
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Failed with result 'signal'.
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Changed auto-restart -> failed
Jan 01 17:03:47 jupiter systemd[972]: first_to_start.service: Trying to enqueue job first_to_start.service/stop/replace
Jan 01 17:03:47 jupiter systemd[972]: first_to_start.service: Merged into installed job first_to_start.service/stop as 497
Jan 01 17:03:47 jupiter systemd[972]: second_to_start.service: Merged into installed job second_to_start.service/stop as 498
Jan 01 17:03:47 jupiter systemd[972]: first_to_start.service: Enqueued job first_to_start.service/stop as 497
Jan 01 17:03:47 jupiter systemd[972]: third_to_start.service: Unit entered failed state.
Jan 01 17:03:48 jupiter systemd[972]: Received SIGCHLD from PID 4816 (sleep).
Jan 01 17:03:48 jupiter systemd[972]: Child 4816 (sleep) died (code=exited, status=0/SUCCESS)
Jan 01 17:03:48 jupiter systemd[972]: second_to_start.service: Child 4816 belongs to second_to_start.service.
Jan 01 17:03:48 jupiter systemd[972]: second_to_start.service: Control process exited, code=exited status=0
Jan 01 17:03:48 jupiter systemd[972]: second_to_start.service: Got final SIGCHLD for state stop.
Jan 01 17:03:48 jupiter systemd[972]: second_to_start.service: Changed stop -> stop-sigterm
Jan 01 17:03:48 jupiter systemd[972]: Child 4807 (sleep) died (code=killed, status=15/TERM)
Jan 01 17:03:48 jupiter systemd[972]: second_to_start.service: Child 4807 belongs to second_to_start.service.
Jan 01 17:03:48 jupiter systemd[972]: second_to_start.service: Main process exited, code=killed, status=15/TERM
Jan 01 17:03:48 jupiter systemd[972]: second_to_start.service: Changed stop-sigterm -> dead
Jan 01 17:03:48 jupiter systemd[972]: second_to_start.service: Job second_to_start.service/stop finished, result=done
Jan 01 17:03:48 jupiter systemd[972]: Stopped Second to Start, Second to Stop.
Jan 01 17:03:48 jupiter systemd[972]: Received SIGCHLD from PID 4807 (n/a).
Jan 01 17:03:48 jupiter systemd[972]: first_to_start.service: About to execute: /bin/sleep 1
Jan 01 17:03:48 jupiter systemd[972]: first_to_start.service: Forked /bin/sleep as 4817
Jan 01 17:03:48 jupiter systemd[972]: first_to_start.service: Changed running -> stop
Jan 01 17:03:48 jupiter systemd[972]: Stopping First to Start, Last to Stop...
Jan 01 17:03:49 jupiter systemd[972]: Received SIGCHLD from PID 4817 (sleep).
Jan 01 17:03:49 jupiter systemd[972]: Child 4817 (sleep) died (code=exited, status=0/SUCCESS)
Jan 01 17:03:49 jupiter systemd[972]: first_to_start.service: Child 4817 belongs to first_to_start.service.
Jan 01 17:03:49 jupiter systemd[972]: first_to_start.service: Control process exited, code=exited status=0
Jan 01 17:03:49 jupiter systemd[972]: first_to_start.service: Got final SIGCHLD for state stop.
Jan 01 17:03:49 jupiter systemd[972]: first_to_start.service: Changed stop -> stop-sigterm
Jan 01 17:03:49 jupiter systemd[972]: Child 4806 (sleep) died (code=killed, status=15/TERM)
Jan 01 17:03:49 jupiter systemd[972]: first_to_start.service: Child 4806 belongs to first_to_start.service.
Jan 01 17:03:49 jupiter systemd[972]: first_to_start.service: Main process exited, code=killed, status=15/TERM
Jan 01 17:03:49 jupiter systemd[972]: first_to_start.service: Changed stop-sigterm -> dead
Jan 01 17:03:49 jupiter systemd[972]: first_to_start.service: Job first_to_start.service/stop finished, result=done
Jan 01 17:03:49 jupiter systemd[972]: Stopped First to Start, Last to Stop.
Jan 01 17:03:49 jupiter systemd[972]: Received SIGCHLD from PID 4806 (n/a).

% systemd-analyze --user log-level info
Jan 01 17:04:00 jupiter systemd[972]: Setting log level to info.

bl33pbl0p commented 5 years ago

This is related, and a similar problem.

https://lists.freedesktop.org/archives/systemd-devel/2019-January/041942.html

bl33pbl0p commented 5 years ago

Looking at the code, I can only think of three solutions so far, further discussion would be appreciated:

Allow setting the job mode for the Restart= job manager triggers. Useful for debugging ~~Causes stop jobs to not be processed correctly when used, often leading to unclean termination.~~ It seems it does wait for the control processes that were spawned as part of running job to exit, returning EAGAIN internally to job_is_runnable from unit_start (coming from the start function of the vtable corresponding to service_start (so it waits until all of that is done). Hence, it seems this might be an acceptable solution. Cancel the running stop job, and install a start job instead, and avoid all of the other complexity.

This is only for services though, I am not sure how this mixes with other unit types...

Wait for any conflicting jobs of dependencies to be processed before scheduling timer: Easy fix, but is racy (what if some conflicting job comes in right before we schedule the timer and conflicts again).
Introduce a concept of job queue per unit Probably the most involved but also the correct solution. Allow, through a job mode, to queue jobs on a unit. One way is to make it possible to install multiple jobs (maybe have a limited number of slots), and then when processing a job for a unit, do it in order they were installed. It is unclear however if all jobs installed in a unit should be done atomically, and how ordering has effect. I guess it would be to check if u->job_after is non-NULL after one processes u->job and then not return JOB_DONE but instead go to dispatch it iff the previous job had the JOB_DONE result (assuming we allow for two slots). This however needs more reworking as we also need to respect ordering of any other jobs in queue that might just be waiting on stop (and need to happen before start).

Now, two slots has some interesting effect in my testing. Any further requests with, say, the 'append' job mode, may replace what was wrongly installed after the running job for unit previously.

Another way to fake this queuing effect is to mark the job as merged but re-run it, like JOB_RELOAD, but this probably involves exploring the combinatorial matrix of all possible conflicting combinations and how they should be dealt with, and again has interesting effects with ordering of other jobs of other units.

Shuangistan commented 5 years ago

The udev problem is not solved. https://lists.freedesktop.org/archives/systemd-devel/2019-January/041942.html I will create another ticket.

bl33pbl0p commented 5 years ago

@Shuangistan Yes, that problem is related to device state machine using JOB_FAIL, it is similar in nature to this one, hence feel free to open a new ticket to track it.

systemd / systemd

Scheduling of restart jobs causes conflicts with already running stop jobs, failing the transaction #11305