Closed JackNewman12 closed 1 year ago
Thanks again for your detailed reports, very appreciated! I'll have a look at it, starting with a regression test to reproduce the issue.
I agree with your suggestion in 2. above, it'll likely solve the problem.
There, it works better now, all retry jobs are cancelled when a service loses conditions, and their state machines are unblocked and driven to 'waiting' state as expected (without passing through the retry job). I hope this is to your satisfactory.
Unfortunately it's impossible to fix it as we talked about (stop dependant tasks before stopping the one the user asked for), without a rather large refactor, because of the way conditions are implemented -- it's a chicken and the egg situation, basically. I'd rather not start that refactor now since I fear that would delay this release way into next year, and we're already late as it is.
This is almost fixed.
Having a requirement of <pid/parent>
works perfectly.
But doing <service/parent/running>
still has the running with pid 0 issue.
I guess its the missing cond_update(): match
line.
finit: api_cb():svc query: parent
finit: svc_parse_jobstr():Got str:'parent'
finit: svc_parse_jobstr():Got token:'parent'
finit: api_cb():stop parent
finit: svc_parse_jobstr():Got str:'parent'
finit: svc_parse_jobstr():Got token:'parent'
finit: service_step(): parent(4537): running disabled/clean cond:on
finit: service_stop():Sending SIGTERM to pid:4537 name:parent
finit: Stopping parent[4537], sending SIGTERM ...
finit: service_stop():kill(-4537, 15) => rc 0, errno 2
finit: svc_set_state():parent is stopping, wait 3 sec before sending SIGKILL ...
finit: cond_clear():service/parent/
finit: cond_clear_noupdate():service/parent/
finit: cond_set_path():/var/run/finit/cond/service/parent/ <= 0
finit: service_step(): parent(4537): -> stopping
finit: service_step(): parent(4537): stopping disabled/clean cond:on
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000008
finit: pidfile_update_conds():Found svc myservice for /var/run/myservice.pid with pid 4539
finit: pidfile_update_conds():Setting myservice PID file to /var/run/myservice.pid
finit: cond_set():pid/myservice
finit: cond_set_noupdate():pid/myservice
finit: cond_set_path():/var/run/finit/cond/pid/myservice <= 2
finit: cond_set():service/myservice/ready
finit: cond_set_noupdate():service/myservice/ready
finit: cond_set_path():/var/run/finit/cond/service/myservice/ready <= 2
finit: cond_update():service/myservice/ready
finit: sigchld_cb():Collected child 4537, status: 256
finit: service_monitor():collected parent(4537), normal exit: 1, signaled: 0, exit code: 1
finit: service_step(): parent( 0): stopping disabled/clean cond:on
finit: service_step():parent: stopped, cleaning up timers and conditions ...
finit: cond_clear():pid/parent
finit: cond_clear_noupdate():pid/parent
finit: cond_set_path():/var/run/finit/cond/pid/parent <= 0
finit: cond_update():pid/parent
finit: cond_clear():service/parent/
finit: cond_clear_noupdate():service/parent/
finit: cond_set_path():/var/run/finit/cond/service/parent/ <= 0
finit: cond_set_oneshot():service/parent/stopped
finit: cond_set_oneshot_noupdate():service/parent/stopped => /var/run/finit/cond/service/parent/stopped
finit: cond_update():service/parent/stopped
finit: service_step(): parent( 0): -> stopped
finit: service_step(): parent( 0): stopped disabled/clean cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: sigchld_cb():Collected child 4539, status: 0
finit: service_monitor():collected myservice(4539), normal exit: 1, signaled: 0, exit code: 0
finit: service_step(): myservice( 0): running enabled/clean cond:off
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/restart
finit: cond_set_oneshot_noupdate():service/myservice/restart => /var/run/finit/cond/service/myservice/restart
finit: cond_update():service/myservice/restart
finit: service_step():delayed restart of myservice
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: pidfile_update_conds():path: /var/run/parent.pid, mask: 00000008
finit: pidfile_update_conds():No matching svc for /var/run/parent.pid
finit: pidfile_update_conds():path: /var/run/parent.pid, mask: 00000200
finit: pidfile_update_conds():No matching svc for /var/run/parent.pid
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000200
finit: pidfile_update_conds():No matching svc for /var/run/myservice.pid
finit: service_retry():myservice crashed, trying to start it again, attempt 5
finit: Service myservice[4539] died, restarting in 2000 msec (5/10)
finit: service_step(): myservice( 0): halted enabled/clean cond:off
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: service_step(): myservice( 0): -> waiting
finit: service_step(): myservice( 0): waiting enabled/clean cond:off
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/running
finit: cond_set_oneshot_noupdate():service/myservice/running => /var/run/finit/cond/service/myservice/running
finit: cond_update():service/myservice/running
finit: Successfully restarted crashing service myservice.
Hmm, OK I'll have a look. Thanks for testing!
Reproduced the issue with <service/parent/running>
. Have a pretty good idea what it is.
Puh, that was great! Awesome finds, thank you so much @JackNewman12 :clap::clap::clap:
When a service crashes, there is a race condition between the service retry timer firing, and a condition becoming false.
parent
servicemyservice
crashes due toparent
disappearing (I guess before finit has had a chance to stop it? TBD I think finit should probably kill "dependent" services before stopping the user requested one.)myservice
retry timer startsinitctl cond
shows<service/parent/running>
as offmyservice
timer gets hit and the service becomesrunning
with pid of0
. Same issue as #313I can't seem to get this same race condition using the
/usr
feature. I guess theHUP
must cover it up.If I do a
initctl stop myservice; initctl start myservice
then I correctly get thewaiting
status onmyservice
The log: