troglobit / finit

Fast init for Linux. Cookies included
https://troglobit.com/projects/finit/
MIT License
622 stars 61 forks source link

Race condition between services crashing and conditions toggling #314

Closed JackNewman12 closed 1 year ago

JackNewman12 commented 1 year ago

When a service crashes, there is a race condition between the service retry timer firing, and a condition becoming false.

service <service/parent/running> /bin/myservice
## or
service <pid/parent> /bin/myservice
  1. Stop parent service
  2. myservice crashes due to parent disappearing (I guess before finit has had a chance to stop it? TBD I think finit should probably kill "dependent" services before stopping the user requested one.)
  3. myservice retry timer starts
  4. initctl cond shows <service/parent/running> as off
  5. myservice timer gets hit and the service becomes running with pid of 0 . Same issue as #313

I can't seem to get this same race condition using the /usr feature. I guess the HUP must cover it up.

If I do a initctl stop myservice; initctl start myservice then I correctly get the waiting status on myservice

The log:


finit: api_cb():svc query: parent
finit: svc_parse_jobstr():Got str:'parent'
finit: svc_parse_jobstr():Got token:'parent'
finit: api_cb():stop parent
finit: svc_parse_jobstr():Got str:'parent'
finit: svc_parse_jobstr():Got token:'parent'
finit: service_step():                 parent(7633):  running disabled/clean   cond:on
finit: service_stop():Sending SIGTERM to pid:7633 name:parent
finit: Stopping parent[7633], sending SIGTERM ...
finit: service_stop():kill(-7633, 15) => rc 0, errno 2
finit: svc_set_state():parent is stopping, wait 3 sec before sending SIGKILL ...
finit: cond_clear():service/parent/
finit: cond_clear_noupdate():service/parent/
finit: cond_set_path():/var/run/finit/cond/service/parent/ <= 0
finit: service_step():                 parent(7633): -> stopping
finit: service_step():                 parent(7633): stopping disabled/clean   cond:on
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: pidfile_update_conds():path: /var/run/parent.pid, mask: 00000008
finit: pidfile_update_conds():Found svc parent for /var/run/parent.pid with pid 7633
finit: pidfile_update_conds():Setting parent PID file to /var/run/parent.pid
finit: cond_set():pid/parent
finit: cond_set_noupdate():pid/parent
finit: cond_set_path():/var/run/finit/cond/pid/parent <= 2
finit: cond_set():service/parent/ready
finit: cond_set_noupdate():service/parent/ready
finit: cond_set_path():/var/run/finit/cond/service/parent/ready <= 2
finit: cond_update():service/parent/ready
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000008
finit: pidfile_update_conds():Found svc myservice for /var/run/myservice.pid with pid 7634
finit: pidfile_update_conds():Setting myservice PID file to /var/run/myservice.pid
finit: cond_set():pid/myservice
finit: cond_set_noupdate():pid/myservice
finit: cond_set_path():/var/run/finit/cond/pid/myservice <= 2
finit: cond_set():service/myservice/ready
finit: cond_set_noupdate():service/myservice/ready
finit: cond_set_path():/var/run/finit/cond/service/myservice/ready <= 2
finit: cond_update():service/myservice/ready
finit: sigchld_cb():Collected child 7633, status: 256
finit: service_monitor():collected parent(7633), normal exit: 1, signaled: 0, exit code: 1
finit: service_step():                 parent(   0): stopping disabled/clean   cond:on
finit: service_step():parent: stopped, cleaning up timers and conditions ...
finit: cond_clear():pid/parent
finit: cond_clear_noupdate():pid/parent
finit: cond_set_path():/var/run/finit/cond/pid/parent <= 0
finit: cond_update():pid/parent
finit: cond_clear():service/parent/
finit: cond_clear_noupdate():service/parent/
finit: cond_set_path():/var/run/finit/cond/service/parent/ <= 0
finit: cond_set_oneshot():service/parent/stopped
finit: cond_set_oneshot_noupdate():service/parent/stopped => /var/run/finit/cond/service/parent/stopped
finit: cond_update():service/parent/stopped
finit: service_step():                 parent(   0): ->  stopped
finit: service_step():                 parent(   0):  stopped disabled/clean   cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: sigchld_cb():Collected child 7634, status: 0
finit: service_monitor():collected myservice(7634), normal exit: 1, signaled: 0, exit code: 0
finit: service_step():               myservice(   0):  running  enabled/clean   cond:off
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/restart
finit: cond_set_oneshot_noupdate():service/myservice/restart => /var/run/finit/cond/service/myservice/restart
finit: cond_update():service/myservice/restart
finit: service_step():delayed restart of myservice
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: pidfile_update_conds():path: /var/run/parent.pid, mask: 00000200
finit: pidfile_update_conds():No matching svc for /var/run/parent.pid
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000200
finit: pidfile_update_conds():No matching svc for /var/run/myservice.pid
finit: service_retry():myservice crashed, trying to start it again, attempt 5
finit: Service myservice[7634] died, restarting in 2000 msec (5/10)
finit: service_step():               myservice(   0):   halted  enabled/clean   cond:off
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: service_step():               myservice(   0): ->  waiting
finit: service_step():               myservice(   0):  waiting  enabled/clean   cond:off
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/running
finit: cond_set_oneshot_noupdate():service/myservice/running => /var/run/finit/cond/service/myservice/running
finit: cond_update():service/myservice/running
finit: Successfully restarted crashing service myservice.
troglobit commented 1 year ago

Thanks again for your detailed reports, very appreciated! I'll have a look at it, starting with a regression test to reproduce the issue.

I agree with your suggestion in 2. above, it'll likely solve the problem.

troglobit commented 1 year ago

There, it works better now, all retry jobs are cancelled when a service loses conditions, and their state machines are unblocked and driven to 'waiting' state as expected (without passing through the retry job). I hope this is to your satisfactory.

Unfortunately it's impossible to fix it as we talked about (stop dependant tasks before stopping the one the user asked for), without a rather large refactor, because of the way conditions are implemented -- it's a chicken and the egg situation, basically. I'd rather not start that refactor now since I fear that would delay this release way into next year, and we're already late as it is.

JackNewman12 commented 1 year ago

This is almost fixed.

Having a requirement of <pid/parent> works perfectly.

But doing <service/parent/running> still has the running with pid 0 issue.

I guess its the missing cond_update(): match line.


finit: api_cb():svc query: parent
finit: svc_parse_jobstr():Got str:'parent'
finit: svc_parse_jobstr():Got token:'parent'
finit: api_cb():stop parent
finit: svc_parse_jobstr():Got str:'parent'
finit: svc_parse_jobstr():Got token:'parent'
finit: service_step():                 parent(4537):  running disabled/clean   cond:on
finit: service_stop():Sending SIGTERM to pid:4537 name:parent
finit: Stopping parent[4537], sending SIGTERM ...
finit: service_stop():kill(-4537, 15) => rc 0, errno 2
finit: svc_set_state():parent is stopping, wait 3 sec before sending SIGKILL ...
finit: cond_clear():service/parent/
finit: cond_clear_noupdate():service/parent/
finit: cond_set_path():/var/run/finit/cond/service/parent/ <= 0
finit: service_step():                 parent(4537): -> stopping
finit: service_step():                 parent(4537): stopping disabled/clean   cond:on
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000008
finit: pidfile_update_conds():Found svc myservice for /var/run/myservice.pid with pid 4539
finit: pidfile_update_conds():Setting myservice PID file to /var/run/myservice.pid
finit: cond_set():pid/myservice
finit: cond_set_noupdate():pid/myservice
finit: cond_set_path():/var/run/finit/cond/pid/myservice <= 2
finit: cond_set():service/myservice/ready
finit: cond_set_noupdate():service/myservice/ready
finit: cond_set_path():/var/run/finit/cond/service/myservice/ready <= 2
finit: cond_update():service/myservice/ready
finit: sigchld_cb():Collected child 4537, status: 256
finit: service_monitor():collected parent(4537), normal exit: 1, signaled: 0, exit code: 1
finit: service_step():                 parent(   0): stopping disabled/clean   cond:on
finit: service_step():parent: stopped, cleaning up timers and conditions ...
finit: cond_clear():pid/parent
finit: cond_clear_noupdate():pid/parent
finit: cond_set_path():/var/run/finit/cond/pid/parent <= 0
finit: cond_update():pid/parent
finit: cond_clear():service/parent/
finit: cond_clear_noupdate():service/parent/
finit: cond_set_path():/var/run/finit/cond/service/parent/ <= 0
finit: cond_set_oneshot():service/parent/stopped
finit: cond_set_oneshot_noupdate():service/parent/stopped => /var/run/finit/cond/service/parent/stopped
finit: cond_update():service/parent/stopped
finit: service_step():                 parent(   0): ->  stopped
finit: service_step():                 parent(   0):  stopped disabled/clean   cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: sigchld_cb():Collected child 4539, status: 0
finit: service_monitor():collected myservice(4539), normal exit: 1, signaled: 0, exit code: 0
finit: service_step():               myservice(   0):  running  enabled/clean   cond:off
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/restart
finit: cond_set_oneshot_noupdate():service/myservice/restart => /var/run/finit/cond/service/myservice/restart
finit: cond_update():service/myservice/restart
finit: service_step():delayed restart of myservice
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: pidfile_update_conds():path: /var/run/parent.pid, mask: 00000008
finit: pidfile_update_conds():No matching svc for /var/run/parent.pid
finit: pidfile_update_conds():path: /var/run/parent.pid, mask: 00000200
finit: pidfile_update_conds():No matching svc for /var/run/parent.pid
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000200
finit: pidfile_update_conds():No matching svc for /var/run/myservice.pid
finit: service_retry():myservice crashed, trying to start it again, attempt 5
finit: Service myservice[4539] died, restarting in 2000 msec (5/10)
finit: service_step():               myservice(   0):   halted  enabled/clean   cond:off
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: service_step():               myservice(   0): ->  waiting
finit: service_step():               myservice(   0):  waiting  enabled/clean   cond:off
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/running
finit: cond_set_oneshot_noupdate():service/myservice/running => /var/run/finit/cond/service/myservice/running
finit: cond_update():service/myservice/running
finit: Successfully restarted crashing service myservice.
troglobit commented 1 year ago

Hmm, OK I'll have a look. Thanks for testing!

troglobit commented 1 year ago

Reproduced the issue with <service/parent/running>. Have a pretty good idea what it is.

troglobit commented 1 year ago

Puh, that was great! Awesome finds, thank you so much @JackNewman12 :clap::clap::clap: