troglobit / finit

Fast init for Linux. Cookies included
https://troglobit.com/projects/finit/
MIT License
622 stars 61 forks source link

Service restart delay not working #310

Closed JackNewman12 closed 1 year ago

JackNewman12 commented 1 year ago

When a service is crashing I find that finit seems to restart the app instantly. This means all 10 retries can end up being used in a few milliseconds and ends up disabling the service.

Interestingly I can still see the event timer is running as X seconds later finit will print the "Successfully restarted" line.

The conf file is very simple:

service /sbin/MYSERVICE

Sorry I didn't have timestamps on the log, hopefully my comments are clear enough. Tomorrow I should get a chance to recapture this with timestamps.

Thanks!

finit: sigchld_cb():Collected child 1225
finit: service_monitor():collected MYSERVICE(1225), normal exit: 0, signaled: 1, exit code: 0
finit: service_step():           MYSERVICE(   0):  running  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/restart
finit: cond_set_oneshot_noupdate():service/MYSERVICE/restart => /var/run/finit/cond/service/MYSERVICE/restart
finit: cond_update():service/MYSERVICE/restart
finit: service_step():delayed restart of MYSERVICE
finit: service_step():           MYSERVICE(   0): ->  restart
finit: service_step():           MYSERVICE(   0):  restart disabled/clean   cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: service_retry():MYSERVICE crashed, trying to start it again, attempt 5
finit: Service MYSERVICE[1225] died, restarting (5/10)
finit: service_step():           MYSERVICE(   0):   halted  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: service_step():           MYSERVICE(   0): ->    ready
finit: service_step():           MYSERVICE(   0):    ready  enabled/clean   cond:on

...app ends up restarting instantly ...

finit: service_start():Starting /sbin/MYSERVICE
finit: Starting MYSERVICE[1628]
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/running
finit: cond_set_oneshot_noupdate():service/MYSERVICE/running => /var/run/finit/cond/service/MYSERVICE/running
finit: cond_update():service/MYSERVICE/running
finit: service_step():           MYSERVICE(1628): ->  running
finit: service_step():           MYSERVICE(1628):  running  enabled/clean   cond:on

... Real delay of about 2 seconds ...

finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/running
finit: cond_set_oneshot_noupdate():service/MYSERVICE/running => /var/run/finit/cond/service/MYSERVICE/running
finit: cond_update():service/MYSERVICE/running
finit: Successfully restarted crashing service MYSERVICE.

... waiting some time and manually killing app again ...

finit: sigchld_cb():Collected child 1628
finit: service_monitor():collected MYSERVICE(1628), normal exit: 0, signaled: 1, exit code: 0
finit: service_step():           MYSERVICE(   0):  running  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/restart
finit: cond_set_oneshot_noupdate():service/MYSERVICE/restart => /var/run/finit/cond/service/MYSERVICE/restart
finit: cond_update():service/MYSERVICE/restart
finit: service_step():delayed restart of MYSERVICE
finit: service_step():           MYSERVICE(   0): ->  restart
finit: service_step():           MYSERVICE(   0):  restart disabled/clean   cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: service_retry():MYSERVICE crashed, trying to start it again, attempt 6
finit: Service MYSERVICE[1628] died, restarting (6/10)
finit: service_step():           MYSERVICE(   0):   halted  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: service_step():           MYSERVICE(   0): ->    ready
finit: service_step():           MYSERVICE(   0):    ready  enabled/clean   cond:on

... app restarts instantly ...

finit: service_start():Starting /sbin/MYSERVICE
finit: Starting MYSERVICE[1881]
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/running
finit: cond_set_oneshot_noupdate():service/MYSERVICE/running => /var/run/finit/cond/service/MYSERVICE/running
finit: cond_update():service/MYSERVICE/running
finit: service_step():           MYSERVICE(1881): ->  running
finit: service_step():           MYSERVICE(1881):  running  enabled/clean   cond:on

... manually killing the app again in less than 5 seconds ...

finit: sigchld_cb():Collected child 1881
finit: service_monitor():collected MYSERVICE(1881), normal exit: 0, signaled: 1, exit code: 0
finit: service_step():           MYSERVICE(   0):  running  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/restart
finit: cond_set_oneshot_noupdate():service/MYSERVICE/restart => /var/run/finit/cond/service/MYSERVICE/restart
finit: cond_update():service/MYSERVICE/restart
finit: service_step():delayed restart of MYSERVICE
finit: service_step():           MYSERVICE(   0): ->  restart
finit: service_step():           MYSERVICE(   0):  restart disabled/clean   cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: service_retry():MYSERVICE crashed, trying to start it again, attempt 7
finit: Service MYSERVICE[1881] died, restarting (7/10)
finit: service_step():           MYSERVICE(   0):   halted  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: service_step():           MYSERVICE(   0): ->    ready
finit: service_step():           MYSERVICE(   0):    ready  enabled/clean   cond:on
finit: service_start():Starting /sbin/MYSERVICE

... app restarts instantly ...

finit: Starting MYSERVICE[1889]
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/running
finit: cond_set_oneshot_noupdate():service/MYSERVICE/running => /var/run/finit/cond/service/MYSERVICE/running
finit: cond_update():service/MYSERVICE/running
finit: service_step():           MYSERVICE(1889): ->  running
finit: service_step():           MYSERVICE(1889):  running  enabled/clean   cond:on

... Real delay of about 5 seconds ...

finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/running
finit: cond_set_oneshot_noupdate():service/MYSERVICE/running => /var/run/finit/cond/service/MYSERVICE/running
finit: cond_update():service/MYSERVICE/running
finit: Successfully restarted crashing service MYSERVICE.
troglobit commented 1 year ago

Interesting! I've been troubleshooting a very similar issue.

What version of Finit are you using, or are you on the bleeding edge as myself?

JackNewman12 commented 1 year ago

The latest release v4.3.

I must note that I had to create the /var/run/finit/cond/usr folder myself to get usr conditions working. I wonder if something similar is happening with service folder. I'll have a peek today.

Any thoughts as to where I should be looking or extra debug to enable?

troglobit commented 1 year ago

OK. There are changes related to this coming up in v4.4 that I hope to release soon.

The usr conditions folder is created by the initctl tool, those conditions aren't really supposed to be managed by anything else.

Before we rush ahead with going into the code. Maybe you can tell me a little bit more about MYSERVICE and what you expect Finit to do, and also what are you doing from the outside? Is it a forking daemon, is it expected to crash (I see the signaled: 1 flag going up), or are you in some of the cases above killing the daemon to test it or Finit's restart mechanism?

The particular code in question starts here: https://github.com/troglobit/finit/blob/f483a8aeaa406ad8980ebe27ef46cb5135e89cb3/src/service.c#L1970-L1981

JackNewman12 commented 1 year ago

The usr conditions folder is created by the initctl tool, those conditions aren't really supposed to be managed by anything else.

Yes, although I would expect it to create that folder when required instead of failing. Adding the -create flag to the tool does not fix this. Either way adding a mkdir to a startup script is an easy fix.

Maybe you can tell me a little bit more about MYSERVICE

This bug actually applies to all of our daemons. I've been attempting to switch our (very old) init system over the finit to give us some more powerful features. In this case you can imagine a simple daemon that just sits there and does some Tx/Rx data from Ethernet/Serial/Etc. No forking. It was only when I noticed an application was crashing (i.e. the network was down and we didn't handle it) all 10 retries would happen instantly.

and what you expect Finit to do.

For 99% of cases if the daemon is not running, restart it (with a small delay). Nothing too fancy going on, although with finit I expect we will add some extra conditions like network up/down, service to service dependency, etc.

is it expected to crash (I see the signaled: 1 flag going up), or are you in some of the cases above killing the daemon to test it or Finit's restart mechanism?

In the example above I was just killing the daemon manually using kill MYSERVICE to test Finit's restart mechanism.

I didn't have much time today but it seems to do what I would expect. A service_timeout(service_retry) gets set for X seconds in future but the bottom of the service_step() loop causes it to keep jumping into the next state running-> restart -> halted -> ready. I would have expected once this callback gets set that it returns early so that the callback ends up doing the work. https://github.com/troglobit/finit/blob/f483a8aeaa406ad8980ebe27ef46cb5135e89cb3/src/service.c#L2063-L2067

troglobit commented 1 year ago

Thanks for the quick follow-up, was not expecting that! Very much appreciate some background and where-you're-at status :smiley:

Be aware that Finit use a lot of fairly modern kernel features sometimes people run into issues because of having an older kernel or missing some kernel config options.

OK, I'll look into the usr condition thing as a separate issue if I can reproduce it.

The initial delay, like you saw, very short. But then it should work as you say ... the service_retry() callback should handle the delay. Unless my mind's slipping and I've forgotten something obvious. I'll have a look at it during the day to see what I can find out.

JackNewman12 commented 1 year ago

Be aware that Finit use a lot of fairly modern kernel features sometimes people run into issues because of having an older kernel or missing some kernel config options.

We are using 5.4 IIRC. Is there a list of these kernel features I need to enable? We run a pretty minimal build. I don't see any errors from finit at boot, except missing support for cgroupv2.

Thanks for the support!

troglobit commented 1 year ago

That remains to be documented unfortunately, but all the new eventfd/signalfd/etc. and DEVTMPFS ... the best I can do meanwhile is to give you this: https://github.com/troglobit/myLinux/blob/main/board/amd64/linux_defconfig

Finit can take care of bootstrapping a pretty bare rootfs using, e.g., the bootmisc.so plugin (many plugins started out as optional but are now mandatory, this is one such). There are some more pointers in myLinux if you're curious.

and https://github.com/troglobit/myLinux/blob/main/board/common/busybox_defconfig, if you're on an embedded system with BusyBox. Notice BusyBox must not be built with CONFIG_MODPROBE_SMALL.

troglobit commented 1 year ago

I believe I have reproduced this issue now. There seems to be at least two cooperating bugs at play here. I'll try to get to the bottom of this over the weekend.

troglobit commented 1 year ago

There, finally found the root cause(s) of this! Thank you for taking the time to report it.

JackNewman12 commented 1 year ago

Thanks for fixing the restart issue! I managed to test it today and it is working.

If I kill an application it will restart instantly, however if I kill it again within a few seconds it will get the expected delay. Waiting for 5 seconds and killing the app will have it restart instantly again.

troglobit commented 1 year ago

Thank you for taking the time to test again! :)

Yeah, this is by design. We don't want to penalize a single crash, which could be attributed to startup issues. The intention is to increment the delay if the service crashes continuously -- so we track the number of crashes (per some period of time).