troglobit / finit

Fast init for Linux. Cookies included
https://troglobit.com/projects/finit/
MIT License
632 stars, 63 forks

Bug when service is crashed and restarted initctl shows wrong pid #226

Closed hongkongkiwi closed 2 years ago

hongkongkiwi commented 2 years ago

I have the following service:

service [12345789] name:myapp :media pid:!myapp:media /usr/sbin/myapp-media -P /run/myapp:media.pid -- Media daemon

Using initctl status myapp:media gives the correct result.

However, if I kill the process using the PID provided above (I am simulating a crash), the app is restarted by finit, but the PID shown by initctl status is not updated afterwards.

The PID file itself is correct because it is managed by my service; it just seems that finit doesn't reread it (or update its internal db) when restarting the service.

This is a problem for me because I'm using my workaround command from #225 to send signals. I would prefer to have initctl tell me the correct PID rather than read the PID file in my own script, because then the script needs to know the PID file name.

troglobit commented 2 years ago

I'm afraid this must be your kernel again. I just added a new test¹ for this particular case and I cannot reproduce the problem. Allow me to explain a little more about the monitoring in Finit: when a well-behaved service (A) starts up in the foreground, Finit knows its PID, but to be able to safely start any dependent services (B and C) it waits for the service (A) to create its PID file. Finit reads all PID files created in /run on every inotify event from the kernel. If it finds the PID it is waiting for in the expected PID file, the pid condition of service (A) is asserted.
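To make the PID-file check described above concrete, here is a minimal sketch in Python. It is not Finit's code (Finit is written in C and reacts to inotify events); the function names `read_pid` and `pid_condition_met` and the file name `myapp.pid` are hypothetical, and the inotify trigger is left out so that only the comparison step is shown.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def read_pid(pidfile: Path) -> Optional[int]:
    """Return the PID stored in pidfile, or None if it is missing or malformed."""
    try:
        return int(pidfile.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def pid_condition_met(pidfile: Path, expected_pid: int) -> bool:
    """True once the PID file exists and names the PID the supervisor started."""
    return read_pid(pidfile) == expected_pid


# Simulate a service that creates its PID file some time after startup.
with tempfile.TemporaryDirectory() as run_dir:
    pidfile = Path(run_dir) / "myapp.pid"          # hypothetical file name
    expected = os.getpid()                          # stand-in for the service PID
    before = pid_condition_met(pidfile, expected)   # PID file not created yet
    pidfile.write_text(f"{expected}\n")
    after = pid_condition_met(pidfile, expected)    # PID file now matches
    print(before, after)
```

In this toy model the condition stays unmet until a file change is noticed and the check is rerun, which is why a missed notification could leave stale state behind.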

Hence, if inotify is not working properly that mechanism is broken. There may be unexpected behavior/artifacts in internal structs when this occurs, e.g. wrong PID shown etc.

__
¹ the first run failed because I forgot to add the testcase to EXTRA_DIST. Here's a link to the second run: https://github.com/troglobit/finit/runs/5167392996?check_suite_focus=true#step:7:612

troglobit commented 2 years ago

So, I have to retract my previous statement ... I got a very similar report (privately) from a client. They had spotted a behavior just as you described, but with dnsmasq, when reconfiguring their system at runtime. Finit refused to restart the dying service, hanging on to its old PID.

I've been attempting to recreate this problem using the test case I mentioned previously, start-kill-service.sh. It's been really hard ... that is, until I increased the number of laps of the kill/restart sequence from 1000 to 100000! It turns out I get the same behavior anywhere from around lap 2000 to lap 74000; it's not been consistent at all.
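The general shape of such a kill/restart stress loop can be sketched as follows. This is not the contents of start-kill-service.sh; `ToySupervisor` and `run_laps` are hypothetical names, the "service" is just a sleeping Python process, and the sketch assumes a POSIX system where `SIGKILL` is available.

```python
import os
import signal
import subprocess
import sys


class ToySupervisor:
    """Keeps one foreground child running and reports its PID."""

    def __init__(self) -> None:
        self.child = self._spawn()

    @staticmethod
    def _spawn() -> subprocess.Popen:
        # Stand-in service: sleeps in the foreground until it is killed.
        return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

    @property
    def pid(self) -> int:
        return self.child.pid

    def reap_and_restart(self) -> None:
        self.child.wait()            # collect the dead child
        self.child = self._spawn()   # restart; the reported PID follows the new child

    def stop(self) -> None:
        self.child.kill()
        self.child.wait()


def run_laps(laps: int) -> int:
    """Kill the reported PID `laps` times; return the number of laps with a stale PID."""
    sup = ToySupervisor()
    stale = 0
    try:
        for _ in range(laps):
            old_pid = sup.pid
            os.kill(old_pid, signal.SIGKILL)   # simulate a crash
            sup.reap_and_restart()
            # Stale if the supervisor still reports the old PID or a dead process.
            if sup.pid == old_pid or sup.child.poll() is not None:
                stale += 1
    finally:
        sup.stop()
    return stale


print(run_laps(10))
```

A bug that shows up only after thousands of laps would need a far larger `laps` value than this example uses, and a real test would query the init system's reported PID rather than an in-process object.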

I've had several theories over the last few weeks, but none really panned out until this morning, when I managed to enable logging in a reasonable way and found that Finit does indeed detect the process crashing (so signals aren't lost), but it thinks the process is a forking service (sysv start script) and exits early, waiting for the daemon/script to create its PID file ...
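A toy model of that decision, assuming nothing about Finit's real data structures, might look like the sketch below. The names `Service`, `Action`, and `on_exit` are hypothetical. The point is only that one wrong `forking` flag turns "restart the crashed service" into "keep waiting for a PID file that never arrives".

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RESTART = "restart"
    WAIT_FOR_PIDFILE = "wait-for-pidfile"


@dataclass
class Service:
    name: str
    forking: bool   # True if the started process is expected to fork and exit


def on_exit(svc: Service) -> Action:
    """Decide what to do when the PID that was started has exited."""
    if svc.forking:
        # For a forking service the parent exiting is normal; the real
        # daemon is expected to announce itself through its PID file.
        return Action.WAIT_FOR_PIDFILE
    # A foreground service that exits has crashed or stopped: restart it.
    return Action.RESTART


crashed = Service("myapp-media", forking=False)
misclassified = Service("myapp-media", forking=True)
print(on_exit(crashed).value, on_exit(misclassified).value)
```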

Tweaking the classification of what is a forking service seems to be the solution. I've now rerun the test (100000 laps) twice without a problem! So I'll be adding some more tests to also verify forking services with this tweak, but it's looking very promising.

Thanks for reporting this, and sorry for my being so dismissive earlier!

hongkongkiwi commented 2 years ago

No problem, thank you so much for investigating this further. I too thought it might just have been some strange bug in the earlier inotify implementation on my kernel.

I did think it was a bit strange, as I have used another inotify daemon implementation on my system and, with the tiny patch I mentioned in another thread, I'm able to get (I think) very reliable PID detection, including delete, update, create, etc. That's why I was a bit confused.

troglobit commented 2 years ago

This should now be fixed. Relevant major commits: ba77e4f, 20290a4, a39d958