Closed JackNewman12 closed 1 year ago
Hmm, I'll look into the start/stop/restarting issue. Haven't seen anything like that myself yet, but I'll get back to you.
Re: PID files. By default, Finit expects services to maintain their own PID files. It monitors /run for new files (with some heuristics applied) to sniff new PID files and match them to the PID it recently forked off. PID files are actually the primary synchronization/readiness mechanism, which you've probably already seen traces of.
E.g., if service B depends on A you can set a condition like this:
service <pid/A> B
Only trouble is that service A must reassert (touch) it's PID file after the following:
initctl edit A
initctl reload # Sends SIGHUP by default to A
Most daemons today don't rewrite/touch their PID file on SIGHUP, so PID condition A will then remain in "flux" and B will not be restarted properly. I don't know if this could be your problem, since you don't mention how your services are started by Finit, if you use any conditions for instance. But I thought it worth mentioning since I write so crappy docs.
Finit has support for creating PID files on its own, as well as figuring out stuff from forking daemons (like was the practise with SysV init scripts): pid:/path/to/pidfile
. The syntax is described further in the documentation.
Since most daemons don't behave properly a few other standards for "readiness notification" were invented. For example, in systemd the application is expected to write READY=1\n
on a socket to systemd after start and restart/reload. (For details, see issue #299 which was just added to Finit master for v4.4.)
Sorry, cannot reproduce. I'm running Finit master (on myLinux) with the following config for my service:
root@anarchy:~# initctl show mdnsd
# Launch mDNS-SD daemon
service [2345789] /usr/sbin/mdnsd -sn -- mDNS-SD daemon
When I stop it I get the following trace. Worth noting is the log message mdnsd( 0): stopping disabled/clean cond:on
, which tells me that Finit has received the message to stop the service. In your case above it still says 'enabled' which is quite puzzling:
root@anarchy:~# initctl stop mdnsd
finit[1]: api_cb():svc query: mdnsd
finit[1]: svc_parse_jobstr():Got str:'mdnsd'
finit[1]: svc_parse_jobstr():Got token:'mdnsd'
finit[1]: api_cb():stop mdnsd
finit[1]: svc_parse_jobstr():Got str:'mdnsd'
finit[1]: svc_parse_jobstr():Got token:'mdnsd'
finit[1]: service_step(): mdnsd(11528): running disabled/clean cond:on
finit[1]: service_stop():Sending SIGTERM to pid:11528 name:mdnsd
finit[1]: Stopping mdnsd[11528], sending SIGTERM ...
finit[1]: service_stop():kill(-11528, 15) => rc 0, errno 2
finit[1]: svc_set_state():mdnsd is stopping, wait 3 sec before sending SIGKILL ...
finit[1]: cond_clear():service/mdnsd/
finit[1]: cond_clear_noupdate():service/mdnsd/
finit[1]: cond_set_path():/run/finit/cond/service/mdnsd/ <= 0
finit[1]: service_step(): mdnsd(11528): -> stopping
finit[1]: service_step(): mdnsd(11528): stopping disabled/clean cond:on
root@anarchy:~# finit[1]: generic_io_cb():Calling I/O pidfile from runloop...
finit[1]: pidfile_update_conds():path: /run/mdnsd.pid, mask: 00000200
finit[1]: pidfile_update_conds():Found svc mdnsd for /run/mdnsd.pid with pid 11528
finit[1]: cond_clear():pid/mdnsd
finit[1]: cond_clear_noupdate():pid/mdnsd
finit[1]: cond_set_path():/run/finit/cond/pid/mdnsd <= 0
finit[1]: cond_update():pid/mdnsd
finit[1]: cond_set():service/mdnsd/ready
finit[1]: cond_set_noupdate():service/mdnsd/ready
finit[1]: cond_set_path():/run/finit/cond/service/mdnsd/ready <= 2
finit[1]: cond_update():service/mdnsd/ready
finit[1]: sigchld_cb():Collected child 11528, status: 0
finit[1]: service_monitor():collected mdnsd(11528), normal exit: 1, signaled: 0, exit code: 0
finit[1]: service_step(): mdnsd( 0): stopping disabled/clean cond:on
finit[1]: service_step():mdnsd: stopped, cleaning up timers and conditions ...
finit[1]: cond_clear():pid/mdnsd
finit[1]: cond_clear_noupdate():pid/mdnsd
finit[1]: cond_set_path():/run/finit/cond/pid/mdnsd <= 0
finit[1]: cond_clear():service/mdnsd/
finit[1]: cond_clear_noupdate():service/mdnsd/
finit[1]: cond_set_path():/run/finit/cond/service/mdnsd/ <= 0
finit[1]: cond_set_oneshot():service/mdnsd/stopped
finit[1]: cond_set_oneshot_noupdate():service/mdnsd/stopped => /run/finit/cond/service/mdnsd/stopped
finit[1]: cond_update():service/mdnsd/stopped
finit[1]: service_step(): mdnsd( 0): -> stopped
finit[1]: service_step(): mdnsd( 0): stopped disabled/clean cond:on
finit[1]: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit[1]: cgroup_handle_event():event: '/sys/fs/cgroup/system/mdnsd/cgroup.events', mask: 00000002
When I start it up again I can see Finit registering my wish to enable it:
root@anarchy:~# initctl restart mdnsd
finit[1]: api_cb():svc query: mdnsd
finit[1]: svc_parse_jobstr():Got str:'mdnsd'
finit[1]: svc_parse_jobstr():Got token:'mdnsd'
finit[1]: api_cb():restart mdnsd
finit[1]: svc_parse_jobstr():Got str:'mdnsd'
finit[1]: svc_parse_jobstr():Got token:'mdnsd'
finit[1]: service_step(): mdnsd( 0): halted enabled/clean cond:on
finit[1]: cond_clear():service/mdnsd/
finit[1]: cond_clear_noupdate():service/mdnsd/
finit[1]: cond_set_path():/run/finit/cond/service/mdnsd/ <= 0
finit[1]: service_step(): mdnsd( 0): -> waiting
finit[1]: service_step(): mdnsd( 0): waiting enabled/clean cond:on
finit[1]: cond_clear():service/mdnsd/
finit[1]: cond_clear_noupdate():service/mdnsd/
finit[1]: cond_set_path():/run/finit/cond/service/mdnsd/ <= 0
finit[1]: service_step(): mdnsd( 0): -> UNKNOWN
finit[1]: service_step(): mdnsd( 0): UNKNOWN enabled/clean cond:on
finit[1]: cgroup_leaf_init():group system, name mdnsd, pid 11836, cfg
finit[1]: group_init():path /sys/fs/cgroup/system/mdnsd, leaf 1, cfg
finit[1]: iwatch_add():adding new watcher for path /sys/fs/cgroup/system/mdnsd/cgroup.events
finit[1]: Starting mdnsd[11836]
finit[1]: cond_clear():service/mdnsd/
finit[1]: cond_clear_noupdate():service/mdnsd/
finit[1]: cond_set_path():/run/finit/cond/service/mdnsd/ <= 0
finit[1]: cond_set_oneshot():service/mdnsd/running
finit[1]: cond_set_oneshot_noupdate():service/mdnsd/running => /run/finit/cond/service/mdnsd/running
finit[1]: cond_update():service/mdnsd/running
finit[1]: service_step(): mdnsd(11836): -> running
finit[1]: service_step(): mdnsd(11836): running enabled/clean cond:on
root@anarchy:~# finit[1]: generic_io_cb():Calling I/O pidfile from runloop...
finit[1]: pidfile_update_conds():path: /run/mdnsd.pid, mask: 00000002
finit[1]: pidfile_update_conds():Found svc mdnsd for /run/mdnsd.pid with pid 11836
finit[1]: pidfile_update_conds():Setting mdnsd PID file to /run/mdnsd.pid
finit[1]: cond_set():pid/mdnsd
finit[1]: cond_set_noupdate():pid/mdnsd
finit[1]: cond_set_path():/run/finit/cond/pid/mdnsd <= 2
finit[1]: cond_update():pid/mdnsd
finit[1]: cond_set():service/mdnsd/ready
finit[1]: cond_set_noupdate():service/mdnsd/ready
finit[1]: cond_set_path():/run/finit/cond/service/mdnsd/ready <= 2
finit[1]: cond_update():service/mdnsd/ready
finit[1]: pidfile_update_conds():path: /run/mdnsd.pid, mask: 00000008
finit[1]: pidfile_update_conds():Found svc mdnsd for /run/mdnsd.pid with pid 11836
finit[1]: pidfile_update_conds():Setting mdnsd PID file to /run/mdnsd.pid
finit[1]: cond_set():pid/mdnsd
finit[1]: cond_set_noupdate():pid/mdnsd
finit[1]: cond_set_path():/run/finit/cond/pid/mdnsd <= 2
finit[1]: cond_set():service/mdnsd/ready
finit[1]: cond_set_noupdate():service/mdnsd/ready
finit[1]: cond_set_path():/run/finit/cond/service/mdnsd/ready <= 2
So ... unfortunately I cannot reproduce your issue, or even speculate further what it could be. Sorry :-/
Ah, one thing that might be interesting for you, don't know if you've noticed it yourself, but in your restart case:
finit: service_monitor():collected myservice(4727), normal exit: 1, signaled: 0, exit code: 1
Your application exits with exit code 1, that's why it's constantly restarting.
Sorry I should have stated my service file. It was really simple so I just brushed over it
service name:myservice /sbin/myservice
I didn't like that I saw what looks like a "pid file found" message as part of the stopping so I deleted the pidfile.so plugin from the disk. Now starting/stopping all of my services is reliable.
I am not sure if removing this plugin will break my service<->service dependencies.
What is the difference between <service/myservice>
vs <pid/myservice>
? I guess the former doesn't wait for the myservice to be "ready" but rather just starts the child service as soon as myservice is enabled.
For reference here is a good vs bad daemon log. The service files are identical service /sbin/XXX
with different binaries.
To make the diff a little easier on the eyes I replaced the names and PIDs in the logs to match. (Does github support side-by-side diff markdown? It would make this so much easier to parse).
It seems the good service does a cond_set()
, but the misbehaving service does a cond_clear()
before the cond_set()
EDIT: Looking at the code I wonder if finit thinks these are forking services? Is that automatically detected? Either way none of these daemons are forking, just a simple mainloop with while(true).
Good:
init: api_cb():svc query: myservice
finit: svc_parse_jobstr():Got str:'myservice'
finit: svc_parse_jobstr():Got token:'myservice'
finit: api_cb():stop myservice
finit: svc_parse_jobstr():Got str:'myservice'
finit: svc_parse_jobstr():Got token:'myservice'
finit: service_step(): myservice(2300): running disabled/clean cond:on
finit: service_stop():Sending SIGTERM to pid:2300 name:myservice
finit: Stopping myservice[2300], sending SIGTERM ...
finit: service_stop():kill(-2300, 15) => rc 0, errno 2
finit: svc_set_state():myservice is stopping, wait 3 sec before sending SIGKILL ...
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: service_step(): myservice(2300): -> stopping
finit: service_step(): myservice(2300): stopping disabled/clean cond:on
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000008
finit: pidfile_update_conds():Found svc myservice for /var/run/myservice.pid with pid 2300
finit: pidfile_update_conds():Setting myservice PID file to /var/run/myservice.pid
finit: cond_set():pid/myservice
finit: cond_set_noupdate():pid/myservice
finit: cond_set_path():/var/run/finit/cond/pid/myservice <= 2
finit: cond_set():service/myservice/ready
finit: cond_set_noupdate():service/myservice/ready
finit: cond_set_path():/var/run/finit/cond/service/myservice/ready <= 2
finit: cond_update():service/myservice/ready
finit: sigchld_cb():Collected child 2300, status: 0
finit: service_monitor():collected myservice(2300), normal exit: 1, signaled: 0, exit code: 0
finit: service_step(): myservice( 0): stopping disabled/clean cond:on
finit: service_step():myservice: stopped, cleaning up timers and conditions ...
finit: cond_clear():pid/myservice
finit: cond_clear_noupdate():pid/myservice
finit: cond_set_path():/var/run/finit/cond/pid/myservice <= 0
finit: cond_update():pid/myservice
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/stopped
finit: cond_set_oneshot_noupdate():service/myservice/stopped => /var/run/finit/cond/service/myservice/stopped
finit: cond_update():service/myservice/stopped
finit: service_step(): myservice( 0): -> stopped
finit: service_step(): myservice( 0): stopped disabled/clean cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000200
finit: pidfile_update_conds():No matching svc for /var/run/myservice.pid
Bad:
init: api_cb():svc query: myservice
finit: svc_parse_jobstr():Got str:'myservice'
finit: svc_parse_jobstr():Got token:'myservice'
finit: api_cb():stop myservice
finit: svc_parse_jobstr():Got str:'myservice'
finit: svc_parse_jobstr():Got token:'myservice'
finit: service_step(): myservice(2300): running disabled/clean cond:on
finit: service_stop():Sending SIGTERM to pid:2300 name:myservice
finit: Stopping myservice[2300], sending SIGTERM ...
finit: service_stop():kill(-2300, 15) => rc 0, errno 2
finit: svc_set_state():myservice is stopping, wait 3 sec before sending SIGKILL ...
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: service_step(): myservice(2300): -> stopping
finit: service_step(): myservice(2300): stopping disabled/clean cond:on
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000008
finit: pidfile_update_conds():Found svc myservice for /var/run/myservice.pid with pid 2300
finit: pidfile_update_conds():Setting myservice PID file to /var/run/myservice.pid
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/running
finit: cond_set_oneshot_noupdate():service/myservice/running => /var/run/finit/cond/service/myservice/running
finit: cond_update():service/myservice/running
finit: cond_set():pid/myservice
finit: cond_set_noupdate():pid/myservice
finit: cond_set_path():/var/run/finit/cond/pid/myservice <= 2
finit: cond_set():service/myservice/ready
finit: cond_set_noupdate():service/myservice/ready
finit: cond_set_path():/var/run/finit/cond/service/myservice/ready <= 2
finit: cond_update():service/myservice/ready
finit: sigchld_cb():Collected child 2300, status: 0
finit: service_monitor():collected myservice(2300), normal exit: 1, signaled: 0, exit code: 0
finit: Stopped myservice[2300]
finit: service_step(): myservice( 0): running enabled/clean cond:on
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/restart
finit: cond_set_oneshot_noupdate():service/myservice/restart => /var/run/finit/cond/service/myservice/restart
finit: cond_update():service/myservice/restart
finit: service_step():delayed restart of myservice
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000200
finit: pidfile_update_conds():No matching svc for /var/run/myservice.pid
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: api_cb():get runlevel
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: api_cb():get runlevel
finit: service_retry():myservice crashed, trying to start it again, attempt 3
finit: Service myservice[2300] died, restarting in 2000 msec (3/10)
finit: service_step(): myservice( 0): halted enabled/clean cond:on
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: service_step(): myservice( 0): -> waiting
finit: service_step(): myservice( 0): waiting enabled/clean cond:on
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: service_step(): myservice( 0): -> UNKNOWN
finit: service_step(): myservice( 0): UNKNOWN enabled/clean cond:on
finit: Starting myservice[3689]
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/running
finit: cond_set_oneshot_noupdate():service/myservice/running => /var/run/finit/cond/service/myservice/running
finit: cond_update():service/myservice/running
finit: service_step(): myservice(3689): -> running
finit: service_step(): myservice(3689): running enabled/clean cond:on
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000002
finit: pidfile_update_conds():Found svc myservice for /var/run/myservice.pid with pid 3689
finit: pidfile_update_conds():Setting myservice PID file to /var/run/myservice.pid
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/running
finit: cond_set_oneshot_noupdate():service/myservice/running => /var/run/finit/cond/service/myservice/running
finit: cond_update():service/myservice/running
finit: cond_set():pid/myservice
finit: cond_set_noupdate():pid/myservice
finit: cond_set_path():/var/run/finit/cond/pid/myservice <= 2
finit: cond_set():service/myservice/ready
finit: cond_set_noupdate():service/myservice/ready
finit: cond_set_path():/var/run/finit/cond/service/myservice/ready <= 2
finit: cond_update():service/myservice/ready
As for the restarting bug, I think you are correct and I just missed the fact the daemon was crashing due to me stopping some other services. Whoops.
To add to the fire, I didn't mention yesterday that I also see services get stuck in a "running" state (both with and without the pidfile plugin). I did a bit more work and I believe I have found the reason. If myservice is crashing and I call initctl stop myservice
it gets into this incorrect state.
The service shows as "running" with a PID of 0. ps
shows the service is not actually running.
# initctl status myservice
Status : running
Identity : myservice
Description : myservice
Origin : /etc/init.d/myservice.conf
Command : /sbin/myservice
PID file : none
PID : 0
User : root
Group : root
Uptime : N/A
Restarts : 1 (0/10)
Runlevels : [--234-----]
Calling stop is basically a nop
and the service is still listed as "running".
# initctl stop myservice
finit: api_cb():svc query: myservice
finit: svc_parse_jobstr():Got str:'myservice'
finit: svc_parse_jobstr():Got token:'myservice'
finit: api_cb():stop myservice
finit: svc_parse_jobstr():Got str:'myservice'
finit: svc_parse_jobstr():Got token:'myservice'
finit: service_step(): myservice( 0): running disabled/clean cond:on
# initctl status myservice
Status : running
Identity : myservice
Description : myservice
Origin : /etc/init.d/myservice.conf
Command : /sbin/myservice
PID file : none
PID : 0
User : root
Group : root
Uptime : N/A
Restarts : 1 (0/10)
Runlevels : [--234-----]
How to recreate the state:
finit: service_monitor():collected myservice(4293), normal exit: 1, signaled: 0, exit code: 1
finit: service_step(): myservice( 0): running enabled/clean cond:on
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/restart
finit: cond_set_oneshot_noupdate():service/myservice/restart => /var/run/finit/cond/service/myservice/restart
finit: cond_update():service/myservice/restart
finit: service_step():delayed restart of myservice
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
init: api_cb():svc query: myservice
finit: svc_parse_jobstr():Got str:'myservice'
finit: svc_parse_jobstr():Got token:'myservice'
finit: api_cb():stop myservice
finit: svc_parse_jobstr():Got str:'myservice'
finit: svc_parse_jobstr():Got token:'myservice'
finit: service_step(): myservice( 0): stopped disabled/clean cond:on
# initctl status myservice
Status : stopped (code=exited, status=1/FAILURE)
Identity : myservice
Description : myservice
Origin : /etc/init.d/myservice.conf
Command : /sbin/myservice
PID file : none
PID : 0
User : root
Group : root
Uptime : N/A
Restarts : 7 (6/10)
Runlevels : [--234-----]
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/running
finit: cond_set_oneshot_noupdate():service/myservice/running => /var/run/finit/cond/service/myservice/running
finit: cond_update():service/myservice/running
finit: Successfully restarted crashing service myservice.
# initctl status myservice
Status : running
Identity : myservice
Description : myservice
Origin : /etc/init.d/myservice.conf
Command : /sbin/myservice
PID file : none
PID : 0
User : root
Group : root
Uptime : N/A
Restarts : 7 (6/10)
Runlevels : [--234-----]
So it looks like doing a service stop also needs cancel the retry timer.
Some more debugging later and sure enough in pidfile.c
in pidfile_update_conds()
the svc_is_forking()
check is true.
No idea how that happens. I'll keep searching
So these daemons take some time to boot as they rely on certain things. I guess this causes finit to detect they are forking from pidfile.c
doing a pid_file_set(svc, fn, 1);
I am happy with that, no harm done. Except the service_forked(svc);
causes the state machine to reset.
With my very limited knowledge, I think one of these fixes the issue: (probably missing a bunch of edge cases I don't know about yet)
void service_forked(svc_t *svc)
/* Stop forking_retry() timer */
service_timeout_cancel(svc);
- /* Not crashing, restore RUNNING state */
- svc_unblock(svc);
- svc_set_state(svc, SVC_RUNNING_STATE);
+ if (!svc_is_stopping(svc)) {
+ /* Not crashing, restore RUNNING state */
+ svc_unblock(svc);
+ svc_set_state(svc, SVC_RUNNING_STATE);
+ }
}
Although I guess the mask check earlier should be different. A closed pid file probably shouldn't trigger a svc_started()
, service_forked()
and service_ready()
at all.
@@ -81,29 +81,37 @@ static void pidfile_update_conds(char *dir, char *name, uint32_t mask)
dbg("Found svc %s for %s with pid %d", svc->name, fn, svc->pid);
mkcond(svc, cond, sizeof(cond));
- if (mask & (IN_CLOSE_WRITE | IN_ATTRIB | IN_MODIFY | IN_MOVED_TO)) {
+ if (mask & (IN_ATTRIB | IN_MODIFY | IN_MOVED_TO)) {
svc_started(svc);
if (!svc_has_pidfile(svc)) {
dbg("Setting %s PID file to %s", svc->name, fn);
pid_file_set(svc, fn, 1);
}
+ Add some sort of dedicated IN_CLOSE_WRITE check later? Perhaps its just part of the IN_DELETE check?
To fix the restarting/stop race condition I threw in this quick fix:
--- a/user/init/src/api.c
+++ b/user/init/src/api.c
@@ -63,6 +63,7 @@ static int stop(svc_t *svc, void *user_data)
if (!svc)
return 1;
+ service_timeout_cancel(svc);
svc_stop(svc);
service_step(svc);
Although I wouldn't be surprised if you prefer this cancel logic to be inside the service_step()
state machine. I just didn't want to try and wrap my head around it 😂
Let us know if I can help further and if you want me to make a pull request, or if you are happy to smash out some (probably much better) fixes.
This is very interesting, I guess the results above are with latest Finit master? Anyway, I'll be going over this all today (CET) to see if I can reproduce, thank you! If you have any other input on smallest possible setup to reproduce, I'd be most grateful. I try to encode problems like these in test cases, many of them use the stupid test/common/serv.c daemon, which I'll use two of with different name:..
to reproduce.
Some quick comments/replies to your questions above:
service type:forking ...
serivce pid:!/wait/for/this/pid.file ...
pidfile_update_conds()
is admittedly weird, but iirc it comes from how Quagga/FRR create/update PID files -- I'll include this too in the testing/reproductionI guess the results above are with latest Finit master?
Yes. f6b5b02
There are two ways for Finit to detect if a service is forking, both are declarative, none of which you use so there's definitely something fishy going on:
From my debug prints, it seems like:
service_register()
gets called. svc->forking == 0
and pidfile == NULL
pid_file_set(_,_,1)
gets called from pidfile_update_conds()
which sets the pid path to !/run/myservice.pid
(TBD if /run/myservice.pid exists by now or not)service_register()
gets called (again?) this time svc->forking == 0
but since the pidfile has been set to !/run/myservice.pid
forking gets overridden to true.The mask in pidfile_update_conds() is admittedly weird
Is this an issue? All of our daemons just close the pid file and leave it alone. I think init should clean these up IIRC.
Re, pull-request
Cheers. You can just credit me directly. I should probably update my github profile since I don't work at Cohda Wireless anymore 😅
I feel like a finit.debug
would help, but I don't seem to get any extra prints. I know the rescue
cmdline works so either finit is ignoring me or I need to add some extra verbosity flags :/
I guess the results above are with latest Finit master? Yes. f6b5b02
Great, thanks!
There are two ways for Finit to detect if a service is forking, both are declarative, none of which you use so there's definitely something fishy going on: From my debug prints, it seems like:
service_register()
gets called.svc->forking == 0
andpidfile == NULL
- The service gets started
pid_file_set(_,_,1)
gets called frompidfile_update_conds()
which sets the pid path to!/run/myservice.pid
(TBD if /run/myservice.pid exists by now or not)service_register()
gets called (again?) this timesvc->forking == 0
but since the pidfile has been set to!/run/myservice.pid
forking gets overridden to true.
Ah right, PID file detection is overloaded with the legacy forking option ... ouch. OK, I'll have a look at this, gives me a better idea also what to focus on when reproducing.
The service_register()
function is called only at boot, runlevel change, and initctl reload
.
The mask in pidfile_update_conds() is admittedly weird Is this an issue? All of our daemons just close the pid file and leave it alone. I think init should clean these up IIRC.
What do you mean, clean up PID files? If an application creates them it's their responsibility, or their start script (SysV init).
Re, pull-request Cheers. You can just credit me directly. I should probably update my github profile since I don't work at Cohda Wireless anymore sweat_smile
OK :smiley:
I feel like a
finit.debug
would help, but I don't seem to get any extra prints. I know therescue
cmdline works so either finit is ignoring me or I need to add some extra verbosity flags :/
Not following here. But just fyi, there are no more verbosity flags. -- finit.debug
on kernel command line enables debugging at boot time and initctl debug
toggles it at runtime.
Status update: have reproduced both the running
+pid:0 issue and the weird change from non-forking type to forking (pretty sure service_register()
is the culprit for the latter).
Just to clarify, that pidfile.so sets the PID file is just the pidfile auto-discovery mechanism. So that we want to keep, since it's used by the process synchronization with <pid/foo>
conditions.
I'll start working on a regression test now to trigger these bugs and then commit fixes to them. Your restart callback cancellation being one of them.
There, regression test and fixes in place. Hope I nailed all your various issues down ... leaving this issue open until you've had the time to verify.
(Don't mind the failed CI run for :x: 3b601d0, just me having messed up the chroot for tests in Azure. Works now as intended.)
Perfect! Looks like this issue is solved. Manually starting/stopping/crashing is all good and I can't fault it. My handful of daemons that got the forking status are no longer getting that flag and also seem to be behaving.
🎉🎉 Thank you
Not sure if I should make a new ticket, but I found the same race condition with crashing and conditions.
service <service/parent/running> /bin/myservice
## or
service <pid/parent> /bin/myservice
parent
servicemyservice
crashes due to parent
disappearing (I guess before finit has had a chance to stop it? TBD I think finit should probably kill "dependent" services before stopping the user requested one.)myservice
retry timer starts initctl cond
shows <service/parent/running>
as offmyservice
timer gets hit and the service becomes running
with pid of 0
again.I can't seem to get this same race condition using the /usr
feature. I guess the SIGHUP must cover it up.
If I do a initctl stop myservice; initctl start myservice
then I correctly get the waiting
status on myservice
The log:
finit: api_cb():svc query: parent
finit: svc_parse_jobstr():Got str:'parent'
finit: svc_parse_jobstr():Got token:'parent'
finit: api_cb():stop parent
finit: svc_parse_jobstr():Got str:'parent'
finit: svc_parse_jobstr():Got token:'parent'
finit: service_step(): parent(7633): running disabled/clean cond:on
finit: service_stop():Sending SIGTERM to pid:7633 name:parent
finit: Stopping parent[7633], sending SIGTERM ...
finit: service_stop():kill(-7633, 15) => rc 0, errno 2
finit: svc_set_state():parent is stopping, wait 3 sec before sending SIGKILL ...
finit: cond_clear():service/parent/
finit: cond_clear_noupdate():service/parent/
finit: cond_set_path():/var/run/finit/cond/service/parent/ <= 0
finit: service_step(): parent(7633): -> stopping
finit: service_step(): parent(7633): stopping disabled/clean cond:on
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: pidfile_update_conds():path: /var/run/parent.pid, mask: 00000008
finit: pidfile_update_conds():Found svc parent for /var/run/parent.pid with pid 7633
finit: pidfile_update_conds():Setting parent PID file to /var/run/parent.pid
finit: cond_set():pid/parent
finit: cond_set_noupdate():pid/parent
finit: cond_set_path():/var/run/finit/cond/pid/parent <= 2
finit: cond_set():service/parent/ready
finit: cond_set_noupdate():service/parent/ready
finit: cond_set_path():/var/run/finit/cond/service/parent/ready <= 2
finit: cond_update():service/parent/ready
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000008
finit: pidfile_update_conds():Found svc myservice for /var/run/myservice.pid with pid 7634
finit: pidfile_update_conds():Setting myservice PID file to /var/run/myservice.pid
finit: cond_set():pid/myservice
finit: cond_set_noupdate():pid/myservice
finit: cond_set_path():/var/run/finit/cond/pid/myservice <= 2
finit: cond_set():service/myservice/ready
finit: cond_set_noupdate():service/myservice/ready
finit: cond_set_path():/var/run/finit/cond/service/myservice/ready <= 2
finit: cond_update():service/myservice/ready
finit: sigchld_cb():Collected child 7633, status: 256
finit: service_monitor():collected parent(7633), normal exit: 1, signaled: 0, exit code: 1
finit: service_step(): parent( 0): stopping disabled/clean cond:on
finit: service_step():parent: stopped, cleaning up timers and conditions ...
finit: cond_clear():pid/parent
finit: cond_clear_noupdate():pid/parent
finit: cond_set_path():/var/run/finit/cond/pid/parent <= 0
finit: cond_update():pid/parent
finit: cond_clear():service/parent/
finit: cond_clear_noupdate():service/parent/
finit: cond_set_path():/var/run/finit/cond/service/parent/ <= 0
finit: cond_set_oneshot():service/parent/stopped
finit: cond_set_oneshot_noupdate():service/parent/stopped => /var/run/finit/cond/service/parent/stopped
finit: cond_update():service/parent/stopped
finit: service_step(): parent( 0): -> stopped
finit: service_step(): parent( 0): stopped disabled/clean cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: sigchld_cb():Collected child 7634, status: 0
finit: service_monitor():collected myservice(7634), normal exit: 1, signaled: 0, exit code: 0
finit: service_step(): myservice( 0): running enabled/clean cond:off
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/restart
finit: cond_set_oneshot_noupdate():service/myservice/restart => /var/run/finit/cond/service/myservice/restart
finit: cond_update():service/myservice/restart
finit: service_step():delayed restart of myservice
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: pidfile_update_conds():path: /var/run/parent.pid, mask: 00000200
finit: pidfile_update_conds():No matching svc for /var/run/parent.pid
finit: pidfile_update_conds():path: /var/run/myservice.pid, mask: 00000200
finit: pidfile_update_conds():No matching svc for /var/run/myservice.pid
finit: service_retry():myservice crashed, trying to start it again, attempt 5
finit: Service myservice[7634] died, restarting in 2000 msec (5/10)
finit: service_step(): myservice( 0): halted enabled/clean cond:off
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: service_step(): myservice( 0): -> waiting
finit: service_step(): myservice( 0): waiting enabled/clean cond:off
finit: generic_io_cb():Calling I/O pidfile from runloop...
finit: cond_clear():service/myservice/
finit: cond_clear_noupdate():service/myservice/
finit: cond_set_path():/var/run/finit/cond/service/myservice/ <= 0
finit: cond_set_oneshot():service/myservice/running
finit: cond_set_oneshot_noupdate():service/myservice/running => /var/run/finit/cond/service/myservice/running
finit: cond_update():service/myservice/running
finit: Successfully restarted crashing service myservice.
Great to hear! :-)
Yes, please create a new issue with this issue. It's super hard to track different problems in the same issue. I agree with your suggestion in 2. above, and it'll likely solve the problem.
Awesome, thank you!
When attempting to stop or restart a service they can get into wonky states where finit doesn't seem to correctly stop or start them again.
Some side information:
All of our daemons run in the foreground
Our daemons create their own PID file as they start. The documentation seems to indicate finit should create/destroy PID files for the services automatically but if I make the daemon write to a different PID file finit seems to automatically detect this as displayed by the pid path in
initctl status myservice
. I attempted to use the forking mode and manually selecting the PID file but it does not seem to fix the issue below.Most of the services that seem to have issues are using wrapper scripts:
Sometimes certain services seem to have trouble starting and stopping.
The alternative is that after an application has crashed, or I have changed runlevels (which has the same not-stopping/starting bug), it seems like I can't get certain services to start again:
Running the application manually works perfectly. Watching ps while finit attempts to restart shows that nothing ever gets started.
This happens on both v4.3 with the restart fix patch and master. I can't remember if I saw this on the pure v4.3 branch