Closed wlandau closed 5 years ago
Looking at Gabor's comments in https://github.com/r-lib/callr/issues/99, shouldn't this rather be solved by sending an interrupt to the master process?
This would be equivalent to hitting Ctrl+C on the master, which should clean up the jobs.
I'm afraid "heartbeating" between processes on the same system doesn't really make sense, because they do not communicate via a socket connection that may be dropped. Monitoring processes is a different issue from monitoring these socket connections.
> Looking at Gabor's comments in r-lib/callr#99, shouldn't this rather be solved by sending an interrupt to the master process?
Yes, but because `callr` uses `SIGKILL` rather than `SIGINT`, this is difficult to do in a way that generalizes to all `callr::r*()` functions for `drake`. If the `clustermq` master process were to clean up the workers upon receiving `SIGKILL`, then that would solve my problem, but I suppose a custom `SIGKILL` handler may not be safe.
> I'm afraid "heartbeating" between processes on the same system doesn't really make sense, because they do not communicate via a socket connection that may be dropped. Monitoring processes is a different issue from monitoring these socket connections.
I was thinking the workers could monitor the master process using the existing ZeroMQ sockets. That way, if the master died from a non-`SIGINT` signal, the workers could still clean themselves up.
> Yes, but because `callr` uses `SIGKILL` rather than `SIGINT`
You can still send `SIGINT` manually upon `drake` interrupt. Add a grace period, then shut down the `callr` process.
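A minimal sketch of that suggestion, assuming the parent holds a `processx` handle to the child (the workload, grace period, and `tryCatch` wiring are illustrative):

```r
library(callr)

# Launch the child in the background so the parent keeps a handle.
px <- r_bg(function() Sys.sleep(Inf))  # stand-in for the real workload

tryCatch(
  px$wait(),  # block until the child finishes
  interrupt = function(cond) {
    px$interrupt()                # SIGINT: lets on.exit() handlers run
    px$wait(timeout = 2000)       # grace period, in milliseconds
    if (px$is_alive()) px$kill()  # SIGKILL only as a last resort
  }
)
```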
I would argue that `drake` should give its R child processes the chance to clean up.
> If the `clustermq` master process were to clean up the workers upon receiving `SIGKILL`

As you rightly note, this is not how `SIGKILL` should be handled.
> I was thinking the workers could monitor the master process using the existing ZeroMQ sockets
In a way, this already happens using timeouts: If a worker does not get new work after 600 seconds (default), it will shut itself down.
This timeout could be shortened as part of #33, I will make a note there.
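In effect, the timeout doubles as a liveness check on the master. A conceptual sketch of such a worker loop using the `rzmq` bindings (the address, message fields, and protocol are made up for illustration; this is not `clustermq`'s actual implementation):

```r
library(rzmq)

ctx <- init.context()
sock <- init.socket(ctx, "ZMQ_REQ")
connect.socket(sock, "tcp://headnode:6124")  # illustrative master address

send.socket(sock, data = list(status = "ready"))
repeat {
  # Wait at most 600 seconds for the master to send work.
  ready <- poll.socket(list(sock), list("read"), timeout = 600)
  if (!isTRUE(ready[[1]]$read)) break  # no work in time: shut down
  msg <- receive.socket(sock)
  if (identical(msg$status, "shutdown")) break
  result <- tryCatch(eval(msg$expr), error = identity)
  send.socket(sock, data = list(status = "result", result = result))
}
```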
> You can still send `SIGINT` manually upon `drake` interrupt. Add a grace period, then shut down the `callr` process.
Gabor explained to me how to do this with `r_bg()`. How would you suggest I do this with the blocking `callr::r*()` functions?
> In a way, this already happens using timeouts: If a worker does not get new work after 600 seconds (default), it will shut itself down.
Is it possible to set the timeout in the template file or `template` argument? @pat-s, I think this could be a nice workaround for https://github.com/ropensci/drake/issues/772. `drake` uses `clustermq::workers()`, which does not have a `timeout` argument.
> How would you suggest I do this with the blocking `callr::r*()` functions?
Can you point me to where you use them in `drake`? I don't fully understand why you need this for `clustermq`.
> Is it possible to set the timeout in the template file or `template` argument?
Yes, just pass the argument to the `clustermq:::worker` call in your template. If it is a fill-able field, you can set it like any other template value.
> Can you point me to where you use them in `drake`? I don't fully understand why you need this for `clustermq`.
I use the `callr` functions for `r_make()` and friends, explained here. The implementation lives at https://github.com/ropensci/drake/blob/master/R/api-callr.R. `r_make()` calls `make()` in a clean reproducible session. By default, the session is launched with `callr::r()`, but the user can supply a custom function using the `r_fn` argument, e.g. `r_make(r_fn = callr::r_bg)`.
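For instance (a sketch assuming an existing `_drake.R` configuration file, per `r_make()`'s conventions, and assuming `r_make()` returns the value of `r_fn`, here a `processx` handle):

```r
library(drake)

# With r_fn = callr::r_bg, make() runs in a background session and we
# keep a handle to it (assuming r_make() returns the r_bg() value).
px <- r_make(r_fn = callr::r_bg)

px$is_alive()   # poll the background pipeline
px$interrupt()  # or stop it gracefully with SIGINT
```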
> Yes, just pass the argument to the `clustermq:::worker` call in your template. If it is a fill-able field, you can set it like any other template value.
Awesome! For `drake`, that would be `make(template = list(timeout = 60))`. Can I assume the template file does not need a special `infuser` placeholder in this case?
The template line should look like:

```sh
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout={{ worker_timeout }})'
```

if you want to be able to configure it via the `template` argument, or

```sh
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout=600)'
```

if not.
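With the first variant, the `drake` side might look like this (a sketch; the plan, job count, and the `worker_timeout` field name, which must match the template, are illustrative):

```r
# Fill the {{ worker_timeout }} template field from drake.
drake::make(
  plan,
  parallelism = "clustermq",
  jobs = 8,
  template = list(worker_timeout = 60)  # seconds before idle workers exit
)
```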
Note that workers might hit a 60-second timeout if you send large objects and have many workers: if 100 workers start up simultaneously and sending the common data takes 0.6 seconds per worker, the last worker already waits a full minute. This will not be that uncommon.
@wlandau I understand the `callr::r()` issue now. I don't see how to handle this cleanly, however. We handle `SIGINT` (which `callr` doesn't use), and there can/should not be a handler for `SIGKILL`. I haven't seen a way to handle `SIGTERM` from within R (and am not sure if `callr` uses `SIGTERM` or `SIGKILL`).
The only way I see is to have `callr` send `SIGINT` before `SIGTERM`/`SIGKILL`.
Closing this, all we can handle is covered by #33.
`on.exit(qsys$finalize())` has no chance to run if the master is killed with a signal other than `SIGINT`. This comes up in https://github.com/ropensci/drake/issues/772 (related: https://github.com/mschubert/clustermq/issues/130 and https://github.com/r-lib/callr/issues/99). Just as heartbeating for workers is proposed for #33, could we talk about optional heartbeating for the master process?