mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Optional heartbeating for the master process #131

Closed: wlandau closed this issue 5 years ago

wlandau commented 5 years ago

on.exit(qsys$finalize()) has no chance to run if the master is killed with SIGKILL. This comes up in https://github.com/ropensci/drake/issues/772 (related: https://github.com/mschubert/clustermq/issues/130 and https://github.com/r-lib/callr/issues/99). Just as heartbeating for workers is proposed in #33, could we talk about optional heartbeating for the master process?
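
For illustration, the failing pattern looks roughly like this (a minimal sketch with a hypothetical wrapper, not drake's actual code): on.exit() handlers only run when R unwinds the call stack, so a hard kill of the master never reaches them and the scheduler jobs are orphaned.

```r
# Hypothetical wrapper around clustermq::workers(), for illustration only.
run_with_workers <- function(fun, n_jobs = 2) {
  qsys <- clustermq::workers(n_jobs = n_jobs)  # submits the scheduler jobs
  on.exit(qsys$finalize())                     # cleanup; never runs after SIGKILL
  fun(qsys)
}
```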

mschubert commented 5 years ago

Looking at Gabor's comments in https://github.com/r-lib/callr/issues/99, shouldn't this rather be solved by sending an interrupt to the master process?

This would be equivalent to hitting Ctrl+C on the master, which should clean up the jobs.
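
For illustration, the programmatic equivalent of Ctrl+C is sending SIGINT to the master's PID, e.g. with base R's tools::pskill() (here `master_pid` is a hypothetical variable holding that PID):

```r
# Send SIGINT to the running master session (Unix-alikes); this is the
# programmatic equivalent of pressing Ctrl+C there.
tools::pskill(master_pid, tools::SIGINT)
```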

I'm afraid "heartbeating" between processes on the same system doesn't really make sense, because they do not communicate via a socket connection that may be dropped. Monitoring processes is a different issue from monitoring these socket connections.

wlandau commented 5 years ago

> Looking at Gabor's comments in r-lib/callr#99, shouldn't this rather be solved by sending an interrupt to the master process?

Yes, but because callr uses SIGKILL rather than SIGINT, it is difficult to do this in a way that generalizes to all the callr::r*() functions drake supports. If the clustermq master process were to clean up the workers upon receiving SIGKILL, that would solve my problem, but I suppose a custom SIGKILL handler may not be safe.

> I'm afraid "heartbeating" between processes on the same system doesn't really make sense, because they do not communicate via a socket connection that may be dropped. Monitoring processes is a different issue from monitoring these socket connections.

I was thinking the workers could monitor the master process using the existing ZeroMQ sockets. That way, if the master died from a non-SIGINT signal, the workers could still clean themselves up.

mschubert commented 5 years ago

> Yes, but because callr uses SIGKILL rather than SIGINT

You can still send SIGINT manually upon drake interrupt. Add a grace period, then shut down the callr process.

I would argue that drake should give its R child processes the chance to clean up.
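
For a background child, this could be sketched with the processx methods that callr process objects inherit (a sketch of the suggestion above, not drake's implementation; the 5-second grace period is an arbitrary choice):

```r
# Placeholder child standing in for the session that runs the clustermq master.
px <- callr::r_bg(function() Sys.sleep(60))

# ... on user interrupt in drake ...
px$interrupt()                # same as Ctrl+C in the child, so clustermq can clean up
px$wait(timeout = 5000)       # grace period, in milliseconds
if (px$is_alive()) px$kill()  # force-terminate only if it did not exit in time
```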

> If the clustermq master process were to clean up the workers upon receiving SIGKILL

As you rightly note, this is not how SIGKILL should be handled.

> I was thinking the workers could monitor the master process using the existing ZeroMQ sockets

In a way, this already happens using timeouts: If a worker does not get new work after 600 seconds (default), it will shut itself down.

This timeout could be shortened as part of #33, I will make a note there.

wlandau commented 5 years ago

> You can still send SIGINT manually upon drake interrupt. Add a grace period, then shut down the callr process.

Gabor explained to me how to do this with r_bg(). How would you suggest I do this with the blocking callr::r*() functions?

> In a way, this already happens using timeouts: If a worker does not get new work after 600 seconds (default), it will shut itself down.

Is it possible to set the timeout in the template file or template argument? @pat-s, I think this could be a nice workaround for https://github.com/ropensci/drake/issues/772. drake uses clustermq::workers(), which does not have a timeout argument.

mschubert commented 5 years ago

> How would you suggest I do this with the blocking callr::r*() functions?

Can you point me to where you use them in drake? I don't fully understand why you need this for clustermq.

> Is it possible to set the timeout in the template file or template argument?

Yes, just pass the argument to the clustermq:::worker call in your template. If it is a fill-able field, you can set it like any other template value.
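
For illustration, assuming the scheduler template contains a {{ timeout }} placeholder on its clustermq:::worker line, the value could then be filled in from R like this (a sketch; the 120-second value is arbitrary):

```r
# Entries of `template` fill the {{ }} placeholders in the scheduler template,
# so a {{ timeout }} field ends up as the worker's `timeout` argument.
wrk <- clustermq::workers(n_jobs = 2, template = list(timeout = 120))
```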

wlandau commented 5 years ago

> Can you point me to where you use them in drake? I don't fully understand why you need this for clustermq.

I use the callr functions for r_make() and friends, explained here. The implementation lives at https://github.com/ropensci/drake/blob/master/R/api-callr.R.

r_make() calls make() in a clean reproducible session. By default, the session is launched with callr::r(), but the user can supply a custom function using the r_fn argument, e.g. r_make(r_fn = callr::r_bg).

> Yes, just pass the argument to the clustermq:::worker call in your template. If it is a fill-able field, you can set it like any other template value.

Awesome! For drake, that would be make(template = list(timeout = 60)). Can I assume the template file does not need a special infuser placeholder in this case?

mschubert commented 5 years ago

The template line should look like this if you want to be able to configure the timeout via the template argument:

```sh
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout={{ worker_timeout }})'
```

or like this if not:

```sh
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}", timeout=600)'
```

Note that workers might hit a 60-second timeout if you send large objects and have many workers: for example, if 100 workers start up simultaneously and sending the common data takes 0.6 seconds per worker, the last worker will already have waited 60 seconds before it receives any work. This will not be that uncommon.

mschubert commented 5 years ago

@wlandau I understand the callr::r() issue now.

I don't see how to handle this cleanly, however. We handle SIGINT (which callr doesn't use), and there cannot (and should not) be a handler for SIGKILL. I haven't seen a way to handle SIGTERM from within R (and I am not sure whether callr uses SIGTERM or SIGKILL).

The only way I see is to have callr send SIGINT before SIGTERM/SIGKILL.

mschubert commented 5 years ago

Closing this; everything we can handle here is covered by #33.