mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Running the coordinating R/clustermq process on a different HPC node #216

Closed mattwarkentin closed 3 years ago

mattwarkentin commented 3 years ago

Hi @mschubert,

When using the ssh + slurm combination to run jobs, this is my mental model of how things seem to work (going with the simple case of 1 worker process):

1. Process 1: the local R session that spawns the jobs over SSH
2. Process 2: the coordinating R/clustermq process on the cluster
3. Process 3: the worker process submitted via slurm

The worker communicates with the coordinating process, and the coordinating process communicates back with the spawning process. Is this a correct mental model?

If so, is there a way to get Process 2 to run on a node other than the login/head node?
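For context, a minimal sketch of the SSH + Slurm setup being discussed (the host name and job count are placeholders, not from this thread):

```r
# Local machine: route everything through SSH to the cluster
options(
    clustermq.scheduler = "ssh",
    clustermq.ssh.host  = "user@login.hpc.example.org"  # placeholder host
)

# The remote end submits via the scheduler configured there
# (e.g. clustermq.scheduler = "slurm" in the remote ~/.Rprofile),
# and Q() is called as usual:
library(clustermq)
Q(function(x) x + 1, x = 1:3, n_jobs = 1)
```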

mschubert commented 3 years ago

That's right, except that your Process 3 is one worker/process per array index.

The R process that runs on the head node should be very light in terms of CPU, but it does hold your common_data in memory. There is currently no way to circumvent this.

I suppose the reason you're asking is because you're worried about memory usage on the head node?

My answer to this would be that common_data should be small when sent via SSH, and larger data sets should be accessed via network storage.
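One way to follow that advice is to keep the large object on shared storage and send only its path through SSH; a sketch, with the path and job count made up for illustration:

```r
library(clustermq)

# Instead of const = list(big = big_object), which ships the object
# through the SSH tunnel and keeps it in the coordinator's memory,
# pass a path on network storage that is visible to the workers:
res <- Q(
    fun = function(x, path) {
        big <- readRDS(path)  # each worker reads from shared storage
        x * length(big)
    },
    x = 1:10,
    const = list(path = "/shared/project/common_data.rds"),  # placeholder path
    n_jobs = 2
)
```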

mattwarkentin commented 3 years ago

> I suppose the reason you're asking is because you're worried about memory usage on the head node?

Actually, I just received a bit of a scathing email from my institute's sysadmin. They are rather vigilant about stopping users from executing any long-running processes on the head node, understandably so. In this particular case, I was doing some debugging yesterday and, for various reasons, my jobs terminated in an unusual way that left many orphaned coordinating R/clustermq processes on the head node, which were apparently still running today.

Presumably if I provide a different address to options(clustermq.ssh.host = "..."), such as a dev/interactive node with slurm job submission permissions, this would circumvent the head node, right?

mschubert commented 3 years ago

> Presumably if I provide a different address to options(clustermq.ssh.host = "..."), such as a dev/interactive node with slurm job submission permissions, this would circumvent the head node, right?

Yes, that should work - as long as you set it up in your .ssh/config (I assume you know how, otherwise I can type it out)
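For anyone reading along, a hypothetical `~/.ssh/config` entry for such a dev node (all host and user names are placeholders):

```
# ~/.ssh/config
Host devnode
    HostName dev-node.hpc.example.org
    User myuser
    # Hop via the head node if the dev node is not directly reachable:
    ProxyJump login.hpc.example.org
```

With that in place, `options(clustermq.ssh.host = "devnode")` should direct the coordinating process to the dev node instead of the head node.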

mattwarkentin commented 3 years ago

Okay, great. I will give it a try. Thanks.

mattwarkentin commented 3 years ago

While it's on my mind, do you think there is any value in adding an argument to Q()/Q_rows() that allows the user to pass the clustermq configuration options directly (e.g. clustermq.scheduler, etc.)? This would avoid the hidden argument issue. Right now Q() isn't self-contained, since its behaviour depends on externally defined global options. These options might live in the same script, a separate script, or a startup file like .Rprofile.

If it had an argument for opts/options/whatever, then you could pass these as a list. Global options set with options() could be used as a fallback. By default, the function's new options argument could look for global options and use qsys_default in their absence:

```r
Q <- function(<current args>, options = getOption("clustermq.scheduler", qsys_default))
```

Thoughts?

mschubert commented 3 years ago

There is no hidden argument issue.

options only specify how Q is run, never what Q returns. Moving to a different setup with e.g. a different clustermq.scheduler does not rely on changing function arguments, and I'd argue that code should be portable between compute environments.

(Note that if you really want to you can already circumvent this by passing Q(..., workers=workers(qsys_id=...)).)
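A sketch of that escape hatch, choosing the backend per call instead of via global options (the qsys_id value here is picked for illustration; see ?workers for the accepted backends):

```r
library(clustermq)

# Create a worker pool with an explicitly chosen backend ...
w <- workers(n_jobs = 1, qsys_id = "multicore")

# ... and hand it to Q(), bypassing the clustermq.scheduler option:
Q(function(x) x^2, x = 1:3, workers = w)
```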