mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

SGE template: defaulting to user R path? #277

Closed · nick-youngblut closed this issue 2 years ago

nick-youngblut commented 2 years ago

The SGE template:

#$ -N {{ job_name }}               # job name
#$ -q default                      # submit to queue named "default"
#$ -j y                            # combine stdout/error in one file
#$ -o {{ log_file | /dev/null }}   # output file
#$ -cwd                            # use pwd as work dir
#$ -V                              # use environment variable
#$ -t 1-{{ n_jobs }}               # submit jobs as array
#$ -pe {{ cores | 1 }}             # number of cores to use per job

ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

seems to default to whichever R is on the user's $PATH, but the RStudio session may be running a different R version, which causes conflicts. How can one switch between R versions in RStudio and still have clustermq work?

mschubert commented 2 years ago

Thanks for bringing this up, it's an interesting question. Here's how I'm currently thinking about it:

As you mention, using a different installation of R can lead to issues, e.g. when a package is not installed there. However, the jobs also leave the head node, so we cannot be entirely sure that the same path will be available on the workers, and we do not know this at the point where we need to decide (the job submission script).

So the current behavior is: for HPC schedulers, we use R from the environment instead of relying on the same path.

If you work on HPC, you will usually set up your R path in the environment anyway. I had not considered the case where RStudio lets you pick the R executable independently of the environment, but at this point I will likely not change the default behavior.

In your case, you can change the last line in your template to:

CMQ_AUTH={{ auth }} /path/you/want/to/use/R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
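
One way to make every session pick this up: save the edited template to its own file and point clustermq at it via options, e.g. in ~/.Rprofile (the template path below is just an example, adjust it to wherever you keep the file):

options(
    clustermq.scheduler = "sge",                  # use the SGE backend
    clustermq.template  = "~/clustermq_sge.tmpl"  # example path to the edited template
)

With these options set, jobs submitted from RStudio (or any other session) use the edited template.
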
nick-youngblut commented 2 years ago

For HPC schedulers, we use R from the environment instead of relying on the same path

At least for us, the R installs are located on a file system that is not available from the HPC, so we cannot use the same R install as the one used within RStudio Server Pro. The only way I see that working is if we create R installs in custom locations that are available from the HPC.

Using an R install other than the one in the RStudio environment is what is currently causing problems for users in my research group. When clustermq executes an HPC job with R installed via conda (instead of the R used by RStudio), renv tries to bootstrap a private R library, which fails. Below is our full clustermq template, which includes conda env activation.

I've posted this issue on behalf of members of my research group. I generally use Jupyter + conda, so I just use the same conda env for the clustermq HPC jobs as for my Jupyter session. It would be quite helpful if RStudio had a similar setup, maybe with the Job Launcher. However, not all HPC schedulers are supported by the Job Launcher (e.g., ours is not).

#!/bin/bash
#$ -N {{ job_name }}                    # job name
#$ -pe parallel {{ cores | 1 }}         # job threads
#$ -l h_rt={{ job_time | 00:59:00 }}    # job time
#$ -l h_vmem={{ job_mem | 7G }}         # job memory
#$ -t 1-{{ n_jobs }}                    # submit jobs as array
#$ -j y                                 # combine stdout/error in one file
#$ -o {{ log_file | /dev/null }}        # output log file
#$ -cwd                                 # use pwd as work dir
#$ -V                                   # use environment variable

. ~/.bashrc
conda activate {{ conda | clustermq }}

export OMP_NUM_THREADS={{ omp.threads | 1 }}
export OPENBLAS_NUM_THREADS={{ blas.threads | 1 }}
export MKL_NUM_THREADS={{ mkl.threads | 1 }}

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
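
For reference, we fill the custom placeholders in this template ({{ conda }}, {{ job_time }}, {{ job_mem }}, ...) per call via Q()'s template argument; roughly like this (the env name and values here are only examples, unspecified placeholders fall back to their "| default" values):

library(clustermq)

# fill the template's custom placeholders at submit time
Q(function(x) x * 2, x = 1:10, n_jobs = 2,
  template = list(conda = "clustermq", job_time = "00:10:00"))
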
mschubert commented 2 years ago

At least for us, the R installs are located on a file system that is not available from the HPC, so we cannot use the same R install as the one used within RStudio Server Pro.

I'm not sure I understand this sentence. Are you saying your RStudio Server Pro install is only available on the head node, and can not be accessed from HPC workers? Then by definition we cannot use the same install for clustermq workers (which I thought was what you wanted, based on your original query).

renv tries to bootstrap a private R library, which fails

I don't see why renv would bootstrap at all? (I'm also not that familiar with renv - but this sounds like it could be fixed on that level, or am I wrong?)

It would be quite helpful if RStudio had a similar setup, maybe with the Job Launcher.

I'm not sure what the Job Launcher is here?

nick-youngblut commented 2 years ago

Are you saying your RStudio Server Pro install is only available on the head node, and can not be accessed from HPC workers?

Essentially, yes. That's the situation.

I don't see why renv would bootstrap at all? (I'm also not that familiar with renv - but this sounds like it could be fixed on that level, or am I wrong?)

We haven't found a solution yet, but it seems to be triggered when the R version in the conda env differs from the version in the Rstudio session.

I'm not sure what the Job Launcher is here?

The docs: https://docs.rstudio.com/job-launcher/latest/. It may just be a feature of the pro version of RStudio Server/Workbench? Regardless, it's of no use to anyone with an SGE HPC.

mschubert commented 2 years ago

We haven't found a solution yet, but it seems to be triggered when the R version in the conda env differs from the version in the Rstudio session.

I don't see why this would happen. clustermq simply starts up an external R session (as specified in the template), and then establishes a connection via the network. It never instructs the workers to install any sort of package.

As long as your external R session starts up as expected (which you can test by starting it manually), this should work. Unfortunately, I do not have enough information to debug this, as it is specific to your environment (and I've never heard of such an issue before).
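
A quick sanity check is to submit a single trivial job and have it report which R the worker actually ended up running, e.g.:

library(clustermq)

# one tiny job: report the worker's R version and install location
Q(function(i) c(version = R.version.string, home = R.home()),
  i = 1, n_jobs = 1)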

nick-youngblut commented 2 years ago

I'm trying to use a custom R install that is available to both RStudio Workbench (on a virtual machine) and all of the institute's cluster nodes (they all share an NFS).

The custom R install was done as follows:

# As in https://docs.rstudio.com/resources/install-r-source/#build-and-install-r
R_VERSION=4.2.0
curl -O https://cran.rstudio.com/src/base/R-4/R-${R_VERSION}.tar.gz
tar -xzvf R-${R_VERSION}.tar.gz
cd R-${R_VERSION}
./configure --prefix=`pwd` --with-pcre1 --enable-memory-profiling --enable-R-shlib --with-blas --with-lapack
make
make install

# update r-versions visible by Rstudio
vim /etc/rstudio/r-versions

However, I get the following when trying to use clustermq to submit cluster jobs (the log from one such cluster job):

WARNING: ignoring environment value of R_HOME

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Failed to find installation of renv -- attempting to bootstrap...
Warning: unable to access index for repository https://cloud.r-project.org/src/contrib:
  cannot open URL 'https://cloud.r-project.org/src/contrib/PACKAGES'
* Downloading renv 0.9.2 ... Error in utils::download.file(url, destfile = destfile, mode = "wb", quiet = TRUE) :
  cannot open URL 'https://api.github.com/repos/rstudio/renv/tarball/0.9.2'
In addition: Warning message:
In utils::download.file(url, destfile = destfile, mode = "wb", quiet = TRUE) :
  URL 'https://api.github.com/repos/rstudio/renv/tarball/0.9.2': status was 'Couldn't resolve host name'
Warning message:
Failed to find an renv installation: the project will not be loaded.
Use `renv::activate()` to re-initialize the project.
> clustermq:::worker("tcp://morty:6044")
Error in readRDS(pfile) :
  cannot read workspace version 3 written by R 4.2.0; need R 3.5.0 or newer
Calls: ::: ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted

The cluster job is using the R located at /usr/local/bin/R instead of the custom R 4.2.0 install. Since R 3.4.4 is not R 4.2.0, renv tries to bootstrap the entire package environment and fails because the cluster node has no internet access.

What is very odd is:

Error in readRDS(pfile) : cannot read workspace version 3 written by R 4.2.0; need R 3.5.0 or newer

mschubert commented 2 years ago

Thanks for narrowing this down, it confirms that the package installation issues are unrelated to clustermq.

Your users need to add the same R they are using in RStudio to their $PATH, or specify their R path in their template. Details here.
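
To get the exact path to hard-code in the template (or to prepend to $PATH), they can run this from within their RStudio session:

# path of the R binary that the current (RStudio) session is running
file.path(R.home("bin"), "R")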