mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

Error in readLog() on Slurm cluster #273

Open stuvet opened 3 years ago

stuvet commented 3 years ago

Problem

I'll cross-post a related feature request for future.batchtools, but I'm reliably getting:

Error : Log file ... for job with id 1 not available

when submitting jobs using future.batchtools on a small Slurm cluster.

Importantly, it only ever happens when a new machine has to be recruited. Because I allow certain partitions to scale down to 0 machines when not in use, I reliably get this error the first time I submit code to these partitions, whether via future.batchtools or directly from batchtools. Jobs run as expected once a node is running & idle.

The log files do exist, and the jobs continue to run on the cluster after batchtools::readLog() errors.

Environment

Slurm cluster on Google Compute Engine, set up like this with a near-default config, but with preemptible partitions which can scale down to 0 machines.

SessionInfo

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /apps/R/4.1.0/lib64/R/lib/libRblas.so
LAPACK: /apps/R/4.1.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1]  forcats_0.5.1               stringr_1.4.0               purrr_0.3.4                googleCloudStorageR_0.6.0 
 [5] readr_1.4.0                 tidyr_1.1.3                 tibble_3.1.2                ggplot2_3.3.3              
 [9] tidyverse_1.3.1             brms_2.15.0                 Rcpp_1.0.6                  dplyr_1.0.6                
[13] future.batchtools_0.10.0    batchtools_0.9.15           stringi_1.6.2               future_1.21.0              
[17] stantargets_0.0.2.9000      tarchetypes_0.2.0           targets_0.4.2               aws.s3_0.3.21  

Template

#!/bin/bash

## Job Resource Interface Definition
##
## ntasks [integer(1)]:       Number of required tasks,
##                            Set larger than 1 if you want to further parallelize
##                            with MPI within your job.
## ncpus [integer(1)]:        Number of required cpus per task,
##                            Set larger than 1 if you want to further parallelize
##                            with multicore/parallel within each task.
## walltime [integer(1)]:     Walltime for this job, in seconds.
##                            Must be at least 60 seconds for Slurm to work properly.
## memory   [integer(1)]:     Memory in megabytes for each cpu.
##                            Must be at least 100 (when I tried lower values my
##                            jobs did not start at all).
##
## Default resources can be set in your .batchtools.conf.R by defining the variable
## 'default.resources' as a named list.

<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>

#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= ceiling(resources$walltime / 60) %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
<%= if (!is.null(resources$partition)) sprintf("#SBATCH --partition='%s'", resources$partition) %>
<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>
## Initialize work environment like
## source /etc/profile
## module add ...

## Export value of DEBUGME environment var to slave process
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

## Run R:
## we merge R output with stdout from SLURM, which gets then logged via --output option
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

Potential Fix

Jobs reliably take 60 seconds (occasionally up to 69 seconds) to begin after submission when a partition has to recruit a new machine. This means waitForFile times out with the default fs.latency=65. I've experimented with both scheduler.latency & fs.latency, and extending fs.latency even to 260 seconds doesn't solve the problem (why? Is waitForFile getting an unexpected value for path by trying to run fs::path_dir(fn) before the machine is fully set up?). My problem is solved by increasing scheduler.latency to 60-70 seconds, which allows me to drop fs.latency down to 10-20.

This solves my problem, but makes batchtools slow to recognise that the job has completed.
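
For reference, both latencies can be set where the cluster functions are constructed in .batchtools.conf.R. A minimal sketch (the template path and the values shown are assumptions, not recommendations):

cluster.functions = batchtools::makeClusterFunctionsSlurm(
  template          = "slurm.tmpl",  # assumed path to the template above
  scheduler.latency = 65,            # sleep after interactions with the scheduler
  fs.latency        = 10             # how long waitForFile waits for the log file to appear
)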

Feature Request

Divide scheduler.latency into two values: one for the initial sleep after submission, and one for subsequent interactions with the scheduler.

stuvet commented 3 years ago

This problem has been solved by the pull request here, though it may reveal another bug in future.batchtools as outlined here.

After this bugfix I can run batchtools jobs on a Slurm cluster as expected, even when the machine needs to be provisioned. However, despite adapting my Slurm cluster's NFS settings to sync as hinted at here, I still need an fs.latency of 70-80 seconds, though I can now use a scheduler.latency of 1.

kpj commented 2 years ago

@stuvet: Thanks for looking into this in more detail! I am currently experiencing what seems to be the same issue (Error: Log file '[..]' for job with id 1 not available) with LSF. Manually setting scheduler.latency = 60 in makeClusterFunctionsLSF also solved the issue in my case. As some of the listed PRs are Slurm-specific (https://github.com/mllg/batchtools/pull/277/files) while others should also affect LSF (https://github.com/mllg/batchtools/pull/276/files), I am wondering whether you have a suggestion as to which changes would need to be applied to the LSF system as well.
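
For context, the workaround currently sits in my .batchtools.conf.R and looks roughly like this (the template path is an assumption):

cluster.functions = batchtools::makeClusterFunctionsLSF(
  template          = "lsf.tmpl",  # assumed template location
  scheduler.latency = 60           # default is 1; 60 makes the error go away for me
)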

stuvet commented 2 years ago

Hi @kpj, I'm sorry, I have no idea about LSF, but I was mistaken in thinking that I needed fs.latency = 60: this hid the problem for a while, but it wasn't the root cause.

The main problem for me was the mismatch between the full log-file path passed to waitForFile and the basenames returned by list.files without the full.names argument. That code path is only hit if the log file hasn't already been registered when readLog is first called, so I suspect a race condition that only manifested when I had to wait for new workers to be recruited. I also changed some of the network mount flags on my Slurm cluster to try to reduce filesystem latency; I'm not sure whether that helped, but it may be worth checking if you have access.
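
To illustrate the mismatch (a toy example with a made-up path, not batchtools' actual code):

log.file <- "/home/user/registry/logs/job1.log"   # hypothetical full path handed to waitForFile
list.files(dirname(log.file))                     # returns basenames only, e.g. "job1.log"
list.files(dirname(log.file), full.names = TRUE)  # returns full paths
# Comparing the full path against basenames never matches, so the wait times out.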

I also noticed that some of the Slurm status codes were missing from makeClusterFunctionsSlurm, though I think that resulted in a different error. Perhaps check that you're on the most recent version of batchtools, and that the makeClusterFunctionsLSF listJobs calls are likely to return all the jobs you might expect? On that note, do you see a delay between submitting the job and the job running on LSF? Does the job continue to run after batchtools has hit the error? If so, I'd be suspicious that the listJobs or listJobsQueued functions in makeClusterFunctionsLSF aren't finding all the jobs. Do you have any way to test this? If that turns out to be the case, you could get around it by passing a custom set of functions to makeClusterFunctions rather than relying on the default makeClusterFunctionsLSF (see the sketch below).
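
A rough, untested sketch of that wiring: reuse the stock LSF implementation and swap in your own queued-job query (the function body is a placeholder; the bjobs flags match batchtools' stock call and could be broadened or post-processed if jobs are slipping through):

library(batchtools)

lsf <- makeClusterFunctionsLSF(template = "lsf.tmpl")  # assumed template path

# Placeholder: same bjobs call as the stock listJobsQueued;
# adjust the flags or post-process the ids here if jobs are slipping through the net
listJobsQueuedCustom <- function(reg) {
  res <- runOSCommand("bjobs", c("-u $USER", "-w", "-p"))
  stringi::stri_extract_first_regex(tail(res$output, -1L), "\\d+")
}

cluster.functions <- makeClusterFunctions(
  name                 = "LSF-custom",
  submitJob            = lsf$submitJob,
  killJob              = lsf$killJob,
  listJobsQueued       = listJobsQueuedCustom,
  listJobsRunning      = lsf$listJobsRunning,
  store.job.collection = TRUE,   # mirrors the stock LSF cluster functions
  scheduler.latency    = 60,
  fs.latency           = 65
)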

Sorry I can't be more helpful, but hopefully it gives you some options about where to look next.

kpj commented 2 years ago

Thanks a lot for your response!

I also tried to change some of the network mount flags in my Slurm cluster to try to reduce filesystem latency. I'm not sure if this helped, but It may be worth you checking this if you have access.

Unfortunately, I don't have access, and I'm used to network-latency-related issues.

check you're on the most recent version of batchtools

I cloned mllg/batchtools and am on the latest commit (63a6f812927f6dc78595abbffac0f6bc9651f67a).

check that the makeClusterFunctionsLSF listJobs calls are likely to return all the jobs you might expect

When I print stri_extract_first_regex(tail(res$output, -1L), "\\d+") in makeClusterFunctionsLSF$listJobs (i.e. what the function would return), it does indeed look a bit weird: sometimes there are NA entries. I'll have to look into this in more detail to understand why, and whether that's an issue. But this just wraps LSF's bjobs, so not much should be able to go wrong.

do you see a delay between submitting the job and the job running on LSF?

What kind of delay do you mean? makeClusterFunctionsLSF$submitJob seems to only call runOSCommand("bsub", stdin = outfile), so any delay (i.e. the job being PENDING before RUNNING) should be due to LSF scheduling. But yes, I do indeed observe this kind of delay.

Does the job continue to run after batchtools has hit the error?

Yes. E.g., right now I still have a few jobs running even though my furrr::future_map_dfr call crashed over an hour ago.

Sorry I can't be more helpful, but hopefully it gives you some options about where to look next.

No worries, you already gave me some useful pointers :-)

stuvet commented 2 years ago

What kind of delay do you mean?

In my case it was a provisioning delay: the Slurm partition autoscaled down to 0 workers when not in use, so there was a 60-90 second window while a worker was being recruited before a job could start. During that window Slurm would return a pending status code that wasn't handled properly by makeClusterFunctionsSlurm, & batchtools would exit with an error while the job continued to run on Slurm.

Sometimes there are NA entries.

Sounds like it may be worth chasing up those makeClusterFunctionsLSF$listJobs NA values to see whether they'd be handled correctly by batchtools, or more specifically whether all possible LSF job status codes are captured either via listJobsRunning or via listJobsQueued. One of my issues arose because future.batchtools (which operates via batchtools) wasn't handling missing status values gracefully: it assumed the process had exited, whereas batchtools called directly would have waited a bit longer before polling the job status again. Despite that, the best solution was to update the Slurm queries in makeClusterFunctionsSlurm so that every possible Slurm status code maps to either listJobsRunning or listJobsQueued; a sketch of that idea follows.
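
Roughly what that mapping looks like on the Slurm side (the state lists are illustrative rather than exhaustive; check man squeue for your Slurm version, and wire the functions up via makeClusterFunctions as in the earlier sketch):

# Jobs waiting to start (PD = pending, CF = configuring, i.e. node being provisioned)
listJobsQueuedSlurm <- function(reg) {
  batchtools::runOSCommand("squeue", c("-h", "-o %i", "-u $USER", "-t PD,CF"))$output
}

# Jobs in flight (R = running, S = suspended, CG = completing)
listJobsRunningSlurm <- function(reg) {
  batchtools::runOSCommand("squeue", c("-h", "-o %i", "-u $USER", "-t R,S,CG"))$output
}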

Best of luck with it! This took me ages to debug, especially since I had to track these errors through 3 different R packages & troubleshoot an HPC system that I didn't understand at all!

kpj commented 2 years ago

In my case it was a provisioning delay,

I see, very interesting! I think this is not the case for me.

Sounds like it may be worth chasing up that makeClusterFunctionsLSF$listJobs NA values

When I explicitly removed NA entries from the list of returned job ids, I ended up running into the BatchtoolsExpiration error which you and a few others apparently also encountered before. I created a new issue for that on mllg/batchtools (https://github.com/mllg/batchtools/issues/285).

future.batchtools (which operates via batchtools) wasn't handling missing status values gracefully

As far as I can tell, makeClusterFunctionsLSF does not really use explicit status values (except when determining whether no jobs exist) and instead relies on the -r and -p command-line parameters of bjobs. I'll have to check what the implications of this are.

Best of luck with it! This took me ages to debug, especially since I had to track these errors through 3 different R packages & troubleshoot an HPC system that I didn't understand at all!

Thanks a lot, this is very reassuring 😄

stuvet commented 2 years ago

I ended up running into the BatchtoolsExpiration

Now you're definitely on the right track! (I hope...) I'd leave the NAs in and try to work out how they're being produced: do those -r & -p flags bring up all possible jobs, or do you only have permission to call them as $USER?

As far as I can tell makeClusterFunctionsLSF is not really using explicit status values (except when determining whether no jobs exist) and uses the -r and -p commandline parameters of bjobs instead. I'll have to check what the implications of this are.

Options for bjobs - Looking at this, I'd take a look at the -q flag, as I'd bet these jobs fall through the -p & -r net, and it'd be a quick fix if so. It might also be worth checking the -pe, -pi & -pei flags to make sure those are all captured by -p. It also looks like suspended jobs would fall through the net, but that's probably not common.

kpj commented 2 years ago

Sorry for the late response, our cluster was down...

I'd leave the NAs in, and try to work out how they're being produced

The NAs apparently come from the interspersed message New job is waiting for scheduling; when calling bjobs with -p:

JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
212458300 [..]   PEND  gpu.4h     [..]    -        job615bf5b3fea0c6005c9bd24770e18aea Apr  7 15:46
 New job is waiting for scheduling;
212458315 [..]   PEND  gpu.4h     [..]    -        jobad05dfe510402c7a7c99375441024b54 Apr  7 15:47
 New job is waiting for scheduling;
212458328 [..]   PEND  gpu.4h     [..]    -        jobffeb247934663944e4144a59c46a333b Apr  7 15:47
 New job is waiting for scheduling;
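
Which explains the NAs: the continuation lines contain no digits for the regex to match.

stringi::stri_extract_first_regex(" New job is waiting for scheduling;", "\\d+")
# [1] NA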

do those -r & -p flags bring up all possible jobs, or do you have permission to call them as $USER?

The whole command being executed is actually bjobs -u $USER -w -p, so I only observe my own jobs.

I'd take a look at the -q flag as I'd bet that these fall through the -p & -r net

-q sets the queue for which to list jobs. How could that help in this case?

It might also be worth checking the -pe, -pi & -pei flags to make sure these are all captured by -p.

In my tests, -p did indeed show the union of -pe, -pi and -pei, so this seems fine as well.

It also looks like suspended jobs would fall through the net, but this probably isn't common.

That's definitely something to check later on, but for now none of my jobs get suspended when the crash is triggered.

stuvet commented 2 years ago

-q sets the queue for which to list jobs. How could that help in this case?

Ah, I misread the -q description; I'd understood it to list queued jobs.

The NAs apparently come from the interspersed message New job is waiting for scheduling; when calling bjobs with -p:

Interesting. I wonder whether the batchtools registry sees 3 jobs in your example, or 6 (3 of them without the appropriate job ids & statuses etc.)?

kpj commented 2 years ago

Interesting. I wonder if the batchtools registry sees 3 jobs in your example, or 6 (3 without the appropriate job ids & statuses etc)?

To get the list of running jobs, batchtools looks at the result of chsetdiff: https://github.com/mllg/batchtools/blob/63a6f812927f6dc78595abbffac0f6bc9651f67a/R/clusterFunctions.R#L271-L276

chsetdiff is defined as follows (where chmatch apparently comes from data.table): https://github.com/mllg/batchtools/blob/63a6f812927f6dc78595abbffac0f6bc9651f67a/R/helpers.R#L130-L133

Simulating a call to getBatchIds(reg, status = "queued") would result in the evaluation of something like chsetdiff(c("a", "b", NA, NA), c()), which indeed yields "a" "b" NA NA.
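
That is easy to reproduce with data.table's chmatch (chsetdiff below is copied from the helper linked above):

library(data.table)
chsetdiff <- function(x, y) x[chmatch(x, y, 0L) == 0L]  # batchtools' internal helper
chsetdiff(c("a", "b", NA, NA), character(0))
# [1] "a" "b" NA  NA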

But in the end they remove NA entries so it seems all good: https://github.com/mllg/batchtools/blob/63a6f812927f6dc78595abbffac0f6bc9651f67a/R/clusterFunctions.R#L279

I guess I have to keep searching 😬

kpj commented 2 years ago

FWIW, I gave up on debugging this complex network of packages and ended up implementing my own LSF scheduling wrapper. It is not super sophisticated, but it works for me and allows writing reasonably concise code such as:

foo <- 42
cluster_apply(data.frame(i = 1:3, group = c('A', 'B', 'C')), function(i, group) {
  log_debug("hello")
  data.frame(group = group, i = i, foo = foo, result = foo + 2 * i)
}, .packages = c(logger))

HenrikBengtsson commented 1 year ago

FWIW, in the next release of future.batchtools, you can tweak the underlying cluster functions parameters, e.g.

plan(batchtools_slurm, scheduler.latency = 60, fs.latency = 20)

Until it is on CRAN, you can install the development version using:

remotes::install_github("HenrikBengtsson/future.batchtools", ref="develop")

I'm not sure whether this fixes any of the problems here, but I thought I'd comment in case it does.