mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0
171 stars 51 forks source link

Error in readLog() on an SGE cluster #197

Closed wlandau closed 6 years ago

wlandau commented 6 years ago

I recently started encountering errors submitting to an SGE cluster via future.batchtools, future.apply, and this template file.

library(future.apply)
library(future.batchtools)
plan(batchtools_sge, template = "sge-simple.tmpl")
future_lapply(1:2, sqrt)
#>
#> Error in readLog(id, reg = reg) :
#>   Log file for job with id 1 not available
traceback()
#> 12: stop(simpleError(sprintf(...), call = sys.call(sys.parent())))
#> 11: stopf("Log file for job with id %i not available", id$job.id)
#> 10: readLog(id, reg = reg)
#> 9: extractLog(readLog(id, reg = reg), id)
#> 8: getLog(id = jobid, reg = reg)
#> 7: await.BatchtoolsFuture(future, cleanup = FALSE)
#> 6: await(future, cleanup = FALSE)
#> 5: value.BatchtoolsFuture(tmp, ...)
#> 4: value(tmp, ...)
#> 3: values.list(fs)
#> 2: values(fs)
#> 1: future_lapply(1:2, sqrt)
sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Red Hat Enterprise Linux
#> 
#> Matrix products: default
#> BLAS: <CENSORED>/R-3.5.0/lib64/R/lib/libRblas.so
#> LAPACK: <CENSORED>/R-3.5.0/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#> 
#> other attached packages:
#> [1] future.batchtools_0.7.0 future.apply_0.2.0      future_1.8.1
#> [4] usethis_1.3.0           testthat_2.0.0.9000     devtools_1.13.5
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.17      magrittr_1.5      progress_1.1.2    rappdirs_0.3.1
#>  [5] debugme_1.1.0     R6_2.2.2          brew_1.0-6        rlang_0.2.1
#>  [9] globals_0.11.0    tools_3.5.0       parallel_3.5.0    checkmate_1.8.5
#> [13] data.table_1.11.4 withr_2.1.2       base64url_1.4     digest_0.6.15
#> [17] assertthat_0.2.0  crayon_1.3.4      fs_1.2.2          codetools_0.2-15
#> [21] batchtools_0.9.10 memoise_1.1.0     stringi_1.2.2     compiler_3.5.0
#> [25] backports_1.1.2   prettyunits_1.0.2 listenv_0.7.0
wlandau commented 6 years ago

This looks like a batchtools problem, but it could also have something to do with the future ecosystem. I am not sure. cc @HenrikBengtsson.

mllg commented 6 years ago

Is this reproducible? Could be a problem with file system latency.

NB: I've improved the error message in 804dcc3b3dc870b9bc0ef5ed42d6cf90c89fc3b5

mllg commented 6 years ago

c97672eb12ad9743988a55637007f503a03e6be0 eventually solves this if this is an issue with the file system latency.

wlandau commented 6 years ago

I do not know if the same thing happens on other machines. I only have access to my company's SGE cluster, and the problem persists even after updating to https://github.com/mllg/batchtools/commit/c97672eb12ad9743988a55637007f503a03e6be0.

mllg commented 6 years ago

With the latest commit you should get a more informative error message. Can you check if the reported log file exists on the file system?

wlandau commented 6 years ago

Yes, I do see a more informative error message after updating batchtools to the latest commit.

library(future.apply)
library(future.batchtools)
plan(batchtools_sge, template = "sge-simple.tmpl")
future_lapply(1:2, sqrt)
Error: Log file '/home/<CENSORED>/scratch/.future/20180607_111654-OPL6Ge/batchtools_128392604/logs/jobafa83af0c7534ee2be6cb550d768bbad.log' for job with id 1 not available

And sure enough, the folder .future/20180607_111654-OPL6Ge/batchtools_128392604/logs exists, but it is empty.

SGE did write log files (job4b1879cb08606227955b9beb24628cdb.job.o29021485 and jobafa83af0c7534ee2be6cb550d768bbad.job.o29021482) to the working directory of the R session, and those look fine.

### [bt]: This is batchtools v0.9.10.9000
### [bt]: Starting calculation of 1 jobs
### [bt]: Setting working directory to '/home/<CENSORED>/scratch'
### [bt]: Memory measurement disabled
### [bt]: Starting job [batchtools job.id=1]
### [bt]: Setting seed to 1 ...

### [bt]: Job terminated successfully [batchtools job.id=1]
### [bt]: Calculation finished!

The traceback in the R session is basically the same:

> traceback()
16: stop(simpleError(sprintf(...), call = NULL))
15: stopf("Log file '%s' for job with id %i not available", log.file,
        id$job.id)
14: readLog(id, reg = reg)
13: extractLog(readLog(id, reg = reg), id)
12: getLog(id = jobid, reg = reg)
11: await.BatchtoolsFuture(future, cleanup = FALSE)
10: await(future, cleanup = FALSE)
9: value.BatchtoolsFuture(tmp, ...)
8: value(tmp, ...)
7: values.list(fs)
6: values(fs)
5: future_lapply(1:2, sqrt) at reprex.R#5
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("reprex.R")
mllg commented 6 years ago

Ok, we are getting there...

SGE did write log files (job4b1879cb08606227955b9beb24628cdb.job.o29021485 and jobafa83af0c7534ee2be6cb550d768bbad.job.o29021482) to the working directory of the R session, and those look fine.

This is the wrong directory. You need the following line in your template to redirect the output (c.f. https://github.com/mllg/batchtools/blob/master/inst/templates/sge-simple.tmpl):

#$ -j y
#$ -o <%= log.file %>
wlandau commented 6 years ago

It worked, thanks!

wlandau commented 6 years ago

Does the following look right to you for SLURM? I do not have access to SLURM, so I cannot test it.

https://github.com/ropensci/drake/blob/b148baccf2c0a6e454acbb0701567955b1ed8d20/inst/examples/slurm/batchtools.slurm.tmpl#L21-L23