mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

BUG: After waitForJobs(), reduceResultsList() returns a list of `NULL`s and only a bit later the actual values #85

Closed HenrikBengtsson closed 7 years ago

HenrikBengtsson commented 7 years ago

Issue

Using makeClusterFunctionsTorque() on a TORQUE compute cluster, reduceResultsList() at first returns a list of NULL elements and only a bit later a list of the actual values, even after waitForJobs() has returned TRUE. I suspect this is due to the infamous NFS delay combined with batchtools not polling for the existence of the result files.

Example

Sourcing the following test.R script:

library("batchtools")
unlink("registry", recursive = TRUE)
stopifnot(!file_test("-d", "registry"))
reg <- makeRegistry("registry")
cat("#!/bin/bash\nRscript -e 'batchtools::doJobCollection(\"<%= uri %>\")'\n", file = "my.tmpl")
reg$cluster.functions <- makeClusterFunctionsTorque(template = "my.tmpl")
print(reg)

jobs <- batchMap(fun = identity, 1:3, reg = reg)
print(jobs)

submitJobs(reg = reg)

done <- waitForJobs(reg = reg)
stopifnot(done)

t0 <- Sys.time()
count <- 1L
repeat {
  y <- reduceResultsList(reg = reg)
  str(list(count = count, dt = Sys.time() - t0, y = y))
  if (!is.null(y[[1]])) break
  Sys.sleep(0.1)
  count <- count + 1L
}

gives the following:

> library("batchtools")
> unlink("registry", recursive = TRUE)
> stopifnot(!file_test("-d", "registry"))
> reg <- makeRegistry("registry")
> cat("#!/bin/bash\nRscript -e 'batchtools::doJobCollection(\"<%= uri %>\")'\n", file = "my.tmpl")
> reg$cluster.functions <- makeClusterFunctionsTorque(template = "my.tmpl")
> print(reg)
Job Registry
  Name    : Torque
  File dir: /home/henrik/batchtools-review/registry
  Work dir: /home/henrik/batchtools-review
  Jobs    : 0
  Seed    : 567656413

> jobs <- batchMap(fun = identity, 1:3, reg = reg)
Adding 3 jobs ...

> print(jobs)
   job.id
1:      1
2:      2
3:      3

> submitJobs(reg = reg)
Submitting 3 jobs in 3 chunks using cluster functions 'Torque' ...

> done <- waitForJobs(reg = reg)
> stopifnot(done)

> t0 <- Sys.time()
> count <- 1L
> repeat {
+   y <- reduceResultsList(reg = reg)
+   str(list(count = count, dt = Sys.time() - t0, y = y))
+   if (!is.null(y[[1]])) break
+   count < .... [TRUNCATED] 
List of 3
 $ count: int 1
 $ dt   :Class 'difftime'  atomic [1:1] 0.00505
  .. ..- attr(*, "units")= chr "secs"
 $ y    :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
List of 3
 $ count: int 2
 $ dt   :Class 'difftime'  atomic [1:1] 0.017
  .. ..- attr(*, "units")= chr "secs"
 $ y    :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
[...]
List of 3
 $ count: int 425
 $ dt   :Class 'difftime'  atomic [1:1] 4.87
  .. ..- attr(*, "units")= chr "secs"
 $ y    :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
List of 3
 $ count: int 426
 $ dt   :Class 'difftime'  atomic [1:1] 4.89
  .. ..- attr(*, "units")= chr "secs"
 $ y    :List of 3
  ..$ : int 1
  ..$ : int 2
  ..$ : int 3
> 

Troubleshooting / Suggestion

It looks like batchtools:::.reduceResultsList(), which is called by reduceResultsList(), silently skips reading any result files for which file.exists(fns) returns FALSE.

If that is indeed the intended behavior, then I cannot tell whether the bug is in reduceResultsList(), which assumes all the result files are already there when batchtools:::.reduceResultsList() is called, or in waitForJobs(), which should not return until all the result files are there.

As my example code shows, I suspect that there is a delay in the network file system (NFS) causing already written result files to not be visible from the master machine until several seconds later. The file.exists() behavior supports this.

FYI, it looks like calling dir(path = file.path(reg$file.dir, "results")) forces NFS to sync its view such that file.exists() returns TRUE. However, I don't know whether that is a bulletproof solution.

To me it seems like batchtools (waitForJobs() or something else) needs to poll for the result files before they are queried.
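To illustrate what such polling could look like, here is a minimal sketch. This is a hypothetical helper, not part of the batchtools API; the wait_for_files() name, the timeout values, and the list.files()-based cache refresh are my assumptions:

```r
## Hypothetical sketch, not batchtools code: poll until all expected result
## files are visible, or give up after 'timeout' seconds.
wait_for_files <- function(fns, timeout = 30, sleep = 0.5) {
  t0 <- Sys.time()
  repeat {
    if (all(file.exists(fns))) return(TRUE)
    ## Listing the parent directories appears to force the NFS attribute
    ## cache to refresh on some systems (cf. the dir() observation above)
    lapply(unique(dirname(fns)), list.files)
    if (as.numeric(Sys.time() - t0, units = "secs") > timeout) return(FALSE)
    Sys.sleep(sleep)
  }
}
```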

LATE UPDATES:

Session information

This is on a Scyld cluster with TORQUE / PBS + Moab.

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] batchtools_0.9.1-9000 data.table_1.10.4     R.archive_0.3.0      

loaded via a namespace (and not attached):
 [1] prettyunits_1.0.2 digest_0.6.12     assertthat_0.1    R.methodsS3_1.7.1
 [5] rappdirs_0.3.1    R6_2.2.0          backports_1.0.5   magrittr_1.5     
 [9] progress_1.1.2    stringi_1.1.2     R.oo_1.21.0       R.utils_2.5.0    
[13] brew_1.0-6        checkmate_1.8.2   tools_3.3.2      
mllg commented 7 years ago

Henrik,

Thanks for the detailed report. I'll change the result functions to act more consistently in the next days.

Regarding the file system latency: I guess you are hitting the attribute cache of NFS. Are you only experiencing problems for the results? If somehow possible, can you try if you get reliable results with one of:

file.exists = function(x) { isdir = file.info(x)$isdir; !is.na(isdir) & !isdir }
file.exists = function(x) basename(x) %in% list.files(dirname(x))

?

HenrikBengtsson commented 7 years ago

Using:

file_exists_1 <- function(x) { isdir <- file.info(x)$isdir; !is.na(isdir) & !isdir }
file_exists_2 <- function(x) { basename(x) %in% list.files(dirname(x)) }

on the first result file:

pathname <- file.path(reg$file.dir, "results", "1.rds")

I get:

List of 6                                                                     
 $ count        : int 1
 $ file.exists  : logi FALSE
 $ file_exists_1: logi FALSE
 $ file_exists_2: logi TRUE
 $ dt           :Class 'difftime'  atomic [1:1] 0.0138
  .. ..- attr(*, "units")= chr "secs"
 $ y            :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
List of 6
 $ count        : int 2
 $ file.exists  : logi TRUE
 $ file_exists_1: logi TRUE
 $ file_exists_2: logi TRUE
 $ dt           :Class 'difftime'  atomic [1:1] 0.165
  .. ..- attr(*, "units")= chr "secs"
 $ y            :List of 3
  ..$ : int 1
  ..$ : int 2
  ..$ : int 3

I suspect that the first call to file_exists_2(pathname) triggers the NFS cache to be updated. This is in line with my previous comment, where I noticed that dir(path = file.path(reg$file.dir, "results")) seems to fix it.

mllg commented 7 years ago

I've started with a simple waitForFiles() function here. You can set fs.latency in makeClusterFunctions* to a positive number (e.g., 30) to let waitForJobs() also wait for the results.

If the approach works, I will try to generalize and apply the same approach to other file system ops.
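For reference, a minimal sketch of how this could be wired into the registry from the example above (the fs.latency value of 30 is chosen arbitrarily, and the argument is assumed to exist as described in the linked development version):

```r
## Sketch: tolerate file-system latency on the Torque backend
reg$cluster.functions <- makeClusterFunctionsTorque(
  template = "my.tmpl",
  fs.latency = 30  # let waitForJobs() also wait up to ~30s for result files
)
```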

HenrikBengtsson commented 7 years ago

Would it make sense to make fs.latency = 30 the default rather than zero? This would cover more users (one less thing to worry about). The downside is that users might never become aware of their NFS cache delay, but I am not sure that is a big problem (e.g., I doubt they would go and optimize the settings to lower it anyway).

HenrikBengtsson commented 7 years ago

BTW, I wonder whether there is a well-defined / documented system call "out there" that triggers a blocking update of the NFS cache. Maybe list.files() / dir() makes such a call, which would explain why it appears to force NFS to be updated (at least on my system).

HenrikBengtsson commented 7 years ago

Just reporting back. Trying what's on the master branch right now, I get:

> done <- waitForJobs(reg = reg)                                              
Syncing 3 files ...
> done
[1] TRUE
> y <- reduceResultsList(reg = reg)
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/henrik/registry/results/1.rds', probable reason 'No such file or directory'

which I consider an improvement, because it's better to get an error than an incorrect value (= NULL).

And, sure enough, I can refresh the NFS cache by calling:

> dir(file.path(reg$file.dir, "results"))
[1] "1.rds" "2.rds" "3.rds"

and after this, I can read the results:

> y <- reduceResultsList(reg = reg)
> str(y)
List of 3
 $ : num 1
 $ : num 2
 $ : num 3
mllg commented 7 years ago

I've merged a heuristic into master and the default timeout is 65s (NFS keeps the cache for up to 60s, so this hopefully works). It tries file.exists() first and then calls list.files() if not all files were found in the first attempt. This is disabled for Interactive and Multicore, as these backends do not involve multiple NFS hosts and a client should invalidate its own cache after writing to the directory. Let me know if you are also experiencing problems in these modes.
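The described two-step check could look roughly like this (a stand-alone sketch with a hypothetical helper name, not the actual batchtools internals):

```r
## Sketch of the two-step heuristic: trust file.exists() when it succeeds,
## otherwise fall back to list.files() to bypass the NFS attribute cache.
results_visible <- function(fns) {
  found <- file.exists(fns)
  if (!all(found)) {
    listed <- list.files(unique(dirname(fns)))
    found <- found | basename(fns) %in% listed
  }
  all(found)
}
```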

As you already noticed, I've changed the reduce functions in the master branch to throw an error if the result file is not found. Both changes put together should solve your issues on your system.

Sorry for bothering you with this stuff, but I cannot reproduce this on any of my systems. And thanks for the in-depth analysis.

HenrikBengtsson commented 7 years ago

No worries, I'm happy to help out getting this package as stable as possible.

So, good news. Today I put batchtools 0.9.1-9000 (commit 46e2bfee) through some serious real-world testing on our TORQUE / PBS cluster. I did this via an early version of future.batchtools, which internally utilizes the batchtools package. This allowed me to run the same system tests that I also run with sequential futures, plain futures (on top of the parallel package), and future.BatchJobs. Several of these tests also assert identical results (regardless of computation backend). The tests run for hours. I get all OK in these tests.

HenrikBengtsson commented 7 years ago

Feel free to close this issue whenever you're done.

mllg commented 7 years ago

Great! Re-open if you encounter any problems.