Henrik,
Thanks for the detailed report. I'll change the result functions to act more consistently in the next few days.
Regarding the file system latency: I guess you are hitting the NFS attribute cache. Are you only experiencing problems for the results? If somehow possible, can you check whether you get reliable results with one of:
file.exists = function(x) { isdir = file.info(x)$isdir; !is.na(isdir) & !isdir }
file.exists = function(x) basename(x) %in% list.files(dirname(x))
?
Using:
file_exists_1 <- function(x) { isdir <- file.info(x)$isdir; !is.na(isdir) & !isdir }
file_exists_2 <- function(x) { basename(x) %in% list.files(dirname(x)) }
on the first result file:
pathname <- file.path(reg$file.dir, "results", "1.rds")
I get:
List of 6
$ count : int 1
$ file.exists : logi FALSE
$ file_exists_1: logi FALSE
$ file_exists_2: logi TRUE
$ dt :Class 'difftime' atomic [1:1] 0.0138
.. ..- attr(*, "units")= chr "secs"
$ y :List of 3
..$ : NULL
..$ : NULL
..$ : NULL
List of 6
$ count : int 2
$ file.exists : logi TRUE
$ file_exists_1: logi TRUE
$ file_exists_2: logi TRUE
$ dt :Class 'difftime' atomic [1:1] 0.165
.. ..- attr(*, "units")= chr "secs"
$ y :List of 3
..$ : int 1
..$ : int 2
..$ : int 3
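For reference, a polling loop along the following lines produces output of the shape shown above (a sketch only; the sleep interval and the stopping rule are illustrative, not the exact test code):
poll_result <- function(pathname, reg, interval = 5) {
  count <- 0L
  repeat {
    count <- count + 1L
    t0 <- Sys.time()
    y <- reduceResultsList(reg = reg)  # may return NULL elements while result files are invisible
    dt <- Sys.time() - t0
    str(list(
      count         = count,
      file.exists   = file.exists(pathname),
      file_exists_1 = file_exists_1(pathname),
      file_exists_2 = file_exists_2(pathname),
      dt            = dt,
      y             = y
    ))
    if (!any(vapply(y, is.null, logical(1L)))) break
    Sys.sleep(interval)
  }
}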
I suspect that the first call to file_exists_2(pathname) triggers the NFS cache to be updated. This is in line with my previous comment, where I noticed that dir(path = file.path(reg$file.dir, "results")) seems to fix it.
I've started with a simple waitForFiles() function here. You can set fs.latency in makeClusterFunctions* to a positive number (e.g., 30) to let waitForJobs() also wait for the results.
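A hedged sketch of how this could be used (the template name is an assumption; check your batchtools version for the exact signature of makeClusterFunctionsTorque()):
reg$cluster.functions <- makeClusterFunctionsTorque(
  template   = "torque",  # assumed template name
  fs.latency = 30         # tolerate up to 30 seconds of NFS delay for result files
)
submitJobs(reg = reg)
waitForJobs(reg = reg)    # with fs.latency > 0, this also waits for the result files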
If the approach works, I will try to generalize and apply the same approach to other file system ops.
Would it make sense to make fs.latency = 30 the default rather than zero? This would cover more users (one less thing to worry about). The downside is that users might not be, or become, aware of their NFS cache delay, but I'm not sure that's a big problem (e.g. I doubt they would go and optimize the settings to lower it anyway).
BTW, I wonder if there could be a well-defined / documented system call "out there" that triggers a blocking update of the NFS cache? Maybe list.files() / dir() calls this, which is why it appears to force NFS to be updated (at least on my system).
Just reporting back. Trying what's on the master branch right now, I get:
> done <- waitForJobs(reg = reg)
Syncing 3 files ...
> done
[1] TRUE
> y <- reduceResultsList(reg = reg)
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file '/home/henrik/registry/results/1.rds', probable reason 'No such file or directory'
which I consider an improvement, because it's better to get an error than an incorrect value (= NULL).
And, sure enough, I can refresh the NFS cache by calling:
> dir(file.path(reg$file.dir, "results"))
[1] "1.rds" "2.rds" "3.rds"
and after this, I can read the results:
> y <- reduceResultsList(reg = reg)
> str(y)
List of 3
$ : num 1
$ : num 2
$ : num 3
I've merged a heuristic into master and the default timeout is 65s (NFS keeps the cache for up to 60s, so this hopefully works). I try a file.exists() first and then call list.files() if not all files were found in the first attempt. This is disabled for Interactive and Multicore as these backends do not involve multiple NFS hosts, and a client should invalidate its own cache after writing to the directory. Let me know if you are also experiencing problems in these modes.
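A minimal sketch of that heuristic (illustrative only, not the actual batchtools implementation):
wait_for_files <- function(path, fns, timeout = 65, sleep = 1) {
  deadline <- Sys.time() + timeout
  repeat {
    # fast path: trust the attribute cache if it already reports all files
    if (all(file.exists(file.path(path, fns))))
      return(TRUE)
    # fall back to listing the directory, which refreshes the client's view
    if (all(fns %in% list.files(path)))
      return(TRUE)
    if (Sys.time() > deadline)
      return(FALSE)
    Sys.sleep(sleep)
  }
}
# e.g. wait_for_files(file.path(reg$file.dir, "results"), c("1.rds", "2.rds", "3.rds"))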
As you already noticed, I've changed the reduce functions in the master branch to throw an error if the result file is not found. Both changes put together should solve your issues on your system.
Sorry for bothering you with this stuff, but I cannot reproduce the problem on any of my systems. And thanks for the in-depth analysis.
No worries, I'm happy to help out getting this package as stable as possible.
So, good news. Today I put batchtools 0.9.1-9000 (commit 46e2bfee) through some serious real-world testing on our TORQUE / PBS cluster. I did this via an early version of future.batchtools, which internally utilizes the batchtools package. This allowed me to run the same system tests that I also run with sequential futures, plain futures (on top of the parallel package), and future.BatchJobs. Several of these tests also assert identical results regardless of computation backend. The tests run for hours. I get all OK in these tests.
Feel free to close this issue whenever you're done.
Great! Re-open if you encounter any problems.
Issue
Using makeClusterFunctionsTorque() on a TORQUE compute cluster, reduceResultsList() at first returns a list of NULL elements but a bit later a list of the actual values. This happens even after waitForJobs() returns TRUE. I suspect this is due to the infamous NFS delay and the existence of the result files not being polled.
Example
Sourcing a test.R script along the following lines reproduces the behavior described above.
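A minimal sketch of such a script (the exact original script is not reproduced here; the registry path and the template name are assumptions):
library(batchtools)
reg <- makeRegistry(file.dir = "registry")
reg$cluster.functions <- makeClusterFunctionsTorque(template = "torque")  # assumed template
ids <- batchMap(function(x) x, x = 1:3, reg = reg)
submitJobs(ids, reg = reg)
done <- waitForJobs(reg = reg)     # returns TRUE ...
y <- reduceResultsList(reg = reg)  # ... yet this may initially return a list of NULLs
str(y)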
Troubleshooting / Suggestion
It looks like batchtools:::.reduceResultsList(), which is called by reduceResultsList(), silently skips reading any result files for which file.exists(fns) returns FALSE.
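In other words, the observed NULLs are what one would get from a reader that only loads the files that currently pass a file.exists() check (a hedged sketch, not the actual batchtools code):
reduce_results_sketch <- function(fns) {
  results <- vector("list", length(fns))  # one slot per expected result file
  found <- file.exists(fns)
  results[found] <- lapply(fns[found], readRDS)
  results  # slots for files not (yet) visible over NFS remain NULL
}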
If that's indeed the intended behavior, then I cannot tell whether the bug is in reduceResultsList(), for assuming all the result files are there when batchtools:::.reduceResultsList() is called, or in waitForJobs(), which should not return until all the result files are there.
As my example code shows, I suspect that there's a delay in the network file system (NFS) causing the already written result files to not be visible from the master machine until several seconds later. The file.exists() output above suggests this.
FYI, it looks like calling dir(path = file.path(reg$file.dir, "results")) forces NFS to sync its view such that file.exists() returns TRUE. However, I don't know whether that is a bullet-proof solution.
To me it seems like batchtools (waitForJobs() or something) needs to poll for the result files before they are queried.
LATE UPDATES:
- batchtools:::.loadResult(), and hence loadResult(), gives an error if the file is not there.
- help("reduceResultsList") documents that the "otherwise NULL" behavior is expected from that function.
- reduceResults() will try to read the result files directly, so if they're not there an error will be generated. PS / feedback: the different default / error behavior compared to reduceResultsList() is a bit confusing given the similarity in names.
Session information
This is on a Scyld cluster with TORQUE / PBS + Moab.