Closed HenrikBengtsson closed 1 year ago
Thanks for the report; I think this is a more general case of the issue in #74
In your case I think it would be sufficient for the cache ID to include the version of biomaRt, since that differs between release and devel; however, that isn't generally guaranteed to be enough. I'll have a think about how to make the caching more robust to this situation. Presumably re-testing whether a cache hit exists just before writing the cache would be somewhat better, since it eliminates the window between the initial test and waiting for the BioMart server to return a result, but it's not perfect.
I don't know how the file cache is written/updated, but one way to lower the risk of race conditions is to write the cache file "atomically". Gist:
file <- "~/.cache/R/foo.rds"
## Cache file already exists? (assumes this runs inside a function)
if (utils::file_test("-f", file)) return(invisible(file))
## Write to a temporary file in the same folder (on the same mount/disk)
file_tmp <- tempfile(pattern = basename(file), tmpdir = dirname(file), fileext = ".tmp")
stopifnot(!utils::file_test("-f", file_tmp)) ## "should not" happen
## Writing to file may take a long time, e.g. large objects and slow file systems
saveRDS(object, file = file_tmp)
## Assert the file was written and we didn't run out of disk space
stopifnot(
  utils::file_test("-f", file_tmp),
  file.size(file_tmp) > 0
)
## Atomically rename
## (quick because source and destination are on the same disk)
stopifnot(file.rename(file_tmp, file))
stopifnot(utils::file_test("-f", file))
I use this trick in a lot of places, after getting bitten when running large-scale analyses on slow parallel file systems.
Hi @HenrikBengtsson I made some modifications to biomaRt, which now writes files directly to the cache directory, but adds them to the cache database only after a second check that a duplicate entry isn't already there. If an entry has appeared in the meantime, it simply deletes what was just written and uses the existing data. Has this improved the situation for you? Should be available both from here and the current devel on Bioconductor.
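For readers following along, the "write first, re-check before registering" approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual biomaRt code: `query_biomart()`, `cache_db_lookup()`, and `cache_db_add()` are hypothetical helpers standing in for the BioMart query and the BiocFileCache database operations.

```r
## Sketch: write the result first, then re-check the cache database before
## registering it, so a concurrent process's entry wins (hypothetical helpers).
fetch_with_cache <- function(key, cache_dir) {
  hit <- cache_db_lookup(key)          ## first check of the cache database
  if (!is.null(hit)) return(readRDS(hit))

  result <- query_biomart(key)         ## slow: talks to the BioMart server

  ## Write the result directly into the cache directory
  file <- file.path(cache_dir, paste0(key, ".rds"))
  saveRDS(result, file = file)

  ## Second check: another process may have cached this key in the meantime
  hit <- cache_db_lookup(key)
  if (!is.null(hit)) {
    file.remove(file)                  ## discard our copy, use the existing entry
    return(readRDS(hit))
  }
  cache_db_add(key, file)              ## register our file in the cache database
  result
}
```

The second lookup shrinks the race window from the full duration of the BioMart query down to the short interval between the re-check and the database insert, which matches the "lowers the risk, doesn't eliminate it" assessment in the discussion.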
Thanks for this. I only have a vague understanding of how BiocFileCache works, but looking at your https://github.com/grimbough/biomaRt/commit/94a1f92bb10f18c442e6320d0c51f2a499817762, I think I now understand the approach. I think your update lowers the risk of race conditions, especially for large objects and slow file systems.
I'll keep an eye out for this when I run my revdep checks and report back. I'll close this one for now, since it might be months before I have enough data points to say something useful, but I assume things will work better now.
Issue
When running biomaRt concurrently in multiple R processes, there's a risk that race conditions cause at least one of the R processes to fail with an error. The risk of this happening increases for users with access to more compute resources, e.g. an HPC environment.
Wish
Make the caching mechanism robust against cache race conditions such as "Multiple cache results found." (see below for an example).
Example
When running parallel reverse-dependency checks via the revdepcheck package, I at times get false-positive errors like:
checking re-building of vignette outputs ... ERROR
I believe this is because of a race condition where two parallel R CMD check processes check the epimutacions package against (i) the release version and (ii) the development version of the package I'm revdep checking (here matrixStats). These false positives require manual inspection. Rerunning the revdep checks for these packages sequentially resolves the issue - a process that is tedious, time-consuming, and potentially error-prone.