grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/
32 stars 13 forks source link

WISH: Handle cache race conditions (e.g. "Multiple cache results found.") #76

Closed HenrikBengtsson closed 1 year ago

HenrikBengtsson commented 1 year ago

Issue

When running biomaRt concurrently in multiple R processes, there's a risk that the race conditions causes the at least one of the R processes to fail with an error. The risk for this to happen increases for users with access to more compute resources, e.g. an HPC environment.

Wish

Make the caching mechanism robust against cache race conditions such as "Multiple cache results found." (see below for an example).

Example

When running parallel, reverse dependency checks via the revdepcheck package, I at times get false-positive errors like:

I believe this is because of a race condition where two parallel R CMD check processes check the epimutacions against the (i) release version and the (ii) development version of the package I'm revdep checking (here matrixStats). These false positives require manual inspection. Rerunning the revdep checks for these packages sequentially resolves the issue - a process that is tedious, time-consuming, and potentially error-prone.

grimbough commented 1 year ago

Thanks for the report, I think this is a more general case of the issue in #74

In your case I think it would be sufficient for the cache ID to include the version of biomaRt, since that's different between release and devel, however it's not generally true that would be ok. I'll have a think about how to make the caching more robust to this situation. Presumably re-testing if a cache hit exists just before writing the cache would be somewhat better, as it eliminates the lag between testing and waiting for the BioMart server to return a result, but it's not perfect.

HenrikBengtsson commented 1 year ago

I don't know how the file cache is written/updated, but one way to lower the risk for race conditions is to write the cache file "atomically". Gist:

file <- "~/.cache/R/foo.rds"
## Cache file already exists?
if (file_test("-f", file)) return()

## Write to a temporary file in the same folder (on the same mount/disk)
file_tmp <- tempfile(pattern = file, tmpdir = dirname(file), fileext = ".tmp")
stopfinot(!file_test("-f", file_tmp))  ## "should not" happen

## Writing to file may take a long time, e.g. large objects and slow file systems
saveRDS(object, file = file_tmp)
 ## assert file was written and we didn't run out of disk space
stopfinot(
  file_test("-f", file_tmp),
  file.size(file_tmp) > 0
)

## Atomically rename
## (quick because source and destination is on the same disk)
file.rename(file_tmp, file)
stopfinot(file_test("-f", file))

I use this trick in a lot of place after gotten bitten when running large-scale analyses on slow parallel file systems.

grimbough commented 1 year ago

Hi @HenrikBengtsson I made some modifications to biomaRt, which now writes files directly to the cache directory, but only adds them to the cache database only after a second check that a duplicate entry isn't already there. If an entry has now appeared it simply deletes what was just written and uses the existing data. Has this improved the situation for you? Should be available both from here and the current devel on Bioconductor.

HenrikBengtsson commented 1 year ago

Thanks for this. I only have a vague understanding on how BiocFileCache works. But, looking at your https://github.com/grimbough/biomaRt/commit/94a1f92bb10f18c442e6320d0c51f2a499817762, I think I now understand how it works. I think your update lowers the risk for race conditions, especially for large objects and slow file systems.

I'll keep keeping an eye out for this when I run my revdep checks and report back. I'll close this one for now, since it might be months before I have enough data points to say something useful, but I assume things will work better now.