r-lib / memoise

Easy memoisation for R
https://memoise.r-lib.org

usage with multiple threads? #29

davharris commented 7 years ago

This looks like a great package. It's saving me and my collaborators a lot of unnecessary computation time.

I was wondering about how the package would perform if a memoized function were running in parallel on several threads, especially with caches stored on the filesystem. Given that the hashes are deterministic, it doesn't seem like there would be a problem, but I didn't see anything specifically about it in the documentation, so I thought it would be good to ask.

Thanks in advance!

jimhester commented 7 years ago

There is currently no support for this. In particular, two processes could write to the same cache file simultaneously and produce a corrupted file — e.g. if both processes called the memoised function with the same arguments at the same time. Avoiding this would require some sort of file locking, maybe with the flock package, although that package is not on CRAN and would need to be tested on Windows. It would also likely need its own cache type, since file locking is only necessary for multi-process code.
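As a rough illustration of the kind of locking needed, here is a base-R sketch that exploits the fact that dir.create() is atomic on most filesystems. The with_lock() helper is hypothetical — this is only a sketch of the idea, not a substitute for a real locking package:

```r
# Crude cooperative lock using atomic dir.create() (base R only).
# A real solution should use a dedicated file-locking package.
with_lock <- function(lockdir, expr) {
  # spin until we are the process that creates the lock directory
  while (!dir.create(lockdir, showWarnings = FALSE)) Sys.sleep(0.05)
  on.exit(unlink(lockdir, recursive = TRUE))  # release even on error
  force(expr)  # evaluate the guarded expression while holding the lock
}

cache_file <- tempfile(fileext = ".rds")
with_lock(file.path(tempdir(), "memoise.lock"),
          saveRDS(1:10, cache_file))
readRDS(cache_file)
```

Note that this spin-wait approach is wasteful under contention and leaves a stale lock behind if the process is killed; it only shows where the critical section around the cache write would sit.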

davharris commented 7 years ago

Thanks for the quick response!

chochkov commented 7 years ago

A possible implementation could be to maintain a cache per worker, using the process ID as a prefix to the filename. That way, repeated calls within a worker would indeed be sped up.
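A minimal sketch of this per-worker idea, assuming memoise's cache_filesystem() and run inside each worker (slow_square is a stand-in for a real computation):

```r
library(memoise)

# Each process gets a private on-disk cache directory keyed by its PID,
# so no two processes ever write to the same cache file.
worker_cache <- cache_filesystem(
  file.path(tempdir(), paste0("memoise-cache-", Sys.getpid()))
)

slow_square <- function(x) { Sys.sleep(0.2); x^2 }
slow_square_m <- memoise(slow_square, cache = worker_cache)

slow_square_m(4)  # computed (~0.2 s)
slow_square_m(4)  # served from this worker's private cache
```

The trade-off is that workers cannot share results with each other, but cache files can never collide.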

I mostly have a parLapply workflow, however, which means there is nothing to be sped up inside the workers. So it already helps me a lot to wrap the whole computation in a memoised function, roughly like this:

library(parallel)
library(memoise)

fn <- function() {
  cl <- makeCluster(detectCores(), outfile = '')
  tryCatch({
    parLapply(cl, objects, FUN = my.slow.computation)
  }, finally = stopCluster(cl))
}
fn <- memoise(fn)
fn()

Perhaps that might help in other workflows too.

hadley commented 7 years ago

We could use @gaborcsardi's new file locking package.
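Assuming the package referred to is filelock (this is a guess; the comment does not name it), a sketch of guarding shared cache access would look like:

```r
library(filelock)

lock_path <- file.path(tempdir(), "memoise.lock")

# Acquire an exclusive lock, waiting up to 10 s; lock() returns NULL on timeout.
lck <- lock(lock_path, timeout = 10000)
if (!is.null(lck)) {
  # ... safely read from or write to the shared cache here ...
  unlock(lck)
}
```

filelock uses native OS locks, so it does not leave stale locks behind if a process dies while holding one.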

privefl commented 7 years ago

The package flock is on CRAN and it works really well.

npatwa commented 6 years ago

Just to add to the ideas already given: I acquire the lock on the cache file just before the return statement in my memoised function, and release the lock immediately after the call to the memoised function in the calling environment.
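A rough sketch of this pattern, assuming the CRAN flock package (lock()/unlock()) and a filesystem cache; compute is a stand-in for the real function:

```r
library(flock)
library(memoise)

lock_path <- file.path(tempdir(), "cache.lock")
cache_lock <- NULL

compute <- function(x) {
  res <- x * 2
  # Take the lock just before returning, so the cache write that
  # memoise performs after the call happens while the lock is held.
  cache_lock <<- flock::lock(lock_path)
  res
}
compute_m <- memoise(compute, cache = cache_filesystem(tempdir()))

out <- compute_m(21)
flock::unlock(cache_lock)  # released right after the memoised call
```

This keeps the critical section small, at the cost of smuggling the lock out through a shared variable; a helper that wraps the whole memoised call in lock/unlock would be tidier.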

drag05 commented 5 months ago

My understanding is that flock works with files on disk, while the memoise function itself has caching options, so it can be set to write its cache to disk from any (parallel) process.

Is there a way to create new subprocesses inside parallel processes that can be used by memoise only?

From the flock documentation, it is still unclear to me what "process synchronization" refers to, and why parallel processes would need "synchronization".

Thank you!