richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr

Parallel processing for cache-wide methods #59

Open wlandau-lilly opened 6 years ago

wlandau-lilly commented 6 years ago

For cache-wide methods such as $clear() and especially $gc(), it would be handy to have some low-overhead mclapply()-powered parallel processing. I am sure @kendonB would appreciate this too.

x <- storr_rds("my_storr")
# ... cache accrues lots of files ...
x$gc(workers = 8) # parallelize over 8 forked processes

You would need to fall back to workers = 1 on Windows, since mclapply() relies on forking, but I think it is still worth it. parLapply() is platform-independent, but I personally do not like the overhead of setting up a cluster.
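For concreteness, here is a rough sketch of what the cluster-based route would look like (the helper name is made up); the makeCluster()/stopCluster() round trip is the overhead I mean:

library(parallel)

# Sketch of the platform-independent route via a PSOCK cluster.
# Spinning up and tearing down the worker sessions is the overhead.
parallel_over_keys <- function(keys, FUN, workers = 1) {
  cl <- makeCluster(workers) # launches fresh R sessions, even on Windows
  on.exit(stopCluster(cl))
  parLapply(cl, keys, FUN)
}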

In drake, I use an internal lightly_parallelize() function quite a lot.

library(parallel) # mclapply()
library(magrittr) # %>%

# Apply FUN over X with up to `jobs` forked workers.
lightly_parallelize <- function(X, FUN, jobs = 1, ...) {
  jobs <- safe_jobs(jobs)
  if (is.atomic(X)) {
    lightly_parallelize_atomic(X = X, FUN = FUN, jobs = jobs, ...)
  } else {
    mclapply(X = X, FUN = FUN, mc.cores = jobs, ...)
  }
}

# For atomic vectors, call FUN once per unique element,
# then expand the results back to the original positions.
lightly_parallelize_atomic <- function(X, FUN, jobs = 1, ...) {
  jobs <- safe_jobs(jobs)
  keys <- unique(X)
  index <- match(X, keys)
  values <- mclapply(X = keys, FUN = FUN, mc.cores = jobs, ...)
  values[index]
}

# mclapply() forks, which Windows does not support,
# so fall back to a single process there.
safe_jobs <- function(jobs) {
  if (on_windows()) 1 else jobs
}

on_windows <- function() {
  this_os() == "windows"
}

this_os <- function() {
  Sys.info()["sysname"] %>%
    tolower() %>%
    unname()
}
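
To illustrate how a cache-wide method might use it (a hypothetical sketch; gc() and list() do not take a jobs argument today, and get_hash() stands in for whatever per-key work the method does):

# Hypothetical usage: map a per-key operation over every key
# in the cache with 8 forked workers.
x <- storr_rds("my_storr")
hashes <- lightly_parallelize(
  X = x$list(),
  FUN = function(key) x$get_hash(key),
  jobs = 8
)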
wlandau-lilly commented 6 years ago

I forgot: $list() is an important one too.

richfitz commented 6 years ago

I believe that disk I/O is the bottleneck for most of these, and I'd be shocked if process-level parallelism could speed that up.
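If someone wants to test that on their own filesystem, something like the following would do it (assuming the rds backend, whose object files I take to live under the storr's data/ directory; mclapply() forks, so the comparison is Unix-only):

# Time serial vs. forked reads over the cache's object files.
library(parallel)
files <- list.files("my_storr/data", full.names = TRUE)
system.time(lapply(files, readRDS))                  # serial
system.time(mclapply(files, readRDS, mc.cores = 8))  # 8 forked workers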

kendonB commented 6 years ago

The GPFS file system I'm on seems to show speed benefits for I/O-heavy jobs with up to around 100 workers.

Even personal hard drives have more than one read/write point, I thought?