Splitting did not help; it still caused a huge explosion of memory usage.
Found these tutorials on parallelising in R:
https://yxue-me.com/post/2019-05-12-a-glossary-of-parallel-computing-packages-in-r-2019/
https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html
`future` tutorial:
https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html
Forking actually copies the entire R environment, not just the variables inside each loop.
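A minimal standalone illustration of this (the object and sizes here are made up for demonstration, not from the package): under forking, an object created in the parent session is visible inside every worker without ever being passed as an argument.

```r
library(parallel)

# big_obj only exists in the parent R session.
big_obj <- rnorm(1e7)

# With forking (Unix-only), each child process inherits the whole parent
# environment, so big_obj is accessible inside the function even though it
# is never passed in explicitly.
res <- parallel::mclapply(1:4,
                          function(i) length(big_obj),
                          mc.cores = 2)
```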
Packages:
`furrr` seems to have some useful methods, like `future_map2`, which has a built-in progress bar and works directly with `future`:
https://furrr.futureverse.org/
http://zevross.com/blog/2019/02/12/dramatically-speed-up-your-r-purrr-functions-with-the-furrr-package/
`parallel::mclapply`
This is the least efficient method, as I include `gene_data_split` inside the loop, which increases the memory usage by ~1Gb every 2 seconds or so.
```r
overlap <- parallel::mclapply(seq_len(length(gene_data_split)),
                              function(i){
  list_name <- names(gene_data_split)[[i]]
  d <- gene_data_split[[i]]
  gen_overlap_test(ct_genes = ct_genes,
                   list_name = list_name,
                   dgenes = unique(d$Gene),
                   long_format = long_format,
                   bg = bg,
                   verbose = FALSE)
}, mc.cores = cores) |>
  data.table::rbindlist(use.names = TRUE,
                        idcol = list_name_column)
```
`furrr::future_map2`
This is more efficient because `gene_data_split` is not passed into the loop, but memory usage still increases by ~0.3Gb every 2 seconds or so.
```r
future::plan(future::multisession, workers = cores)

overlap <- furrr::future_map2(.x = gene_data_split,
                              .y = names(gene_data_split),
                              .progress = TRUE,
                              .f = function(.x, .y){
  gen_overlap_test(ct_genes = ct_genes,
                   list_name = .y,
                   dgenes = unique(.x$Gene),
                   long_format = long_format,
                   bg = bg,
                   verbose = FALSE)
}) |>
  data.table::rbindlist(use.names = TRUE,
                        idcol = list_name_column)
```
`furrr::future_map`
Only passing in `.x`, and not `.y`, so this is a bit more efficient than `future_map2`. Memory usage increases by ~0.1-0.3Gb per 2 seconds.
```r
future::plan(future::multisession, workers = cores)

overlap <- furrr::future_map(.x = gene_data_split,
                             .progress = TRUE,
                             .f = function(.x){
  gen_overlap_test(ct_genes = ct_genes,
                   dgenes = unique(.x$Gene),
                   long_format = long_format,
                   bg = bg,
                   verbose = FALSE)
}) |>
  data.table::rbindlist(use.names = TRUE,
                        idcol = list_name_column)
```
Further improvements in efficiency were achieved by reducing the size of the `ct_genes` object (only storing gene names, not quantiles) and by not storing the union gene set in the returned results object.
This gets the memory usage to grow at only ~0.1Gb every 5-14 seconds. Much better! At this rate, I might actually be able to finish all iterations before memory maxes out at >252Gb. I also noticed this rate tends to get a bit slower as time goes on.
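A rough sketch of the first of those two changes (this assumes `ct_genes` is a named list with a Gene column per cell type, which may not match the real object exactly):

```r
# Assumed structure: ct_genes is a named list (one element per cell type),
# each with a Gene column plus quantile/specificity columns.
# Keeping only the gene names shrinks what each worker has to hold in memory.
ct_genes_slim <- lapply(ct_genes, function(ct) unique(ct$Gene))
```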
`data.table`
A non-parallel option that is still very fast and does not consume increasing amounts of memory over time. It takes advantage of the `by` argument and performs the grouped function calls within the object, which is more efficient than writing an `lapply` across a list of chunks.
```r
overlap <- gene_data[, gen_overlap_test(ct_genes = ct_genes,
                                        list_name = get(list_name_column),
                                        dgenes = Gene,
                                        long_format = long_format,
                                        bg = bg),
                     by = c(list_name_column)]
```
Parallelising grouped operations within a `data.table` has been discussed as well:
https://github.com/Rdatatable/data.table/issues/5376
https://github.com/Rdatatable/data.table/issues/5077
https://github.com/Rdatatable/data.table/issues/4284
https://github.com/Rdatatable/data.table/issues/4077
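Worth noting (my understanding, not from those threads): `data.table` already multi-threads its own internal C-level operations, controlled via `setDTthreads()`, but grouped calls to arbitrary R functions like `gen_overlap_test()` above still evaluate sequentially, which is what those issues are requesting.

```r
library(data.table)

# Threads used for data.table's internal operations (not for user-supplied
# functions evaluated per group).
getDTthreads(verbose = TRUE)
setDTthreads(threads = cores)
```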
It's also worth noting that you can use `keyby` instead of `by`. This seems to have more to do with how the table is ordered afterwards (via setting the key) than any performance differences when doing grouped operations:
https://stackoverflow.com/questions/61755966/data-table-summarizing-data-difference-between-by-and-keyby
https://github.com/Rdatatable/data.table/issues/5602
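A tiny toy example (not project data) of the difference:

```r
library(data.table)

dt <- data.table(group = c("b", "a", "b", "a"), x = 1:4)

# by: groups appear in order of first appearance; no key is set on the result.
dt[, sum(x), by = group]

# keyby: the result is sorted by the grouping column and keyed on it.
dt[, sum(x), keyby = group]
```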
Using `future`-based packages like `furrr` opens up the possibility of freeing the zombified cores by simply running the commands below. This means I don't have to restart my entire R session each time!
```r
fut <- future::plan(future::multisession, workers = cores)
future::resetWorkers(fut)
```
This even works if I forgot to save the plan as a variable:

```r
future::plan(future::multisession, workers = cores)
future::resetWorkers(future::plan(future::multisession, workers = cores))
```
Another thing to keep in mind here is that the Threadripper is different from other kinds of clusters in that it is a shared-memory resource: all workers draw from the same memory pool, so more care needs to be taken to avoid them exhausting it.
Parallelising over groups in R (e.g. each phenotype's gene list) can use a ton of memory, because you can accidentally copy all objects used inside the loop, which multiplies your memory usage when working with large datasets.
Even on large-memory machines like the Threadripper (250Gb), this quickly gets overloaded because each parallel process uses 17.6Gb of memory to store the entire `gene_data` object over and over again: 17.6Gb * 50 cores = ~880Gb of memory! This means that parallelising chokes up the Threadripper's memory and slows it to a grinding halt, making processing even slower than if you had just single-threaded it. This is exactly what happened when I tried this version of the `gen_overlap` function:
https://github.com/neurogenomics/MultiEWCE/blob/25d26a41096902607a4f595343f2f585dad9f819/R/gen_overlap.R#L49
A better way might be to split the data first into chunks, and then iterate over the chunks, as sketched below:
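A minimal sketch of that split-then-iterate pattern (assuming `gene_data` is a data.table whose grouping column name is stored in `list_name_column`, with the other arguments as used in the snippets above):

```r
# Split once in the parent session so each iteration only touches one small
# chunk rather than the full gene_data object.
gene_data_split <- split(gene_data, by = list_name_column)

overlap <- lapply(gene_data_split, function(d){
  gen_overlap_test(ct_genes = ct_genes,
                   dgenes = unique(d$Gene),
                   long_format = long_format,
                   bg = bg,
                   verbose = FALSE)
}) |>
  data.table::rbindlist(use.names = TRUE,
                        idcol = list_name_column)
```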
Related posts
https://stackoverflow.com/questions/19082794/speed-up-data-table-group-by-using-multiple-cores-and-parallel-programming
https://stackoverflow.com/questions/14759905/data-table-and-parallel-computing
👇 Contains some useful benchmarking:
https://github.com/Rdatatable/data.table/issues/2223
Killing zombie processes
This happens when you launch some parallelised function in R and then decide to stop it midway. A bunch of "zombie" processes are left over.
Restarting the R session
Sometimes this works, other times not so much. Might not work when memory is so overloaded that you can't restart the R session.
via htop
Didn't seem to work.
Via `inline`
Someone suggested this, but it didn't seem to do anything for me on the Threadripper:
https://stackoverflow.com/questions/25388139/r-parallel-computing-and-zombie-processes
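For reference, the suggestion in that thread is roughly the following (a sketch, untested on my end): compile a tiny C helper via the `inline` package that reaps any finished child processes.

```r
library(inline)

# waitpid(-1, ..., WNOHANG) reaps any zombie child processes left behind
# by forked workers, without blocking if none are finished.
includes <- "#include <sys/wait.h>"
code <- "int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};"

wait_zombies <- inline::cfunction(body = code,
                                  includes = includes,
                                  convention = ".C")
wait_zombies()
```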
via `future`
Using `future` instead of `parallel` might give me more control over this, but it has to be done before launching the processes:
https://github.com/HenrikBengtsson/future/issues/93
via Docker
To kill them, the only effective way I've found is to restart the container:
@Al-Murphy probably relevant to you too