neurogenomics / MSTExplorer

Multi-Scale Target Explorer systematically identifies, prioritises, and visualises cell-type-specific gene therapy targets across the phenome.
https://neurogenomics.github.io/MSTExplorer/

Parallelising in R efficiently #8

Closed bschilder closed 1 year ago

bschilder commented 1 year ago

Parallelising over groups in R (e.g. over each phenotype's gene list) can use a ton of memory, because every object used inside the loop can accidentally get copied into each worker, which multiplies your memory usage when working with large datasets.

gene_data <- HPOExplorer::load_phenotype_to_genes()
gene_data[,HPO_ID.LinkID:=paste(HPO_ID,LinkID,sep=".")]
nrow(gene_data) # 968,116
length(unique(gene_data$HPO_ID.LinkID)) #621,906

Even on large-memory machines like the Threadripper (250Gb), this quickly gets overloaded because each parallel process uses 17.6Gb of memory to store the entire gene_data object over and over again. 17.6Gb * 50 cores = 880Gb of memory! This means that parallelising chokes up the Threadripper's memory and slows it to a grinding halt, making processing the data even slower than if you had just single-threaded it. This is exactly what happened when I tried this version of the gen_overlap function: https://github.com/neurogenomics/MultiEWCE/blob/25d26a41096902607a4f595343f2f585dad9f819/R/gen_overlap.R#L49
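As a rough illustration of the pattern (a hypothetical sketch, not the actual gen_overlap code), the problem is that the full gene_data table is referenced inside the worker function, so each forked process ends up holding its own ~17.6Gb copy:

ids <- unique(gene_data$HPO_ID.LinkID)
overlap_list <- parallel::mclapply(ids, function(id){
  # Subsetting the *full* gene_data table inside the worker means every
  # process needs the whole object, multiplying memory usage by the core count.
  d <- gene_data[HPO_ID.LinkID == id]
  unique(d$Gene)
}, mc.cores = 50)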

A better way might be to split the data first into chunks, and then iterate over the chunks:

# split.data.table is not exported, so pull it from the data.table namespace
split.data.table <- utils::getFromNamespace("split.data.table", "data.table")
gene_data_split <- split.data.table(x = gene_data,
                                    by = "HPO_ID.LinkID")

Related posts

https://stackoverflow.com/questions/19082794/speed-up-data-table-group-by-using-multiple-cores-and-parallel-programming
https://stackoverflow.com/questions/14759905/data-table-and-parallel-computing

👇 Contains some useful benchmarking https://github.com/Rdatatable/data.table/issues/2223

Killing zombie processes

This happens when you launch some parallelised function in R and then decide to stop it midway. A bunch of "zombie" processes are left over.

Restarting the R session

Sometimes this works, other times not so much. Might not work when memory is so overloaded that you can't restart the R session.

via htop

Didn't seem to work.

via inline

Someone suggested this, but didn't seem to do anything for me on the Threadripper.

https://stackoverflow.com/questions/25388139/r-parallel-computing-and-zombie-processes
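For reference, the suggestion in that thread is to compile a small C helper with the inline package that reaps finished child processes via waitpid(). Roughly (paraphrased from the thread):

# Compile a tiny C function that loops over waitpid() to reap zombie children.
includes <- '#include <sys/wait.h>'
code <- 'int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};'
wait_for_zombies <- inline::cfunction(body = code,
                                      includes = includes,
                                      convention = '.C')
wait_for_zombies()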

via future

Using future instead of parallel might give me more control over this, but it has to be done before launching the processes. https://github.com/HenrikBengtsson/future/issues/93

via Docker

To kill them, the only effective way I've found is to restart the container:

docker stop <id>
docker start <id>

@Al-Murphy probably relevant to you too

bschilder commented 1 year ago

Splitting did not help, still caused a huge explosion of memory usage.

Found these tutorials on parallelising in R:
https://yxue-me.com/post/2019-05-12-a-glossary-of-parallel-computing-packages-in-r-2019/
https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html

future tutorial: https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html

Forking actually copies the entire R environment into each worker, not just the variables used inside the loop.
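A quick way to see the distinction with the future package: a forked (multicore) worker can see the entire parent environment, whereas a multisession worker is a fresh R session that only receives the globals future detects in the function. A minimal sketch:

library(future)
# Fork-based backend (Unix only): children inherit the whole parent R environment,
# including large objects the loop never touches.
plan(multicore, workers = 4)

# Separate background R sessions: only the globals future identifies in the
# function are exported to each worker.
plan(multisession, workers = 4)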

bschilder commented 1 year ago

Packages:

bschilder commented 1 year ago

furrr seems to have some useful methods, like future_map2, which has a built-in progress bar and works directly with future:
https://furrr.futureverse.org/
http://zevross.com/blog/2019/02/12/dramatically-speed-up-your-r-purrr-functions-with-the-furrr-package/

bschilder commented 1 year ago

Testing methods

1. parallel::mclapply

This is the least efficient method, as I reference the full gene_data_split object inside the loop, which increases the memory usage by ~1Gb every 2 seconds or so.

  overlap <- parallel::mclapply(seq_len(length(gene_data_split)),
                                function(i){
    list_name <- names(gene_data_split)[[i]]
    d <- gene_data_split[[i]]
    gen_overlap_test(ct_genes = ct_genes,
                     list_name = list_name,
                     dgenes = unique(d$Gene),
                     long_format = long_format,
                     bg = bg,
                     verbose = FALSE)
  }, mc.cores = cores) |>
    data.table::rbindlist(use.names = TRUE,
                          idcol = list_name_column)

2. furrr::future_map2

This is more efficient due to not passing gene_data_split into the loop. But memory usage is still increasing by ~.3Gb every 2 seconds or so.

 future::plan(future::multisession, workers = cores)
  overlap <- furrr::future_map2(.x = gene_data_split,
                     .y = names(gene_data_split),
                     .progress = TRUE,
                     .f = function(.x,.y){
                       gen_overlap_test(ct_genes = ct_genes,
                                        list_name = .y,
                                        dgenes = unique(.x$Gene),
                                        long_format = long_format,
                                        bg = bg,
                                        verbose = FALSE)
                     })|>
    data.table::rbindlist(use.names = TRUE,
                          idcol = list_name_column) 

3. furrr::future_map

Only passing in .x, and not .y, so it is a bit more efficient than future_map2. Increases memory usage by ~.1-.3Gb every 2 seconds.

 future::plan(future::multisession, workers = cores)
  overlap <- furrr::future_map(.x = gene_data_split,
                     .progress = TRUE,
                     .f = function(.x){
                       gen_overlap_test(ct_genes = ct_genes,
                                        dgenes = unique(.x$Gene),
                                        long_format = long_format,
                                        bg = bg,
                                        verbose = FALSE)
                     })|>
    data.table::rbindlist(use.names = TRUE,
                          idcol = list_name_column)

Further improvements in efficiency were achieved by reducing the size of the ct_genes object (only storing gene names, not quantiles) and by not storing the union gene set in the returned results object. This gets memory usage to grow at only ~.1Gb every 5-14 seconds. Much better! At this rate, I might actually be able to finish all iterations before memory maxes out at >252Gb. I also noticed this rate tends to get a bit slower as time goes on.
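For illustration only (the real ct_genes structure may differ): assuming ct_genes is a named list of per-cell-type vectors whose names are gene symbols and whose values are specificity quantiles, the slimming step might look like:

# Hypothetical sketch: keep only the gene names per cell type, dropping the
# quantile values, so the object each worker carries is much smaller.
ct_genes_slim <- lapply(ct_genes, names)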

4. data.table

A non-parallel option that is still very fast and does not consume increasing amounts of memory over time. It takes advantage of by= and performs the grouped function calls within the data.table object, which is more efficient than writing an lapply across a list of chunks.

overlap <- gene_data[,gen_overlap_test(ct_genes = ct_genes,
                                       list_name = get(list_name_column),
                                       dgenes = Gene,
                                       long_format = long_format,
                                       bg = bg),
                     by=c(list_name_column)]

Parallelising grouped operations within a data.table has been discussed as well:
https://github.com/Rdatatable/data.table/issues/5376
https://github.com/Rdatatable/data.table/issues/5077
https://github.com/Rdatatable/data.table/issues/4284
https://github.com/Rdatatable/data.table/issues/4077

It's also worth noting that you can use keyby instead of by. This seems to have more to do with how the table is ordered afterwards (via setting the key) than with any performance differences when doing grouped operations:
https://stackoverflow.com/questions/61755966/data-table-summarizing-data-difference-between-by-and-keyby
https://github.com/Rdatatable/data.table/issues/5602
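A tiny illustration of the difference:

library(data.table)
dt <- data.table(g = c("b", "a", "b", "a"), x = 1:4)
dt[, .(total = sum(x)), by = g]    # groups returned in order of first appearance ("b", then "a")
dt[, .(total = sum(x)), keyby = g] # result sorted by g and keyed on it ("a", then "b")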

Killing zombies

Using future-based packages like furrr opens up the possibility of freeing up the zombified cores by simply running the commands below. This means I don't have to restart my entire R session each time!

fut <- future::plan(future::multisession, workers = cores)

future::resetWorkers(fut)

This even works if I forgot to save the plan as a variable

future::plan(future::multisession, workers = cores)

future::resetWorkers(future::plan(future::multisession, workers = cores))

Shared memory

Another thing to keep in mind here is that the Threadripper is different from other kinds of clusters in that it is a shared-memory resource: all workers draw from the same pool of RAM, so per-worker copies add up quickly and more care needs to be taken to avoid overloading it.
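One practical consequence, as a back-of-the-envelope sketch using the numbers from earlier in this thread: the worker count has to be budgeted against the shared RAM pool, not just the number of available cores.

total_mem_gb  <- 250    # Threadripper RAM
per_worker_gb <- 17.6   # observed footprint of one worker holding gene_data
safe_workers  <- max(1L, floor(total_mem_gb / per_worker_gb))  # ~14 workers, not 50
future::plan(future::multisession, workers = min(cores, safe_workers))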