rcastelo / GSVA

Gene set variation analysis
GSVA() parallelization over multiple nodes #91

Open YouriTasse opened 1 year ago

YouriTasse commented 1 year ago

Hi !

In the gsva() function, we can use the parallel.sz argument to parallelize the task over multiple cores. This works really well, but it limits you to the number of cores available on your computer or on a single node (if you run the function on a cluster).

On the cluster I'm using, each node is limited to 40 physical cores. With parallel.sz, I can divide the 100 tasks among 38 workers, so the work is processed in 3 "batches" (i.e. 1-38, 39-76, 77-100). But each batch runs for 4-5 hours, so the whole process takes 12 to 14 hours.

I could cut the total running time to roughly a third, about 4-5 hours, if I could spread the tasks over 100 workers. To do this, I would need BatchtoolsParam() from BiocParallel instead of the MulticoreParam() backend that is already implemented. I know it is already possible to pass a BatchtoolsParam() object through the BPPARAM argument of the gsva() function, but I simply can't make it work.
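The batch arithmetic above can be checked with a few lines of R. This is only a back-of-the-envelope sketch: the variable names are illustrative, and it assumes each round of tasks takes the same 4-5 hours regardless of how many workers run at once, which may not hold on a real cluster.

```r
n_tasks <- 100                 # number of tasks reported in the post
hours_per_round <- c(4, 5)     # observed duration of one round, in hours

# Number of sequential rounds needed for a given worker count
rounds <- function(n_workers) ceiling(n_tasks / n_workers)

rounds_38  <- rounds(38)       # single node, 38 workers
rounds_100 <- rounds(100)      # hypothetical 100 workers across nodes

# Estimated wall-clock range (hours) for each setup
est_38  <- rounds_38 * hours_per_round
est_100 <- rounds_100 * hours_per_round
```

Under that assumption, 38 workers need 3 rounds (12-15 hours, close to the reported 12-14), while 100 workers need a single round (4-5 hours).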

The HPC environment uses a SLURM scheduler. It would be great to be able to divide the task among 100 workers spread over a fixed or arbitrary number of nodes (e.g. 25 cores x 4 nodes vs. 100 cores x any number of nodes).

I have tried many different approaches so far but here is the code associated with my last attempt.

library(GSVA)
library(readxl)
library(dplyr)
library(tidyr)
library(xlsx)
library(BiocParallel)
library(batchtools)

# load the dataset 
load("./scRNAseq.RData")
data = mat %>% as.matrix()

## gene set from Excel so that it matches the online version perfectly
gene.excel = read_excel("./1.xlsx") %>% as.data.frame()

  # set the proper format
  genes = gene.excel[,1] %>% na.omit() %>% as.data.frame()
  genes = list(as.character(genes[,1]))

template = system.file( package="BiocParallel", "unitTests","test_script","test-sge-template.tmpl" )
#batchtoolsTemplate("slurm")

### GSVA
gsva1 = gsva(data, #must be a matrix
             genes, 
             verbose = TRUE,
             method = "gsva",
             BPPARAM = BatchtoolsParam(workers=100,
                                       cluster="slurm",
                                       template=template
                                       )) %>% t()

I would like to mention that it is my first attempt at multi-node parallelization...

Thanks in advance for your support !

Youri

rcastelo commented 1 year ago

Hi, I have not used BiocParallel with a BatchtoolsParam object on a cluster. Have you looked at the BiocParallel vignette entitled "Introduction to BatchtoolsParam"?

I've googled a bit and found a working example by Nitesh Turaga at https://github.com/nturaga/biocparallel_singularity. It is based on Singularity containers, but you can probably skip the container details. From a quick read of the documentation, a key ingredient seems to be the SLURM template file. If you give it a try and get an error, I can try to help you figure out what the problem may be.
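As a starting point, here is a minimal, untested sketch of pairing BatchtoolsParam() with a SLURM template rather than the SGE test template used in the attempt above. It assumes BiocParallel and batchtools are installed and that the code runs on a node where SLURM commands are available. The helper name make_slurm_param is hypothetical, and the resource values are placeholders. The resource names (walltime, memory, ncpus) are the ones that, as far as I can tell, the "slurm-simple" template bundled with batchtools reads, so a site-specific template may expect different ones.

```r
# Sketch only: build a BatchtoolsParam that submits workers as SLURM jobs.
# Assumes BiocParallel and batchtools are installed; values are placeholders.
make_slurm_param <- function(workers  = 100L,
                             walltime = 6 * 3600,  # seconds (assumed limit)
                             memory   = 4096,      # MB per CPU (assumed)
                             ncpus    = 1L) {      # CPUs per worker job
  BiocParallel::BatchtoolsParam(
    workers   = workers,
    cluster   = "slurm",
    # Default SLURM template that BiocParallel looks up in batchtools
    template  = BiocParallel::batchtoolsTemplate("slurm"),
    # Values substituted into the template when each job is submitted
    resources = list(walltime = walltime, memory = memory, ncpus = ncpus)
  )
}

# Hypothetical usage, mirroring the call in the question:
# param <- make_slurm_param()
# gsva1 <- GSVA::gsva(data, genes, method = "gsva",
#                     verbose = TRUE, BPPARAM = param)
```

If the cluster requires a partition, account, or module loads, these would normally go into a copy of the template file, with its path passed through the template argument instead of the default.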