gsva() parallelization over multiple nodes

Hi!

In the gsva() function, we can use the parallel.sz argument to parallelize the task over multiple cores. This works really well, but it limits you to the number of cores available on your computer or on a single node (if you run the function on a cluster).

On the cluster I'm using, each node is limited to 40 physical cores. With parallel.sz, I can spread the 100 tasks over 38 workers, so the job is processed in 3 "batches" (i.e. tasks 1-38, 39-76, and 77-100). But each "batch" runs for 4-5 hours, so the whole process takes 12 to 14 hours.

I could cut the total running time by a factor of 3, down to 4-5 hours, if I could spread the tasks over 100 workers. To do this, I would need BatchtoolsParam() from BiocParallel instead of the MulticoreParam() that is already implemented. I know it is already possible to pass a BatchtoolsParam() through the BPPARAM argument of gsva(), but I simply can't make it work.
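The batch arithmetic above can be written out as a short base R sketch. It assumes that every task takes the same 4-5 hours and that batches run strictly one after another; the names `n_tasks`, `hours_per_batch`, `batches` and `wall_time` are made up for illustration.

```r
# Rough wall-time estimate: tasks are processed in batches of `workers`,
# and each batch is assumed to take 4-5 hours (figures from the post).
n_tasks <- 100
hours_per_batch <- c(min = 4, max = 5)

# Number of sequential batches needed for a given worker count
batches <- function(workers) ceiling(n_tasks / workers)

# Estimated total wall time (hours) as a min/max range
wall_time <- function(workers) batches(workers) * hours_per_batch

batches(38)     # 3 batches on a single 40-core node
wall_time(38)   # min 12, max 15 hours
batches(100)    # 1 batch when every task has its own worker
wall_time(100)  # min 4, max 5 hours
```

Under these assumptions the single-node estimate of 12-15 hours is in line with the 12 to 14 hours observed, and 100 workers would bring the run down to a single 4-5 hour batch.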

The HPC environment uses a SLURM scheduler. It would be great to be able to spread the tasks over 100 workers located on either a fixed or an arbitrary number of nodes (e.g. 25 cores x 4 nodes vs. 100 cores spread over any number of nodes).

I have tried many different approaches so far, but here is the code from my last attempt.

library(GSVA)
library(readxl)
library(dplyr)
library(tidyr)
library(xlsx)
library(BiocParallel)
library(batchtools)

# load the dataset 
load("./scRNAseq.RData")
data = mat %>% as.matrix()

## gene set read from Excel so that it matches the online version exactly
gene.excel = read_excel("./1.xlsx") %>% as.data.frame()

  # set the proper format: gsva() expects a list of gene sets
  genes = gene.excel[,1] %>% na.omit() %>% as.data.frame()
  genes = list(as.character(genes[,1]))

template = system.file( package="BiocParallel", "unitTests","test_script","test-sge-template.tmpl" )
#batchtoolsTemplate("slurm")

### GSVA
gsva1 = gsva(data, # must be a matrix
             genes, 
             verbose = TRUE,
             method = "gsva",
             BPPARAM = BatchtoolsParam(workers=100,
                                       cluster="slurm",
                                       template=template
                                       )) %>% t()

I would like to mention that this is my first attempt at multi-node parallelization...

Thanks in advance for your support!

Youri
