rcastelo / GSVA

Gene set variation analysis

Negative length vectors are not allowed for large dataset #217

Open Zhixuan-Jing opened 2 weeks ago

Zhixuan-Jing commented 2 weeks ago

I ran GSVA on a small bulk RNA-seq dataset and it worked well. However, when I ran it on a large single-cell dataset with MSigDB gene sets, the error in the title occurred. My code and data type are shown below:

    # cancer is a Seurat object with about 690,000 cells
    exp <- cancer@assays[['RNA']]$counts
    exp <- as.matrix(exp)
    hEMT <- c('ACKR3', 'ADM', ...)
    gs_list <- list(c(hEMT))
    names(gs_list) <- c("hEMT")
    gsva_par <- gsvaParam(exp, gs_list, kcdf = 'Gaussian', minSize = 15, maxSize = 500, maxDiff = TRUE)
    gsva_es <- GSVA::gsva(gsva_par)

and the result is as follows: (image)

rcastelo commented 1 week ago

Dear @Zhixuan-Jing, yesterday we released a new version 2.0 of GSVA (see https://bioconductor.org/install for installation instructions). It has a specific "sparse" regime that allows GSVA to deal efficiently with single-cell data stored either in dgCMatrix objects or in SingleCellExperiment objects that use dgCMatrix objects to store their assay data.

Regarding the specific code you are showing, I'd say that, according to the Seurat wiki, you need to grab the data slot, which should contain the log-normalized counts, rather than the raw counts. If that slot is a dgCMatrix, you should be able to build the parameter object directly from that dgCMatrix object, without converting it to a dense matrix. Beyond that, you only need to specify your gene sets and the minimum and maximum sizes of the gene sets you want to analyze. The BPPARAM parameter of gsva() should allow you to parallelize calculations.

Please consult the help page of gsvaParam() and do not hesitate to get back in touch in case of problems or questions.
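A minimal sketch of what this advice could look like in R, assuming GSVA >= 2.0 with the Matrix and BiocParallel packages installed. The toy matrix, the gene names (`GENE1`, ...) and the cell names are made up for illustration and stand in for the log-normalized data of the Seurat object. With the real object, `exp_sparse` would presumably be taken from the data slot (for example `cancer@assays[['RNA']]$data`, by analogy with the code above) and left as a dgCMatrix, with no call to `as.matrix()`.

```r
library(Matrix)
library(GSVA)
library(BiocParallel)

# Hypothetical stand-in for the log-normalized data:
# a sparse genes x cells matrix of class dgCMatrix.
set.seed(1)
n_genes <- 200
n_cells <- 50
exp_sparse <- rsparsematrix(n_genes, n_cells, density = 0.2,
                            rand.x = function(n) rexp(n) + 0.1)
rownames(exp_sparse) <- paste0("GENE", seq_len(n_genes))
colnames(exp_sparse) <- paste0("cell", seq_len(n_cells))

# Hypothetical gene set; a real analysis would use the hEMT genes instead.
gs_list <- list(hEMT = paste0("GENE", 1:20))

# Build the parameter object directly from the sparse matrix,
# giving only the gene sets and their size limits.
gsva_par <- gsvaParam(exp_sparse, gs_list, minSize = 15, maxSize = 500)

# SerialParam() runs on a single core; a parallel back-end such as
# MulticoreParam(workers = 4) can be passed through BPPARAM instead.
gsva_es <- gsva(gsva_par, BPPARAM = SerialParam())
dim(gsva_es)
```

The point of the sketch is that the expression data never leaves its sparse representation. Whether 690,000 cells fit in memory will still depend on the machine and on the number of gene sets.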