satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

Integrating large data set question #4419

Closed s2hui closed 3 years ago

s2hui commented 3 years ago

Hello,

I have 55 single-cell data sets I would like to integrate (over 200K cells in total). Each data set belongs to 1 of 6 histologies in the disease we are studying.

An initial rpca integration using 6 reference data sets ran out of memory (running with 1T mem, 1 node, 1 core).

> tail -f SubmitIntegrate-v3.sh-2311581.out
Finding integration vector weights
Integrating data
Integrating dataset 55 with reference dataset
Finding integration vectors
Finding integration vector weights
Integrating data
Error in .cbind2Csp(x, y) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
  Calls: IntegrateData ... cbind -> cbind2 -> cbind2 -> cbind2sparse -> .cbind2Csp
  Execution halted

Previously, I had successfully integrated 25 data sets (100K cells, 1 histology) using rpca (running with 180G mem, 1 node, 1 core).

Rough code:

library(Seurat)
library(magrittr)  # for %>%

# SCTransform was run on each object in so_list beforehand
so_features <- SelectIntegrationFeatures(object.list = so_list, nfeatures = 2000)
so_list <- PrepSCTIntegration(object.list = so_list, anchor.features = so_features, verbose = FALSE)
so_list <- lapply(X = so_list, FUN = RunPCA, verbose = FALSE, features = so_features)
so_anchors <- FindIntegrationAnchors(object.list = so_list, normalization.method = "SCT", anchor.features = so_features, reference = c(26, 48, 41, 37, 40, 29), reduction = "rpca", verbose = FALSE)
all_genes <- lapply(so_list, row.names) %>% Reduce(intersect, .) # gene names present in ALL SCTransform'd datasets
so_integrated <- IntegrateData(anchorset = so_anchors, normalization.method = "SCT", verbose = TRUE, features.to.integrate = all_genes)

I am wondering if there is anything I can do to address the memory issue (Cholmod "problem too large"). For example, I currently use 6 reference data sets (~30K cells); maybe I should reduce this?

I have noticed that others have run integration successfully on 500K cells (#3889), albeit integrating only 2 data sets.

Thanks for any insight, shui

timoast commented 3 years ago

I would recommend integrating only the variable genes rather than all genes; this should substantially reduce the memory requirements. Typically the integrated data is used to compute a new PCA, in which case you only need the variable genes.
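A minimal sketch of that change, assuming the so_anchors object built in the code above: drop the features.to.integrate argument, and IntegrateData will default to the anchor features chosen by SelectIntegrationFeatures.

# Integrate only the anchor (variable) features rather than all shared genes.
# Omitting features.to.integrate lets it default to the anchor features, so the
# integrated assay stays at ~2000 genes x cells instead of all genes x cells.
so_integrated <- IntegrateData(anchorset = so_anchors,
                               normalization.method = "SCT",
                               verbose = TRUE)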

s2hui commented 3 years ago

Thanks, your suggestion worked!

hanhyebin commented 2 years ago

Hi, I have a follow-up question regarding this: which function & argument should I adjust to integrate only variable genes for SCTransform?

I have similar code to what @s2hui had:

list <- lapply(X = list, FUN = SCTransform, method = "glmGamPoi", residual.features = )
features <- SelectIntegrationFeatures(object.list = list, nfeatures = 3000)
list <- PrepSCTIntegration(object.list = list, anchor.features = features)
list <- lapply(X = list, FUN = RunPCA, features = features)

anchors <- FindIntegrationAnchors(object.list = list, anchor.features = features,
                                  normalization.method = "SCT", reduction = "rpca", k.anchor = 5)

combo <- IntegrateData(anchorset = anchors, normalization.method = "SCT", dims = 1:30)