satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

clustering on very large data #5445

Closed xiyupeng closed 2 years ago

xiyupeng commented 2 years ago

Hello, Seurat developers!

I am working on flow data, which is single-cell data with only dozens of markers. I want to cluster a dataset with 10M+ cells, but I am currently testing the pipeline on smaller subsets. I plan to use the default Louvain method for clustering; my parameter settings are below.

pool_X50 <- RunUMAP(pool_X50, dims = 1:26)
pool_X50 <- FindNeighbors(pool_X50, dims = 1:26, nn.eps = 0.5, k.param = 10)
pool_X50 <- FindClusters(pool_X50, resolution = 0.5)

It works well on a subset of 1M cells. To my surprise, FindClusters() is the most time-consuming step: it takes about 2 hours with 1M cells. With 3M cells, it has already been running for about 4 days. UMAP takes about 2.5 hours on the same 3M subset. For testing, I am using a single core and about 40 GB of memory. Below is the output of the Louvain method.

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 3152038
Number of edges: 21877627

Running Louvain algorithm...
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
Maximum modularity in 10 random starts: 0.9224
Number of communities: 6319
Elapsed time: 9177 seconds
**************************************************|

I am new to Seurat. Which implementation of the Louvain method does the Seurat package use, and does it scale to millions of cells? I previously thought that building the SNN graph would be the most time- and memory-consuming step, but I was wrong. Can you recommend other scalable clustering algorithms that could be applied to millions of cells?

Thank you!

Best, Xiyu

torkencz commented 2 years ago

For details on the clustering algorithm, I would refer you to https://satijalab.org/seurat/reference/findclusters, https://www.biostars.org/p/445075/ and https://github.com/satijalab/seurat/issues/3038. Using multiple cores can help speed things up. The fastest option would probably be one of the GPU-based clustering algorithms that are available.
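
One setting that may be worth trying before switching tools: the log line "Maximum modularity in 10 random starts" corresponds to the n.start argument of FindClusters(), which defaults to 10. The following is a minimal sketch, not a tested recipe for 3M cells. The helper name cluster_fewer_starts and its n_start argument are hypothetical; the FindNeighbors() settings are copied from the original post.

```r
library(Seurat)

# Hypothetical helper (not part of Seurat): rebuild the SNN graph with the
# settings from the original post, then run Louvain with fewer random starts.
cluster_fewer_starts <- function(obj, dims = 1:26, resolution = 0.5, n_start = 1) {
  obj <- FindNeighbors(obj, dims = dims, nn.eps = 0.5, k.param = 10)
  FindClusters(
    obj,
    resolution = resolution,
    algorithm = 1,      # 1 = original Louvain, the FindClusters() default
    n.start = n_start,  # default is 10 ("10 random starts" in the log)
    random.seed = 0     # fixed seed so repeated runs are comparable
  )
}

# Usage, assuming pool_X50 already holds a PCA reduction with at least 26 dims:
# pool_X50 <- cluster_fewer_starts(pool_X50)
```

Fewer starts should reduce runtime, at the cost of possibly settling on a partition with slightly lower modularity, so it is worth comparing the results on the 1M subset first. On multiple cores: as far as I know, Seurat's future-based parallelization applies to FindClusters() only when several resolution values are passed at once, so extra cores may not shorten a single-resolution run.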