Clusters for max 5000 cells

julien-roux commented 5 years ago

This issue is discussed in part here: https://github.com/hemberg-lab/SC3/issues/61

I have a dataset of 31,953 cells from a 10x Genomics experiment that I loaded using the DropletUtils package. After replacing the sparse logcounts matrix with a plain (dense) matrix, SC3 runs without error. However, only 5,000 of the cells are clustered. As far as I can tell, there is no related warning or message in the text output.

What could be the problem? Have you or anyone already managed to cluster a dataset of more than 5,000 cells? Here are my commands and session info:

> logcounts(sce) <- as.matrix(logcounts(sce))
> rowData(sce)$feature_symbol <- rowData(sce)$symbol
> sce <- sc3(sce, ks = 5:12, biology = FALSE, gene_filter = FALSE)

> table(is.na(sce$sc3_5_clusters))
FALSE  TRUE 
 5000 26953 

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /scicore/soft/apps/OpenBLAS/0.2.13-GCC-4.8.4-LAPACK-3.5.0/lib/libopenblas_prescottp-r0.2.13.so

locale:
 [1] LC_CTYPE=C                 LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] SC3_1.8.0                   BiocInstaller_1.30.0       
 [3] scater_1.8.3                SingleCellExperiment_1.2.0 
 [5] SummarizedExperiment_1.10.1 DelayedArray_0.6.4         
 [7] BiocParallel_1.14.2         matrixStats_0.54.0         
 [9] GenomicRanges_1.32.6        GenomeInfoDb_1.16.0        
[11] IRanges_2.14.10             S4Vectors_0.18.3           
[13] ggplot2_3.0.0               Biobase_2.40.0             
[15] BiocGenerics_0.26.0        

loaded via a namespace (and not attached):
 [1] bitops_1.0-6             RColorBrewer_1.1-2       doParallel_1.0.11       
 [4] tools_3.5.0              doRNG_1.7.1              R6_2.2.2                
 [7] KernSmooth_2.23-15       vipor_0.4.5              lazyeval_0.2.1          
[10] colorspace_1.3-2         withr_2.1.2              tidyselect_0.2.4        
[13] gridExtra_2.3            compiler_3.5.0           pkgmaker_0.27           
[16] labeling_0.3             caTools_1.17.1.1         scales_1.0.0            
[19] mvtnorm_1.0-8            DEoptimR_1.0-8           robustbase_0.93-2       
[22] stringr_1.3.1            digest_0.6.15            XVector_0.20.0          
[25] rrcov_1.4-4              pkgconfig_2.0.1          htmltools_0.3.6         
[28] bibtex_0.4.2             WriteXLS_4.0.0           limma_3.36.2            
[31] rlang_0.2.1              shiny_1.1.0              DelayedMatrixStats_1.2.0
[34] bindr_0.1.1              gtools_3.8.1             dplyr_0.7.6             
[37] RCurl_1.95-4.11          magrittr_1.5             GenomeInfoDbData_1.1.0  
[40] Matrix_1.2-14            Rcpp_0.12.18             ggbeeswarm_0.6.0        
[43] munsell_0.5.0            Rhdf5lib_1.2.1           viridis_0.5.1           
[46] stringi_1.2.4            edgeR_3.22.3             zlibbioc_1.26.0         
[49] rhdf5_2.24.0             gplots_3.0.1             plyr_1.8.4              
[52] grid_3.5.0               gdata_2.18.0             promises_1.0.1          
[55] shinydashboard_0.7.0     crayon_1.3.4             lattice_0.20-35         
[58] cowplot_0.9.3            locfit_1.5-9.1           pillar_1.3.0            
[61] rjson_0.2.20             rngtools_1.3.1           reshape2_1.4.3          
[64] codetools_0.2-15         glue_1.3.0               data.table_1.11.4       
[67] httpuv_1.4.5             foreach_1.4.4            gtable_0.2.0            
[70] purrr_0.2.5              assertthat_0.2.0         mime_0.5                
[73] xtable_1.8-2             e1071_1.7-0              later_0.7.3             
[76] pcaPP_1.9-73             class_7.3-14             viridisLite_0.3.0       
[79] tibble_1.4.2             pheatmap_1.0.10          iterators_1.0.10        
[82] registry_0.5             beeswarm_0.2.3           tximport_1.8.0          
[85] bindrcpp_0.2.2           cluster_2.0.7-1          ROCR_1.0-7

mhemberg commented 5 years ago

For large datasets, SC3 uses a hybrid strategy whereby a random subset of the cells (5,000 by default) is clustered as normal. An SVM classifier can then be trained to predict the cluster assignments of the remaining cells. You can adjust this threshold using the prepare_for_svm command, and you can run the SVM to obtain predictions using the sc3_run_svm command. Please see the reference manual and the last section of the vignette for additional information and an example.
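
For readers landing here, a minimal sketch of the two-step workflow described above, modelled on the last section of the vignette. It assumes the small `yan` / `ann` example data shipped with SC3 (90 cells) rather than the 31,953-cell dataset from the question, and it assumes the training-subset size is set through the `svm_num_cells` argument of `sc3()` (with `svm_max` as the size threshold, 5000 by default); please check the reference manual for your SC3 version, as argument names may differ.

```r
library(SingleCellExperiment)
library(SC3)

# Build a SingleCellExperiment from the example data bundled with SC3
# (assumption: `yan` and `ann` are available after loading the package).
sce <- SingleCellExperiment(
    assays = list(
        counts = as.matrix(yan),
        logcounts = log2(as.matrix(yan) + 1)
    ),
    colData = ann
)
rowData(sce)$feature_symbol <- rownames(sce)
sce <- sce[!duplicated(rowData(sce)$feature_symbol), ]

# Step 1: cluster only a random training subset of 50 cells.
# The remaining cells get NA in the sc3_<k>_clusters columns.
sce <- sc3(sce, ks = 2:4, biology = FALSE, svm_num_cells = 50)
n_trained <- sum(!is.na(sce$sc3_3_clusters))
print(table(is.na(sce$sc3_3_clusters)))

# Step 2: train the SVM on the clustered subset and predict
# labels for the cells that were left out in step 1.
sce <- sc3_run_svm(sce, ks = 2:4)
print(table(is.na(sce$sc3_3_clusters)))
```

For the dataset in the question, the equivalent second step would presumably be `sce <- sc3_run_svm(sce, ks = 5:12)`, run after the original `sc3()` call.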

julien-roux commented 5 years ago

Oh, I didn't realize this was a separate step, sorry!

Out of curiosity, why did you implement this hybrid strategy? Because clustering more than 5,000 cells takes a disproportionate amount of time? Or memory?

Clusters for max 5000 cells #77