stemangiola / tidySingleCellExperiment

Brings SingleCellExperiment objects to the tidyverse
https://stemangiola.github.io/tidySingleCellExperiment/index.html

aggregate_cells takes too long #110

Open MaximilianNuber opened 3 days ago

MaximilianNuber commented 3 days ago

Dear Dr. Mangiola,

Thank you for the very nice package. I am working with large-scale single-cell RNA-seq data and want to use tidySingleCellExperiment. I discovered that aggregate_cells takes much longer than aggregateAcrossCells.

As I usually work on a server, I reproduced the problem on my laptop with a 225k-cell dataset: https://cellxgene.cziscience.com/e/dea717d4-7bc0-4e46-950f-fd7e1cc8df7d.cxg/

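library(Seurat)   # provides as.SingleCellExperiment()
library(scuttle)  # provides aggregateAcrossCells()
library(tidySingleCellExperiment)
library(tidySummarizedExperiment)
# Read the Seurat object, then convert it to a SingleCellExperiment
sce <- readr::read_rds("Seurat_kidney.rds")
sce <- as.SingleCellExperiment(sce)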

aggregateAcrossCells runs fast:

system.time(pbulk <- aggregateAcrossCells(sce, ids = colData(sce)[, c("donor_id", "cell_type")]))
 user  system elapsed 
 11.690   2.481  16.056 

This code ran for a very long time, and I interrupted it after about 10 minutes:

system.time(pbulk <- aggregateAcrossCells(sce, ids = colData(sce)[, c("donor_id", "cell_type")]))

I looked at this with Michael Love, and we found this may be an issue with the combination of donor and cell type. This code took just a few seconds:

system.time(
    pbulk <- sce %>%
        aggregate_cells(cell_type, assays = "counts")
)
 user  system elapsed 
 10.164   2.333  13.953 
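To illustrate why the combination of columns might matter: grouping by two columns can produce up to (number of donors × number of cell types) groups, far more than grouping by cell type alone. The sketch below uses base R only and invented labels; the donor and cell type values are hypothetical and are not taken from the kidney dataset. It only counts groups and does not show where aggregate_cells actually spends its time.

```r
# Toy metadata; donor and cell-type labels are invented for illustration.
set.seed(1)
meta <- data.frame(
  donor_id  = sample(paste0("donor", 1:20), 1000, replace = TRUE),
  cell_type = sample(paste0("type", 1:15), 1000, replace = TRUE)
)

# Number of groups when aggregating by cell type only
n_single <- length(unique(meta$cell_type))

# Number of groups when aggregating by donor and cell type together
n_combined <- nrow(unique(meta[, c("donor_id", "cell_type")]))

# Upper bound on the combined group count: donors x cell types
n_max <- length(unique(meta$donor_id)) * n_single

print(c(single = n_single, combined = n_combined, max = n_max))
```

Running the same two counts on colData(sce) would show how many pseudobulk groups each call has to build.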

Thank you for any help!

Output of sessionInfo():

R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.2.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Rome
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidySummarizedExperiment_1.14.0 ttservice_0.4.1                
 [3] tidyr_1.3.1                     tidySingleCellExperiment_1.14.0
 [5] muscData_1.18.0                 ExperimentHub_2.12.0           
 [7] AnnotationHub_3.12.0            BiocFileCache_2.12.0           
 [9] dbplyr_2.5.0                    rpx_2.12.0                     
[11] edgeR_4.2.0                     stringr_1.5.1                  
[13] pheatmap_1.0.12                 celldex_1.14.0                 
[15] SingleR_2.6.0                   igraph_2.0.3                   
[17] GGally_2.2.1                    NewWave_1.14.0                 
[19] scry_1.16.0                     scDblFinder_1.18.0             
[21] scran_1.32.0                    scater_1.32.0                  
[23] ggplot2_3.5.1                   EnsDb.Hsapiens.v86_2.99.0      
[25] ensembldb_2.28.0                AnnotationFilter_1.28.0        
[27] GenomicFeatures_1.56.0          AnnotationDbi_1.66.0           
[29] scuttle_1.14.0                  DropletUtils_1.24.0            
[31] SingleCellExperiment_1.26.0     SummarizedExperiment_1.34.0    
[33] GenomicRanges_1.56.0            GenomeInfoDb_1.40.0            
[35] IRanges_2.38.0                  S4Vectors_0.42.0               
[37] MatrixGenerics_1.16.0           matrixStats_1.3.0              
[39] DropletTestFiles_1.14.0         dplyr_1.1.4                    
[41] limma_3.60.3                    RcppSpdlog_0.0.17              
[43] Seurat_5.0.3                    cellxgene.census_1.14.1        
[45] SeuratObject_5.0.1              sp_2.1-4                       
[47] GEOquery_2.72.0                 Biobase_2.64.0                 
[49] BiocGenerics_0.50.0            

loaded via a namespace (and not attached):
  [1] R.methodsS3_1.8.2         vroom_1.6.5               RcppCCTZ_0.2.12          
  [4] spdl_0.0.5                goftest_1.2-3             Biostrings_2.72.1        
  [7] HDF5Array_1.32.0          vctrs_0.6.5               spatstat.random_3.2-3    
 [10] digest_0.6.35             png_0.1-8                 aws.signature_0.6.0      
 [13] gypsum_1.0.1              tiledb_0.27.0             ggrepel_0.9.5            
 [16] deldir_2.0-4              parallelly_1.37.1         MASS_7.3-60.2            
 [19] reshape2_1.4.4            httpuv_1.6.15             withr_3.0.0              
 [22] xfun_0.43                 aws.s3_0.3.21             ellipsis_0.3.2           
 [25] survival_3.5-8            memoise_2.0.1             ggbeeswarm_0.7.2         
 [28] zoo_1.8-12                pbapply_1.7-2             R.oo_1.26.0              
 [31] KEGGREST_1.44.1           promises_1.3.0            httr_1.4.7               
 [34] restfulr_0.0.15           globals_0.16.3            fitdistrplus_1.1-11      
 [37] rhdf5filters_1.16.0       ps_1.7.6                  rhdf5_2.48.0             
 [40] rstudioapi_0.16.0         nanotime_0.3.7            UCSC.utils_1.0.0         
 [43] miniUI_0.1.1.1            generics_0.1.3            processx_3.8.4           
 [46] base64enc_0.1-3           curl_5.2.1                zlibbioc_1.50.0          
 [49] ScaledMatrix_1.12.0       polyclip_1.10-6           glmpca_0.2.0             
 [52] GenomeInfoDbData_1.2.12   SparseArray_1.4.3         desc_1.4.3               
 [55] xtable_1.8-4              evaluate_0.23             S4Arrays_1.4.0           
 [58] hms_1.1.3                 irlba_2.3.5.1             colorspace_2.1-0         
 [61] filelock_1.0.3            ROCR_1.0-11               reticulate_1.36.1        
 [64] spatstat.data_3.0-4       magrittr_2.0.3            lmtest_0.9-40            
 [67] readr_2.1.5               nanoarrow_0.4.0.1         later_1.3.2              
 [70] viridis_0.6.5             lattice_0.22-6            spatstat.geom_3.2-9      
 [73] future.apply_1.11.2       scattermore_1.2           XML_3.99-0.16.1          
 [76] triebeard_0.4.1           cowplot_1.1.3             RcppAnnoy_0.0.22         
 [79] pillar_1.9.0              nlme_3.1-164              sna_2.7-2                
 [82] compiler_4.4.0            beachmat_2.20.0           RSpectra_0.16-1          
 [85] stringi_1.8.3             tensor_1.5                GenomicAlignments_1.40.0 
 [88] plyr_1.8.9                crayon_1.5.2              abind_1.4-5              
 [91] BiocIO_1.14.0             locfit_1.5-9.9            bit_4.0.5                
 [94] codetools_0.2-20          BiocSingular_1.20.0       alabaster.ranges_1.4.1   
 [97] plotly_4.10.4             mime_0.12                 intergraph_2.0-4         
[100] splines_4.4.0             Rcpp_1.0.12               fastDummies_1.7.3        
[103] sparseMatrixStats_1.16.0  knitr_1.46                blob_1.2.4               
[106] utf8_1.2.4                BiocVersion_3.19.1        fs_1.6.4                 
[109] listenv_0.9.1             DelayedMatrixStats_1.26.0 pkgbuild_1.4.4           
[112] tibble_3.2.1              Matrix_1.7-0              callr_3.7.6              
[115] statmod_1.5.0             tzdb_0.4.0                network_1.18.2           
[118] pkgconfig_2.0.3           tools_4.4.0               cachem_1.0.8             
[121] RSQLite_2.3.7             viridisLite_0.4.2         DBI_1.2.2                
[124] fastmap_1.1.1             rmarkdown_2.26            scales_1.3.0             
[127] grid_4.4.0                ica_1.0-3                 Rsamtools_2.20.0         
[130] coda_0.19-4.1             patchwork_1.2.0           ggstats_0.6.0            
[133] BiocManager_1.30.23       dotCall64_1.1-1           alabaster.schemas_1.4.0  
[136] RANN_2.6.1                farver_2.1.1              yaml_2.3.8               
[139] rtracklayer_1.64.0        cli_3.6.2                 purrr_1.0.2              
[142] leiden_0.4.3.1            lifecycle_1.0.4           uwot_0.2.2               
[145] arrow_16.1.0              bluster_1.14.0            BiocParallel_1.38.0      
[148] gtable_0.3.5              rjson_0.2.21              ggridges_0.5.6           
[151] progressr_0.14.0          parallel_4.4.0            jsonlite_1.8.8           
[154] RcppHNSW_0.6.0            bitops_1.0-7              bit64_4.0.5              
[157] assertthat_0.2.1          xgboost_1.7.7.1           Rtsne_0.17               
[160] alabaster.matrix_1.4.1    spatstat.utils_3.0-4      BiocNeighbors_1.22.0     
[163] urltools_1.7.3            alabaster.se_1.4.1        metapod_1.12.0           
[166] dqrng_0.3.2               R.utils_2.12.3            alabaster.base_1.4.1     
[169] lazyeval_0.2.2            shiny_1.8.1.1             htmltools_0.5.8.1        
[172] sctransform_0.4.1         rappdirs_0.3.3            glue_1.7.0               
[175] spam_2.10-0               httr2_1.0.1               XVector_0.44.0           
[178] RCurl_1.98-1.14           gridExtra_2.3             tiledbsoma_1.11.1        
[181] R6_2.5.1                  DESeq2_1.44.0             labeling_0.4.3           
[184] SharedObject_1.18.0       cluster_2.1.6             pkgload_1.3.4            
[187] Rhdf5lib_1.26.0           statnet.common_4.9.0      DelayedArray_0.30.1      
[190] tidyselect_1.2.1          vipor_0.4.7               ProtGenerics_1.36.0      
[193] xml2_1.3.6                future_1.33.2             rsvd_1.0.5               
[196] munsell_0.5.1             KernSmooth_2.23-22        data.table_1.15.4        
[199] htmlwidgets_1.6.4         RColorBrewer_1.1-3        rlang_1.1.3              
[202] spatstat.sparse_3.0-3     spatstat.explore_3.2-7    remotes_2.5.0            
[205] fansi_1.0.6               beeswarm_0.4.0    

Thanks!

MaximilianNuber commented 3 days ago

My apologies, I copied the wrong chunk for the slow example. The following chunk is the one that takes longer than 10 minutes:

system.time(pbulk <- sce %>% 
        aggregate_cells(c(donor_id, cell_type), assays="counts"))