CLR transformation on RNA counts

amberbangma commented 3 years ago

Dear all,

I noticed that when I apply the CLR transformation to my RNA counts, the sparse matrix is converted into a regular (dense) matrix, causing my object to triple in size.

str(data.blood.log@assays[["RNA"]]@ data)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:52223298] 32 78 86 128 154 174 190 274 315 346 ...
  ..@ p       : int [1:26699] 0 1472 3804 7280 8788 10476 11882 13489 14791 17434 ...
  ..@ Dim     : int [1:2] 33538 26698
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:33538] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
  .. ..$ : chr [1:26698] "18_lane1_AAACCCACAAAGGGTC-1" "18_lane1_AAACCCACAACGTTAC-1" "18_lane1_AAACCCACACAAGGTG-1" "18_lane1_AAACCCACACATTCGA-1" ...
  ..@ x       : num [1:52223298] 1.44 1.44 1.44 1.44 2.37 ...
  ..@ factors : list()

str(data.blood.clr@assays[["RNA"]]@ data)
 num [1:33538, 1:26698] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:33538] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
  ..$ : chr [1:26698] "18_lane1_AAACCCACAAAGGGTC-1" "18_lane1_AAACCCACAACGTTAC-1" "18_lane1_AAACCCACACAAGGTG-1" "18_lane1_AAACCCACACATTCGA-1" ...

This also gives an error when normalizing my larger datasets.

data.biopsies
An object of class Seurat
60270 features across 70198 samples within 4 assays
Active assay: RNA (33538 features, 0 variable features)
 3 other assays present: HTO, SCT, integrated
 2 dimensional reductions calculated: pca, umap

data.biopsies <- NormalizeData(data.biopsies, normalization.method = "CLR")
Normalizing across features
Error in asMethod(object) :
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
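For context, a quick back-of-the-envelope check is consistent with the error coming from the sparse-to-dense conversion itself. The sketch below only uses the dimensions printed above; the variable names are made up for illustration, and the link to the Cholmod message is an assumption based on the dense matrix needing more entries than a 32-bit integer can index.

```r
# Dimensions taken from the objects printed above (features x cells).
# Written as doubles so the products cannot overflow R's integer type.
n_features       <- 33538
n_cells_blood    <- 26698
n_cells_biopsies <- 70198

# Number of entries a dense matrix of each size would need.
dense_blood    <- n_features * n_cells_blood
dense_biopsies <- n_features * n_cells_biopsies

# Largest value a 32-bit signed integer can hold (2^31 - 1).
int_max <- .Machine$integer.max

dense_blood > int_max     # FALSE: the smaller dataset still fits
dense_biopsies > int_max  # TRUE: the larger dataset does not

# Approximate memory for a dense double matrix, in GiB (8 bytes per entry).
dense_biopsies * 8 / 1024^3
```

Under this reading, the smaller dataset converts (at a large memory cost), while the larger one fails before any normalization happens.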

Would it be possible to change this in Seurat's NormalizeData function, so that CLR is more memory-friendly and also works for larger datasets? Or do you have a way to run CLR on large single-cell datasets and still analyze the result with Seurat afterwards?

Thanks! Amber

amberbangma commented 3 years ago

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] nl_NL.UTF-8/nl_NL.UTF-8/nl_NL.UTF-8/C/nl_NL.UTF-8/nl_NL.UTF-8

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods
[10] base

other attached packages:
 [1] VennDiagram_1.6.20           futile.logger_1.4.3        ggalluvial_0.12.3
 [4] circlize_0.4.11              ComplexHeatmap_2.7.1.1005  openxlsx_4.2.3
 [7] readxl_1.3.1                 MAST_1.16.0                SingleCellExperiment_1.12.0
[10] SummarizedExperiment_1.20.0  GenomicRanges_1.42.0       GenomeInfoDb_1.26.2
[13] IRanges_2.24.1               S4Vectors_0.28.1           MatrixGenerics_1.2.0
[16] matrixStats_0.57.0           tidyr_1.1.2                readr_1.4.0
[19] ggplot2_3.3.2                patchwork_1.1.0            Seurat_3.2.3
[22] dplyr_1.0.2                  CellChat_0.0.2             Biobase_2.50.0
[25] BiocGenerics_0.36.0

loaded via a namespace (and not attached):
  [1] reticulate_1.18      tidyselect_1.1.0        htmlwidgets_1.5.3     Rtsne_0.15
  [5] munsell_0.5.0        codetools_0.2-18        ica_1.0-2             future_1.21.0
  [9] miniUI_0.1.1.1       withr_2.3.0             colorspace_2.0-0      rstudioapi_0.13
 [13] ROCR_1.0-11          tensor_1.5              listenv_0.8.0         NMF_0.23.0
 [17] labeling_0.4.2       GenomeInfoDbData_1.2.4  polyclip_1.10-0       farver_2.0.3
 [21] coda_0.19-4          parallelly_1.22.0       vctrs_0.3.5           generics_0.1.0
 [25] lambda.r_1.2.4       xfun_0.19               R6_2.5.0              doParallel_1.0.16
 [29] clue_0.3-58          rsvd_1.0.3              bitops_1.0-6          spatstat.utils_1.17-0
 [33] DelayedArray_0.16.0  assertthat_0.2.1        promises_1.1.1        scales_1.1.1
 [37] gtable_0.3.0         Cairo_1.5-12.2          globals_0.14.0        goftest_1.2-2
 [41] rlang_0.4.9          systemfonts_0.3.2       GlobalOptions_0.1.2   splines_4.0.3
 [45] lazyeval_0.2.2       reshape2_1.4.4          abind_1.4-5           httpuv_1.5.4
 [49] tools_4.0.3          gridBase_0.4-7          statnet.common_4.4.1  ellipsis_0.3.1
 [53] RColorBrewer_1.1-2   ggridges_0.5.2          Rcpp_1.0.5            plyr_1.8.6
 [57] zlibbioc_1.36.0      purrr_0.3.4             RCurl_1.98-1.2        rpart_4.1-15
 [61] deldir_0.2-3         pbapply_1.4-3           GetoptLong_1.0.4      cowplot_1.1.0
 [65] zoo_1.8-8            ggrepel_0.8.2           cluster_2.1.0         tinytex_0.28
 [69] magrittr_2.0.1       data.table_1.13.4       RSpectra_0.16-0       futile.options_1.0.1
 [73] sna_2.6              lmtest_0.9-38           RANN_2.6.1            fitdistrplus_1.1-3
 [77] hms_0.5.3            mime_0.9                xtable_1.8-4          gridExtra_2.3
 [81] shape_1.4.5          compiler_4.0.3          tibble_3.0.4          KernSmooth_2.23-18
 [85] crayon_1.3.4         htmltools_0.5.0         mgcv_1.8-33           later_1.1.0.1
 [89] formatR_1.7          MASS_7.3-53             Matrix_1.2-18         cli_2.2.0
 [93] igraph_1.2.6         pkgconfig_2.0.3         registry_0.5-1        plotly_4.9.2.1
 [97] foreach_1.5.1        svglite_1.2.3.2         rngtools_1.5          pkgmaker_0.32.2
[101] XVector_0.30.0       stringr_1.4.0           digest_0.6.27         sctransform_0.3.1
[105] RcppAnnoy_0.0.17     rle_0.9.2               spatstat.data_1.5-2   cellranger_1.1.0
[109] leiden_0.3.6         uwot_0.1.9              gdtools_0.2.2         shiny_1.5.0
[113] rjson_0.2.20         lifecycle_0.2.0         nlme_3.1-151          jsonlite_1.7.2
[117] network_1.16.1       viridisLite_0.3.0       limma_3.46.0          fansi_0.4.1
[121] pillar_1.4.7         lattice_0.20-41         fastmap_1.0.1         httr_1.4.2
[125] survival_3.2-7       glue_1.4.2              zip_2.1.1             FNN_1.1.3
[129] spatstat_1.64-1      png_0.1-7               iterators_1.0.13      stringi_1.5.3
[133] irlba_2.3.3          future.apply_1.6.0

satijalab commented 3 years ago

This is an inherent issue with CLR normalization for RNA, rather than a mistake in Seurat's implementation. We do not recommend performing CLR normalization on RNA datasets. One reason, as you note, is that the normalized data is no longer sparse. The normalization strategy is much more effective for CITE-seq data, which is typically non-sparse anyway and contains far fewer features.
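For readers who still want to experiment with CLR on a large sparse matrix outside of NormalizeData, the following is a minimal sketch, not Seurat's code. It assumes a log1p-based CLR variant, log1p(x / exp(mean(log1p(x)))), applied per feature (per row, matching the "Normalizing across features" message above). Whether this reproduces NormalizeData's output exactly has not been verified here. Under that assumed formula zeros stay zero, so only the stored non-zero values need to be touched. The function names clr_sparse_rows and clr_dense_rows are hypothetical.

```r
library(Matrix)

# Hypothetical helper: row-wise CLR on a dgCMatrix without densifying.
# Assumes the variant log1p(x / exp(sum(log1p(x[x > 0])) / length(x))).
clr_sparse_rows <- function(m) {
  stopifnot(inherits(m, "dgCMatrix"))
  # Per-row sum of log1p over all cells; zeros contribute log1p(0) = 0,
  # so summing over the stored non-zero values is enough.
  logm <- m
  logm@x <- log1p(m@x)
  denom <- exp(Matrix::rowSums(logm) / ncol(m))
  # m@i holds the 0-based row index of each stored value.
  out <- m
  out@x <- log1p(m@x / denom[m@i + 1L])
  out
}

# Dense reference using the same assumed formula, for comparison only.
clr_dense_rows <- function(m) {
  t(apply(as.matrix(m), 1, function(x) {
    log1p(x / exp(sum(log1p(x[x > 0])) / length(x)))
  }))
}

# Tiny toy count matrix (4 features x 3 cells), made up for illustration.
counts <- sparseMatrix(
  i = c(1, 3, 2, 1, 4),
  j = c(1, 1, 2, 3, 3),
  x = c(5, 2, 7, 1, 3),
  dims = c(4, 3)
)

clr_sparse <- clr_sparse_rows(counts)
class(clr_sparse)  # still a sparse dgCMatrix

# The sparse and dense versions should agree on the toy data.
all.equal(as.matrix(clr_sparse), clr_dense_rows(counts),
          check.attributes = FALSE)
```

The result could then presumably be stored back with something like SetAssayData(object, slot = "data", new.data = clr_sparse), though that step is untested here, and the recommendation above against CLR for RNA data still stands.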

satijalab / seurat

CLR transformation on RNA counts #3887