as.matrix conversion: memory issues

Daliya-K commented 4 years ago

Hi, I am encountering memory problems when I try to convert my Negative ADT from sparse matrix to matrix:

neg_adt_matrix = GetAssayData(NegativeObject, assay = "ADT", slot = 'counts') %>% as.matrix() Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Here, because no hashing was performed, I used as negative all droplets with total UMI counts between 0 and 40 and the resulting matrix is 294 x 20189901.

Could you suggest some workaround? I could probably select a smaller number of empty droplets, but I am not sure if it will be representative sample in this case?

Sessioninfo: R version 4.0.2 (2020-06-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /home/daliya/anaconda3/lib/libmkl_rt.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_BE.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_BE.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_BE.UTF-8 LC_NAME=de_BE.UTF-8 LC_ADDRESS=de_BE.UTF-8 LC_TELEPHONE=de_BE.UTF-8
[11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=de_BE.UTF-8

attached base packages: [1] grid stats graphics grDevices utils datasets methods base

other attached packages: [1] VennDiagram_1.6.20 futile.logger_1.4.3 dsb_0.1.0 Matrix_1.2-18 plyr_1.8.6 ggrepel_0.8.2 xlsx_0.6.3
[8] cowplot_1.0.0 SoupX_1.4.5 dplyr_1.0.0 Seurat_3.2.0 ggplot2_3.3.2

loaded via a namespace (and not attached): [1] backports_1.1.8 igraph_1.2.5 lazyeval_0.2.2 splines_4.0.2 listenv_0.8.0 usethis_1.6.1 digest_0.6.25
[8] htmltools_0.5.0 fansi_0.4.1 magrittr_1.5 memoise_1.1.0 tensor_1.5 cluster_2.1.0 ROCR_1.0-11
[15] limma_3.44.3 remotes_2.1.1 globals_0.12.5 prettyunits_1.1.1 colorspace_1.4-1 rappdirs_0.3.1 xfun_0.15
[22] callr_3.4.3 crayon_1.3.4 jsonlite_1.7.0 textTinyR_1.1.3 spatstat_1.64-1 spatstat.data_1.4-3 survival_3.1-12
[29] zoo_1.8-8 ape_5.4 glue_1.4.1 polyclip_1.10-0 gtable_0.3.0 leiden_0.3.3 pkgbuild_1.1.0
[36] future.apply_1.6.0 abind_1.4-5 scales_1.1.1 futile.options_1.0.1 miniUI_0.1.1.1 Rcpp_1.0.5 viridisLite_0.3.0
[43] xtable_1.8-4 reticulate_1.16 rsvd_1.0.3 mclust_5.4.6 htmlwidgets_1.5.1 httr_1.4.1 RColorBrewer_1.1-2
[50] ellipsis_0.3.1 ica_1.0-2 pkgconfig_2.0.3 rJava_0.9-13 farver_2.0.3 uwot_0.1.8 deldir_0.1-28
[57] utf8_1.1.4 tidyselect_1.1.0 labeling_0.3 rlang_0.4.7 reshape2_1.4.4 later_1.1.0.1 munsell_0.5.0
[64] tools_4.0.2 cli_2.0.2 generics_0.0.2 devtools_2.3.0 ggridges_0.5.2 evaluate_0.14 stringr_1.4.0
[71] fastmap_1.0.1 yaml_2.2.1 goftest_1.2-2 processx_3.4.3 knitr_1.29 fs_1.4.2 fitdistrplus_1.1-1
[78] purrr_0.3.4 RANN_2.6.1 pbapply_1.4-2 future_1.18.0 nlme_3.1-147 mime_0.9 formatR_1.7
[85] compiler_4.0.2 rstudioapi_0.11 plotly_4.9.2.1 curl_4.3 png_0.1-7 testthat_2.3.2 spatstat.utils_1.17-0 [92] tibble_3.0.3 stringi_1.4.6 ps_1.3.3 desc_1.2.0 lattice_0.20-41 vctrs_0.3.2 pillar_1.4.6
[99] lifecycle_0.2.0 lmtest_0.9-37 RcppAnnoy_0.0.16 data.table_1.12.8 irlba_2.3.3 httpuv_1.5.4 patchwork_1.0.1.9000 [106] R6_2.4.1 promises_1.1.1 KernSmooth_2.23-17 gridExtra_2.3 sessioninfo_1.1.1 codetools_0.2-16 lambda.r_1.2.4
[113] MASS_7.3-51.6 assertthat_0.2.1 pkgload_1.1.0 xlsxjars_0.6.1 rprojroot_1.3-2 withr_2.2.0 sctransform_0.2.1
[120] mgcv_1.8-31 parallel_4.0.2 rpart_4.1-15 tidyr_1.1.0 rmarkdown_2.3 Rtsne_0.15 shiny_1.5.0
[127] base64enc_0.1-3

Thanks! Daliya

MattPM commented 4 years ago

Hi @Daliya-K Yes you are defining too many negative drops; of those 20189901 "droplets" actually most are theoretical barcodes that are included in the cellranger output but were actually not detected and they likely have no data. You will have to apply some minimal filtering or else R is storing a huge matrix of 0s in memory. Before converting to a regular R matrix, run Matrix::colSums(log10(neg_adt_matrix)) the histogram likely has highest density by far at 0 correct? Those are barcodes in the new 10X assay but we don't have data suggesting they had any molecules of protein or mRNA during library prep so were likely not actually loaded or "seen" by the assay chemistry.

To get an unbiased background estimate you can apply some basic filtering to the raw output object. I recommend you check out the histogram image I posted in my answer to the question posted in https://github.com/MattPM/dsb/issues/9 keeping in mind "region1" corresponds to droplets with no data so I define negatives with "region 2" the second peak in the log mRNA library size. Using mRNA provides some unbiased estimation because you are not thresholding with the protein library.

. I will be updating the package documentation to be more simple with some guidance on selecting background. Here is a minimal example using 10x example data: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3 and Seurat version 3.

library(dsb)
library(Seurat) # version 3 in example provided belo
library(tidyverse)
library(magrittr)
path_to_data = "data/10x_data/10x_pbmc5k_V3/raw_feature_bc_matrix/"
raw = Read10X(data.dir = path_to_data)

# create object with Minimal filtering (retain drops with 5 unique mRNAs detected)
s1 = CreateSeuratObject(counts = raw$`Gene Expression`,  min.cells = 10, min.features = 5)
# add some metadata 
s1$log10umi = log10(s1$nCount_RNA  + 1) 
s1$bc = rownames(s1@meta.data)

# define negative and positive drops based on an mRNA threshold to define neg and positive cells (see details below) 
hist(log10(s1$nCount_RNA  + 1), breaks = 1000)
neg_drops = s1@meta.data %>% filter(log10umi > 1.5 & log10umi < 2.79) %$% bc
positive_cells = s1@meta.data %>% filter(log10umi > 2.8 & log10umi < 4.4) %$% bc

# Subset protein data to create a standard R matrix of protein counts for negative droplets and cells 
neg_prot = raw$`Antibody Capture`[ ,  neg_drops ] %>% as.matrix()
positive_prot = raw$`Antibody Capture`[ , positive_cells] %>% as.matrix()

# run DSB normalization
isotypes = rownames(positive_prot)[30:32]
mtx = DSBNormalizeProtein(cell_protein_matrix = positive_prot,
                           empty_drop_matrix = neg_prot,
                           denoise.counts = TRUE, use.isotype.control = TRUE, 
                           isotype.control.name.vec = isotypes)

MattPM commented 4 years ago

And also please do let me know if the above suggestion allows you to define background that fits into memory and normalize with DSB. Thanks, -M

Daliya-K commented 4 years ago

Hi Matthew,

Thanks so much for the super fast and extensive reply! I was able to apply your strategy without any further memory issues. I just wanted to make one more remark. I've got the same error as in https://github.com/MattPM/dsb/issues/6, so I applied some filtering of my ADT features (0.9 quantile cutoff). There were indeed quite some features that were very close to 0. But I noticed, that I needed to exclude those features not only from the positive adt matrix, but also from the negative one, otherwise I kept receiving the same error. I mention this as your coment was:

update: @naity this error occurs during denoising, when you have antibodies in the cells (not the empty drops) that have close to zero counts across all cells.

Best, Daliya

MattPM commented 4 years ago

Great! That makes sense re: filtering features the negative drops and cell containing drops need to have the same features (proteins) as rows. With big panels this will likely be an important step.

I updated the main readme for the package to make it more clear how to define the negatives without hashing data.

Daliya-K commented 4 years ago

Thanks a lot for the help!

niaid / dsb

as.matrix conversion: memory issues #10