satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.26k stars 907 forks source link

NAs introduced by NormalizeData() #3805

Closed BrianLohman closed 3 years ago

BrianLohman commented 3 years ago

Hello, I am seeing a problem with the NormalizeData() function. I asked this question on #2803, but I think it deserves it's own issue.

It appears that NormalizeData is introducing NAs that result in an error in FindVariableFeatures.

Processing of UCSC data taken from here: https://cellbrowser.readthedocs.io/load.html

### download data
wget https://cells.ucsc.edu/early-brain/exprMatrix.tsv.gz
wget https://cells.ucsc.edu/early-brain/meta.tsv

### setup
mat <- fread("zcat < exprMatrix.tsv.gz")
meta <- read.table("meta.tsv", header=T, sep="\t", as.is=T, row.names=1)
genes = mat[,1][[1]]
genes = gsub(".+[|]", "", genes)
mat = data.frame(mat[,-1], row.names=genes)
so <- CreateSeuratObject(counts = mat, project = "ucsc", meta.data=meta)

### check for NAs, as suggested by issue 2803
sum(is.na(so@assays$RNA@data@x))
[1] 0

> so = NormalizeData(so)
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> so = FindVariableFeatures(so, selection.method = "vst", nfeatures = 2000, verbose = TRUE)
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Error in hvf.info$variance.expected[not.const] <- 10^fit$fitted : 
  NAs are not allowed in subscripted assignments

### check for NAs
sum(is.na(so@assays$RNA@data@x))
[1] 204

### check for NAs in expression matrix, prior to building Seurat object
> sum(is.na(mat))
[1] 0

Any help would be appreciated.

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /usr/local/R/4.0.0/lib64/R/lib/libRblas.so
LAPACK: /usr/local/R/4.0.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] splines   parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pheatmap_1.0.12     data.table_1.13.4   cowplot_1.1.0       monocle_2.16.0      DDRTree_0.1.5       irlba_2.3.3         VGAM_1.1-4         
 [8] Matrix_1.2-18       patchwork_1.1.0     readxl_1.3.1        RColorBrewer_1.1-2  hciRdata_1.100      hciR_1.3            fgsea_1.14.0       
[15] Rhdf5lib_1.10.1     limma_3.44.3        forcats_0.5.0       stringr_1.4.0       dplyr_1.0.2         purrr_0.3.4         readr_1.4.0        
[22] tidyr_1.1.2         tibble_3.0.4        tidyverse_1.3.0     ggplot2_3.3.2       matrixStats_0.57.0  Biobase_2.48.0      S4Vectors_0.26.1   
[29] BiocGenerics_0.34.0 BiocStyle_2.16.1    Seurat_3.2.2       

loaded via a namespace (and not attached):
  [1] backports_1.2.0       fastmatch_1.1-0       plyr_1.8.6            igraph_1.2.6          lazyeval_0.2.2        densityClust_0.3      BiocParallel_1.22.0  
  [8] listenv_0.8.0         fastICA_1.2-2         digest_0.6.27         htmltools_0.5.0       viridis_0.5.1         fansi_0.4.1           magrittr_2.0.1       
 [15] tensor_1.5            cluster_2.1.0         ROCR_1.0-11           globals_0.14.0        modelr_0.1.8          docopt_0.7.1          colorspace_2.0-0     
 [22] rvest_0.3.6           ggrepel_0.8.2         haven_2.3.1           xfun_0.19             sparsesvd_0.2         crayon_1.3.4          jsonlite_1.7.1       
 [29] spatstat_1.64-1       spatstat.data_1.5-2   survival_3.2-7        zoo_1.8-8             glue_1.4.2            polyclip_1.10-0       gtable_0.3.0         
 [36] leiden_0.3.6          future.apply_1.6.0    abind_1.4-5           scales_1.1.1          DBI_1.1.0             miniUI_0.1.1.1        Rcpp_1.0.5           
 [43] viridisLite_0.3.0     xtable_1.8-4          reticulate_1.18       rsvd_1.0.3            htmlwidgets_1.5.2     httr_1.4.2            FNN_1.1.3            
 [50] ellipsis_0.3.1        ica_1.0-2             farver_2.0.3          pkgconfig_2.0.3       uwot_0.1.9            dbplyr_2.0.0          deldir_0.2-3         
 [57] utf8_1.1.4            labeling_0.4.2        tidyselect_1.1.0      rlang_0.4.9           reshape2_1.4.4        later_1.1.0.1         munsell_0.5.0        
 [64] cellranger_1.1.0      tools_4.0.0           cli_2.2.0             generics_0.1.0        broom_0.7.2           ggridges_0.5.2        evaluate_0.14        
 [71] fastmap_1.0.1         yaml_2.2.1            goftest_1.2-2         knitr_1.30            fs_1.5.0              fitdistrplus_1.1-3    RANN_2.6.1           
 [78] pbapply_1.4-3         future_1.20.1         nlme_3.1-150          mime_0.9              slam_0.1-48           xml2_1.3.2            compiler_4.0.0       
 [85] rstudioapi_0.13       plotly_4.9.2.1        png_0.1-7             spatstat.utils_1.17-0 reprex_0.3.0          stringi_1.5.3         lattice_0.20-41      
 [92] HSMMSingleCell_1.8.0  vctrs_0.3.5           pillar_1.4.7          lifecycle_0.2.0       BiocManager_1.30.10   combinat_0.0-8        lmtest_0.9-38        
 [99] RcppAnnoy_0.0.17      httpuv_1.5.4          R6_2.5.0              promises_1.1.1        KernSmooth_2.23-18    gridExtra_2.3         parallelly_1.21.0    
[106] codetools_0.2-18      MASS_7.3-53           assertthat_0.2.1      withr_2.3.0           qlcMatrix_0.9.7       sctransform_0.3.1     mgcv_1.8-33          
[113] hms_0.5.3             grid_4.0.0            rpart_4.1-15          rmarkdown_2.5         Rtsne_0.15            shiny_1.5.0           lubridate_1.7.9.2 
timoast commented 3 years ago

This issue here is that the input "count" matrix contains Inf values, and does not appear to be integer counts. Running NormalizeData on values that are already Inf will result in NA. We will add a warning (or maybe throw an error) if infinite values are present in the count matrix.

BrianLohman commented 3 years ago

Hi @timoast ,

Thanks for your reply and for looking into this. I am very surprised that the values are not integers. I will have to ask UCSC what is going on.

Yes, a warning about why NA's are introduced would be much appreciated. Flagging the rows/recording the indices of the rows that throw NAs so they could be inspected would also be most welcome.

Thanks again for your

jjwestfall521 commented 3 years ago

@BrianLohman

I'm working with the same dataset from UCSC. How did you end up resolving this issue? Did you hear back from UCSC?

BrianLohman commented 3 years ago

Hi @jjwestfall521, I emailed the help email that UCSC provides and they connected me to the authors. You might just go straight to the corresponding author's email, it's probably faster.

Clara22 commented 3 years ago

Hi,

I have found the same problem with another dataset and it might be the same for you.

I checked which rows (features) had NAs like this (COV is my Seurat object, RowsNA will contain the names of all the features):

RowsNA-names(which(rowSums(is.na(COV@assays$RNA@counts))0))

I saw all the features with NAs seemed to be antibody names, which I guess they just had for a few samples so not all cells have values for these features, they look like this:

RowsNA [1] "AB-CD80" "AB-CD86" "AB-CD274"
[4] "AB-PDCD1LG2" "AB-ICOSLG" "AB-ITGAM"

so I just removed these rows/features from the object like this:

'%!in%' <- function(x,y)!('%in%'(x,y)) #this is a NOT IN function RowsKEEP<-rownames(COV)[rownames(COV) %!in% RowsNA] COV<-subset(COV,features=RowsKEEP)

and I did not get the error in FindVariableFatures anymore:

COV <- FindVariableFeatures(COV, selection.method = "vst", nfeatures = 6000) Calculating gene variances 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **| Calculating feature variances of standardized and clipped values 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **|

I hope it helps

malcook commented 3 years ago

For the benefit of anyone similarly afflicted...

I ran across this error when I was borrowing Seurat::FindVariableFeatures for another application where the mode of my matrix, not being of counts, was 'numeric'. Rounding it and setting its mode to 'integer' was my workaround.

FWIW, my application is bulk ATAC-seq reads tallied in genomic windows (a la csaw), but post quantile normalization.

Harmony714 commented 2 years ago

@Clara22 hi

thanks for your help

I have encountered the same problem, and solve it by your method

but i still have the question that why this problem occured ?

looking forward to your answer!

Clara22 commented 2 years ago

Hi @Harmony714

In the case of the dataset I mention the NAs were there because the features corresponding to antibodies were not available for all cells. In other words the dataset contained samples with RNAseq data only and samples with protein (from antibodies quantification) and RNAseq (probably from CITEseq), therefore cells corresponding to the former samples had no data (NAs) for the antibody/protein features.

I don't know where do the NAs come from in your case. You may want to inspect which rows and columns contain NAs to investigate the issue in your dataset.

Best, Clara

IrinaVKuznetsova commented 7 months ago

I found NAs introduced after normalisation step in my scRNA data the raw counts data had negative values . For some reason have negative values introduced by cell bender checked cell ranger filt. matrix, it is all good no negative raw counts.

chongjing commented 5 days ago

Sorry to bother everyone, but got same error. I opened issue https://github.com/ZJU-UoE-CCW-LAB/scCDC/issues/9, then got to know comments here. But solution from @Clara22 seems not working. Please note I am using Seurat 5.1.0. Hope someone could help, appreciate.

Here is details:

### read scCDC corrected matrix`
> mislet_seuratobj_corrected <- readRDS("02.AVS_0d_seuratobj_corrected.rds")
### set "Corrected" matrix as default
> DefaultAssay(mislet_seuratobj_corrected) <- "Corrected"

### Seurat Normalization and FindVariables
> mislet_seuratobj_corrected <- NormalizeData(mislet_seuratobj_corrected,
                                  normalization.method = "LogNormalize",scale.factor = 10000)
Normalizing layer: counts
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> mislet_seuratobj_corrected <- FindVariableFeatures(mislet_seuratobj_corrected,
                                                   selection.method = "vst", nfeatures = 2000)
Finding variable features for layer counts
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Error in hvf.info$variance.expected[not.const] <- 10^fit$fitted :
  NAs are not allowed in subscripted assignments

### try to find NA rows
> RowsNA <- names(which(rowSums(is.na(mislet_seuratobj_corrected@assays$Corrected@layers$counts))>0))
> RowsNA
NULL

Here is session info:

> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-conda-linux-gnu
Running under: Debian GNU/Linux 11 (bullseye)

Matrix products: default
BLAS/LAPACK: /data/pathology/program/Miniforge3/envs/R4.2.3/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] hdf5r_1.3.11       scCDC_1.4          tidyr_1.3.1        knitr_1.48
 [5] pROC_1.18.5        Matrix_1.7-0       ggrepel_0.9.6      pbapply_1.7-2
 [9] ggplot2_3.5.1      dplyr_1.1.4        ddpcr_1.15.2       Seurat_5.1.0
[13] SeuratObject_5.0.2 sp_2.1-4

loaded via a namespace (and not attached):
  [1] deldir_2.0-4           gridExtra_2.3          rlang_1.1.4
  [4] magrittr_2.0.3         RcppAnnoy_0.0.22       matrixStats_1.4.1
  [7] ggridges_0.5.6         compiler_4.4.1         spatstat.geom_3.3-3
 [10] png_0.1-8              vctrs_0.6.5            reshape2_1.4.4
 [13] stringr_1.5.1          pkgconfig_2.0.3        fastmap_1.2.0
 [16] utf8_1.2.4             promises_1.3.0         bit_4.5.0
 [19] purrr_1.0.2            xfun_0.47              jsonlite_1.8.9
 [22] goftest_1.2-3          later_1.3.2            spatstat.utils_3.1-0
 [25] irlba_2.3.5.1          parallel_4.4.1         cluster_2.1.6
 [28] R6_2.5.1               ica_1.0-3              stringi_1.8.4
 [31] RColorBrewer_1.1-3     spatstat.data_3.1-2    reticulate_1.39.0
 [34] parallelly_1.38.0      spatstat.univar_3.0-1  lmtest_0.9-40
 [37] scattermore_1.2        Rcpp_1.0.13            tensor_1.5
 [40] future.apply_1.11.2    zoo_1.8-12             sctransform_0.4.1
 [43] httpuv_1.6.15          splines_4.4.1          igraph_2.0.3
 [46] tidyselect_1.2.1       abind_1.4-8            spatstat.random_3.3-2
 [49] codetools_0.2-20       miniUI_0.1.1.1         spatstat.explore_3.3-2
 [52] listenv_0.9.1          lattice_0.22-6         tibble_3.2.1
 [55] plyr_1.8.9             withr_3.0.1            shiny_1.9.1
 [58] ROCR_1.0-11            Rtsne_0.17             future_1.34.0
 [61] fastDummies_1.7.4      survival_3.7-0         polyclip_1.10-7
 [64] fitdistrplus_1.2-1     pillar_1.9.0           KernSmooth_2.23-24
 [67] plotly_4.10.4          generics_0.1.3         RcppHNSW_0.6.0
 [70] munsell_0.5.1          scales_1.3.0           globals_0.16.3
 [73] xtable_1.8-4           glue_1.7.0             lazyeval_0.2.2
 [76] tools_4.4.1            data.table_1.16.0      RSpectra_0.16-2
 [79] RANN_2.6.2             leiden_0.4.3.1         dotCall64_1.1-1
 [82] cowplot_1.1.3          grid_4.4.1             colorspace_2.1-1
 [85] nlme_3.1-166           patchwork_1.3.0        cli_3.6.3
 [88] spatstat.sparse_3.1-0  spam_2.10-0            fansi_1.0.6
 [91] viridisLite_0.4.2      uwot_0.2.2             gtable_0.3.5
 [94] digest_0.6.37          progressr_0.14.0       htmlwidgets_1.6.4
 [97] farver_2.1.2           htmltools_0.5.8.1      lifecycle_1.0.4
[100] httr_1.4.7             mime_0.12              bit64_4.5.2
[103] MASS_7.3-61
chongjing commented 5 days ago

update: error solved, see details here [(https://github.com/ZJU-UoE-CCW-LAB/scCDC/issues/9)]

Hi everyone, I got new debug update here. Hope someone could explain a little bit more. https://github.com/ZJU-UoE-CCW-LAB/scCDC/issues/9#issuecomment-2386507368