satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.26k stars 908 forks source link

IntegrateData results in cells with NA values #2902

Closed shalvorsen-ccib closed 4 years ago

shalvorsen-ccib commented 4 years ago

Hello,

After data integration, I encountered a similar error as mentioned in issue #1879 (although I am not using the reciprocal PCA method) and issue #1788. I ran into this error:

Error in irlba(A = t(x = object), nv = npcs, ...) : 
  max(nu, nv) must be positive

While using the RunPCA function. That original issue never seemed resolved. My particular dataset includes about 58k cells across 26 patients. When I downsampled the number of cells to a maximum of 2k per patient, the issue went away. Eventually, I traced the issue back to the IntegrateData function. It seems to result in NA values in a few cells, which causes the ScaleData function to result in 0 values, and thus 0 variance.

I have managed to get the RunPCA function working again by identifying and removing the cells with 0 values:

na_cells <- colSums(as.matrix(is.na(seur.integrated@assays$integrated@data)))
to_remove <- names(na_cells)[na_cells>0]
if(length(to_remove) > 0){
  print("Found some cells with NA values. They will be removed. These are the cells:")
  print(to_remove)
  seur.integrated@meta.data$keep <- TRUE
  seur.integrated@meta.data[to_remove, "keep"] <- FALSE
  seur.integrated <- subset(seur.integrated, keep)
  seur.integrated@meta.data <- seur.integrated@meta.data[, -c(which(colnames(seur.integrated@meta.data) == "keep"))]
}

and re-running the ScaleData function.

I am just wondering if this is an acceptable approach? I only needed to remove 3 cells in my case, although the exact cells (and number) seems to have a random component -- it varies if I re-run the IntegrateData function. The number of cells I need to remove is thus very low and not of particular consequence for my analysis. However, I wanted to make sure the presence of NA values in the integrated assay is not indicative of another underlying problem that would raise concerns about the validity of the results.

Here is my package and version info:

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Seurat_3.1.4

loaded via a namespace (and not attached):
  [1] tsne_0.1-3          nlme_3.1-143        bitops_1.0-6        RcppAnnoy_0.0.16    RColorBrewer_1.1-2  httr_1.4.1          numDeriv_2016.8-1.1 sctransform_0.2.1  
  [9] tools_3.6.2         R6_2.4.1            irlba_2.3.3         KernSmooth_2.23-16  uwot_0.1.8          BiocGenerics_0.32.0 lazyeval_0.2.2      colorspace_1.4-1   
 [17] sn_1.5-5            withr_2.1.2         npsurv_0.4-0        gridExtra_2.3       tidyselect_1.0.0    mnormt_1.5-6        compiler_3.6.2      cli_2.0.2          
 [25] Biobase_2.46.0      TFisher_0.2.0       plotly_4.9.2        sandwich_2.5-1      labeling_0.3        caTools_1.18.0      scales_1.1.0        lmtest_0.9-37      
 [33] mvtnorm_1.1-0       ggridges_0.5.2      pbapply_1.4-2       rappdirs_0.3.1      stringr_1.4.0       digest_0.6.25       pkgconfig_2.0.3     htmltools_0.4.0    
 [41] sessioninfo_1.1.1   bibtex_0.4.2.2      plotrix_3.7-7       htmlwidgets_1.5.1   rlang_0.4.5         rstudioapi_0.11     farver_2.0.3        zoo_1.8-7          
 [49] jsonlite_1.6.1      ica_1.0-2           gtools_3.8.1        dplyr_0.8.5         magrittr_1.5        patchwork_1.0.0     Matrix_1.2-18       fansi_0.4.1        
 [57] Rcpp_1.0.3          munsell_0.5.0       ape_5.3             reticulate_1.14     lifecycle_0.2.0     stringi_1.4.6       multcomp_1.4-12     gbRd_0.4-11        
 [65] MASS_7.3-51.5       gplots_3.0.3        Rtsne_0.15          plyr_1.8.6          grid_3.6.2          parallel_3.6.2      gdata_2.18.0        listenv_0.8.0      
 [73] ggrepel_0.8.2       crayon_1.3.4        lattice_0.20-38     cowplot_1.0.0       splines_3.6.2       multtest_2.42.0     knitr_1.28          pillar_1.4.3       
 [81] igraph_1.2.4.2      reshape2_1.4.3      future.apply_1.4.0  codetools_0.2-16    stats4_3.6.2        leiden_0.3.3        mutoss_0.1-12       glue_1.3.2         
 [89] lsei_1.2-0          metap_1.3           data.table_1.12.8   png_0.1-7           vctrs_0.2.4         Rdpack_0.11-1       tidyr_1.0.2         gtable_0.3.0       
 [97] RANN_2.6.1          purrr_0.3.3         future_1.16.0       assertthat_0.2.1    ggplot2_3.3.0       xfun_0.12           rsvd_1.0.3          viridisLite_0.3.0  
[105] survival_3.1-8      tibble_2.1.3        cluster_2.1.0       globals_0.12.5      fitdistrplus_1.0-14 TH.data_1.0-10      ROCR_1.0-7 
satijalab commented 4 years ago

Thanks for sharing this. This is a bug we don't understand - but does seem to be popping up for multiple users.

We'd like to look into this in more detail.

  1. Are you able to reproduce it with just two datasets (or anything less than the full set)? That woudl make it easier to debug.
  2. Would you be comfortable sharing your data so we can take a look?

Thanks!

shalvorsen-ccib commented 4 years ago

Hello,

1.) I am not able to reproduce the error with a significantly reduced dataset. I tried downsampling the dataset by about 50%, and the issue did not reoccur. It's possible that somewhere between 50% and 100% of the full dataset would still result in the issue described, but I have not done an iterative procedure to determine the minimum number of cells required to reproduce the issue.

2.) Sure, I would be able to privately provide you a de-identified dataset. Please let me know how you'd like me to share it.

Thanks very much!

davidfstein commented 4 years ago

I am running into the same bug though I am attempting to integrate data from three different species. I was wondering if any progress has been made on this issue?

timoast commented 4 years ago

@stefanhal you can email a download link for the data (dropbox, google drive, etc) to tstuart@nygenome.org

timoast commented 4 years ago

Closing this as I have not received any email

kuangyuyen commented 3 years ago

I have the same issue. I can send you the data frame and associated code. Please advise which email I can send those files to you. Thanks a lot

e-arslan commented 5 months ago

Any solution for this?