greenelab / miQC

Flexible, probablistic metrics for quality control of scRNA-seq data
BSD 3-Clause "New" or "Revised" License
18 stars 1 forks source link

Use case when flexmix returns one component rather than two #2

Closed diegoalexespi closed 3 years ago

diegoalexespi commented 3 years ago

Hi, thanks for the great tool! I've ran into an issue when attempting to use this with some of my samples. There are cases were the flexmix model returns only one component rather than the specified two components. It seems to happen when the cells with the highest mitochondrial proportions also happen to be the cells with the lowest molecule counts. Reproducible example below.

>set.seed(2021)
>suppressPackageStartupMessages({
    library(SingleCellExperiment)
    library(scRNAseq)
    library(scater)
    library(flexmix)
    library(splines)
    library(BiocParallel)
    library(miQC)
})
>sce <- ZeiselBrainData()
>mt_genes <- grepl("^mt-",  rownames(sce))
>feature_ctrls <- list(mito = rownames(sce)[mt_genes])
>sce <- addPerCellQC(sce, subsets = feature_ctrls, BPPARAM = MulticoreParam())
>model <- mixtureModel(sce)
>model
Call:
flexmix(formula = subsets_mito_percent ~ detected, data = metrics, k = 2)

Cluster sizes:
   1    2 
2635  370 

convergence after 40 iterations
>plotModel(sce, model) #works

image


#here simulate alternative data more in line with some of my samples
>replacement_mt <- rnorm(mean = 5, sd = 2, n = ncol(sce))
>sce$subsets_mito_percent <- ifelse(sce$detected > 1100, replacement_mt, sce$subsets_mito_percent)
>replacement_mt <- runif(min = 0, max = 60, n = ncol(sce))
>sce$subsets_mito_percent <- ifelse(sce$detected < 1100, replacement_mt, sce$subsets_mito_percent)
>metrics <- as.data.frame(colData(sce))
>ggplot(metrics,aes(x=detected,y=subsets_mito_percent)) + geom_point()

image

>model <- mixtureModel(sce)
>model

Call:
flexmix(formula = subsets_mito_percent ~ detected, data = metrics, k = 2)

Cluster sizes:
   1 
3005 

convergence after 8 iterations
>plotModel(sce, model) #doesn't work
Error in predictions[, 2] : subscript out of bounds

I think this arises from the code snippet below: https://github.com/greenelab/miQC/blob/b13d62ee1f332c83694ddadf4edefb0a095453e5/R/plotModel.R#L55-L56

which is similar to this one as well: https://github.com/greenelab/miQC/blob/b13d62ee1f332c83694ddadf4edefb0a095453e5/R/filterCells.R#L53-L54

happy to provide more info!

>sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] splines   parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] miQC_0.99.5                 BiocParallel_1.24.1         flexmix_2.3-17              lattice_0.20-41            
 [5] scater_1.18.6               ggplot2_3.3.3               scRNAseq_2.4.0              SingleCellExperiment_1.12.0
 [9] SummarizedExperiment_1.20.0 Biobase_2.50.0              GenomicRanges_1.42.0        GenomeInfoDb_1.26.4        
[13] IRanges_2.24.1              S4Vectors_0.28.1            BiocGenerics_0.36.0         MatrixGenerics_1.2.1       
[17] matrixStats_0.58.0         

loaded via a namespace (and not attached):
  [1] ggbeeswarm_0.6.0              colorspace_2.0-0              ellipsis_0.3.1                modeltools_0.2-23            
  [5] scuttle_1.0.4                 XVector_0.30.0                BiocNeighbors_1.8.2           rstudioapi_0.13              
  [9] farver_2.1.0                  bit64_4.0.5                   interactiveDisplayBase_1.28.0 AnnotationDbi_1.52.0         
 [13] fansi_0.4.2                   xml2_1.3.2                    sparseMatrixStats_1.2.1       cachem_1.0.4                 
 [17] knitr_1.31                    Rsamtools_2.6.0               dbplyr_2.1.0                  shiny_1.6.0                  
 [21] BiocManager_1.30.10           compiler_4.0.3                httr_1.4.2                    assertthat_0.2.1             
 [25] Matrix_1.3-2                  fastmap_1.1.0                 lazyeval_0.2.2                later_1.1.0.1                
 [29] BiocSingular_1.6.0            htmltools_0.5.1.1             prettyunits_1.1.1             tools_4.0.3                  
 [33] rsvd_1.0.3                    gtable_0.3.0                  glue_1.4.2                    GenomeInfoDbData_1.2.4       
 [37] dplyr_1.0.5                   rappdirs_0.3.3                Rcpp_1.0.6                    vctrs_0.3.6                  
 [41] Biostrings_2.58.0             ExperimentHub_1.16.0          rtracklayer_1.50.0            DelayedMatrixStats_1.12.3    
 [45] xfun_0.22                     stringr_1.4.0                 beachmat_2.6.4                mime_0.10                    
 [49] lifecycle_1.0.0               irlba_2.3.3                   ensembldb_2.14.0              XML_3.99-0.5                 
 [53] AnnotationHub_2.22.0          zlibbioc_1.36.0               scales_1.1.1                  hms_1.0.0                    
 [57] promises_1.2.0.1              ProtGenerics_1.22.0           AnnotationFilter_1.14.0       yaml_2.2.1                   
 [61] curl_4.3                      memoise_2.0.0                 gridExtra_2.3                 biomaRt_2.46.3               
 [65] stringi_1.5.3                 RSQLite_2.2.3                 BiocVersion_3.12.0            GenomicFeatures_1.42.1       
 [69] rlang_0.4.10                  pkgconfig_2.0.3               bitops_1.0-6                  evaluate_0.14                
 [73] purrr_0.3.4                   GenomicAlignments_1.26.0      labeling_0.4.2                bit_4.0.4                    
 [77] tidyselect_1.1.0              magrittr_2.0.1                R6_2.5.0                      generics_0.1.0               
 [81] DelayedArray_0.16.2           DBI_1.1.1                     pillar_1.5.1                  withr_2.4.1                  
 [85] RCurl_1.98-1.2                nnet_7.3-15                   tibble_3.1.0                  crayon_1.4.1                 
 [89] utf8_1.1.4                    BiocFileCache_1.14.0          rmarkdown_2.7                 viridis_0.5.1                
 [93] progress_1.2.2                grid_4.0.3                    blob_1.2.1                    digest_0.6.27                
 [97] xtable_1.8-4                  httpuv_1.5.5                  openssl_1.4.3                 munsell_0.5.0                
[101] beeswarm_0.3.1                viridisLite_0.3.0             vipor_0.4.5                   askpass_1.1      
arielah commented 3 years ago

Apologies for the wait, @diegoalexespi. Thanks for bringing this issue to my attention. Unfortunately, there's no good catchall way to ensure flexmix returns two components for each dataset. miQC is designed to work on a variety of datasets, but one crucial assumption of miQC is that there's an inverse relationship between % mitochondrial reads and number of genes in the compromised cells. If that assumption isn't met, we'd suggest a different method of filtering, such as a simple mitochondrial % cutoff or filtering outliers based on median absolute deviation. In the example you posted, I'd say a basic cutoff of cells above say 15% would suffice.

Changes I've made to the package to reflect this:

diegoalexespi commented 3 years ago

fantastic, thank you so much! I will work on implementing this tool as a Seurat Wrapper soon; I will keep you posted