satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.27k stars 910 forks source link

Filtering based on QV threshold using 'ReadXenium' and 'LoadXenium' does not work #8516

Open An17aV0 opened 8 months ago

An17aV0 commented 8 months ago

I am trying to filter my Xenium data based on the quality values (QV) of the transcripts using "ReadXenium" and "mols.qv.threshold". However, if I filter with 2 different QV (e.g. high and low), the gene expression values do not change between the two datasets. Please also advise how I specify the QV threshold when generating the SeuratObject using 'LoadXenium', given that 'LoadXenium' is a wrapper of 'ReadXenium'?

The following code reads the same public dataset with 'ReadXenium' and two different thresholds and then correlates the data. The result is that the correlation coefficient is the same (1) for each gene which is unexpected if the filtering would work. Please advise how to correct this.

# Load data
wget https://cf.10xgenomics.com/samples/xenium/1.0.2/Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP/Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP_outs.zip
unzip Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP_outs.zip

#   Read Data with high qv threshold
test.read35<- ReadXenium(data.dir="Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP_outs",
  outs = c("matrix", "microns"),
  type = "centroids",
  mols.qv.threshold = 35)

# Read Data with low qv threshold
test.read0<- ReadXenium( data.dir="Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP_outs",
  outs = c("matrix", "microns"),
  type = "centroids",
  mols.qv.threshold = 10)

# compare the 2 seurat objects 
raw_counts1 <- test.read35$matrix$`Gene Expression`
raw_counts2 <- test.read0$matrix$`Gene Expression`

# Check dimensions to ensure compatibility
if (!all(dim(raw_counts1) == dim(raw_counts2))) {
  stop("Raw count matrices have different dimensions.")
}

# Compare raw counts for each gene
gene_names <- rownames(raw_counts1)
comparison_result <- sapply(gene_names, function(gene) {
  counts1 <- raw_counts1[gene, ]
  counts2 <- raw_counts2[gene, ]
  correlation <- cor(counts1, counts2)
  return(correlation)
})

print(comparison_result)   # they are currently all 1

My sessionInfo:

R version 4.3.2 (2023-10-31) Platform: x86_64-apple-darwin20 (64-bit) Running under: macOS Ventura 13.6.3

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

tzcode source: internal

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] scales_1.3.0 readxl_1.4.3 RColorBrewer_1.1-3 reshape2_1.4.4 Signac_1.12.0 magrittr_2.0.3 ggpubr_0.6.0
[8] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[15] ggplot2_3.4.4 tidyverse_2.0.0 sf_1.0-15 hdf5r_1.3.8 patchwork_1.2.0 plyr_1.8.9 dplyr_1.1.4
[22] Seurat_5.0.1 SeuratObject_5.0.1 sp_2.1-2 Rcpp_1.0.12

loaded via a namespace (and not attached): [1] RcppAnnoy_0.0.21 splines_4.3.2 later_1.3.2 bitops_1.0-7 cellranger_1.1.0
[6] R.oo_1.26.0 polyclip_1.10-6 fastDummies_1.7.3 lifecycle_1.0.4 rstatix_0.7.2
[11] rprojroot_2.0.4 globals_0.16.2 processx_3.8.2 lattice_0.21-9 MASS_7.3-60
[16] backports_1.4.1 limma_3.56.2 plotly_4.10.4 remotes_2.4.2.1 httpuv_1.6.14
[21] glmGamPoi_1.12.2 sctransform_0.4.1 spam_2.10-0 sessioninfo_1.2.2 pkgbuild_1.4.2
[26] spatstat.sparse_3.0-3 reticulate_1.35.0 cowplot_1.1.2 pbapply_1.7-2 DBI_1.2.2
[31] zlibbioc_1.48.0 abind_1.4-5 pkgload_1.3.3 GenomicRanges_1.54.1 Rtsne_0.17
[36] presto_1.0.0 R.utils_2.12.3 RCurl_1.98-1.14 BiocGenerics_0.48.1 GenomeInfoDbData_1.2.11
[41] IRanges_2.36.0 S4Vectors_0.40.2 ggrepel_0.9.5 irlba_2.3.5.1 listenv_0.9.0
[46] spatstat.utils_3.0-4 units_0.8-5 goftest_1.2-3 RSpectra_0.16-1 spatstat.random_3.2-2
[51] fitdistrplus_1.1-11 parallelly_1.36.0 DelayedMatrixStats_1.22.1 DelayedArray_0.28.0 RcppRoll_0.3.0
[56] leiden_0.4.3.1 codetools_0.2-19 tidyselect_1.2.0 farver_2.1.1 stats4_4.3.2
[61] matrixStats_1.2.0 spatstat.explore_3.2-5 jsonlite_1.8.8 e1071_1.7-13 ellipsis_0.3.2
[66] progressr_0.14.0 ggridges_0.5.5 survival_3.5-7 systemfonts_1.0.4 tools_4.3.2
[71] ragg_1.2.5 ica_1.0-3 glue_1.7.0 SparseArray_1.2.4 gridExtra_2.3
[76] MatrixGenerics_1.14.0 usethis_2.2.2 GenomeInfoDb_1.38.6 withr_3.0.0 fastmap_1.1.1
[81] fansi_1.0.6 callr_3.7.3 digest_0.6.34 timechange_0.3.0 R6_2.5.1
[86] mime_0.12 textshaping_0.3.6 colorspace_2.1-0 scattermore_1.2 tensor_1.5
[91] spatstat.data_3.0-4 R.methodsS3_1.8.2 utf8_1.2.4 generics_0.1.3 data.table_1.15.0
[96] class_7.3-22 S4Arrays_1.2.0 prettyunits_1.2.0 httr_1.4.7 htmlwidgets_1.6.4
[101] uwot_0.1.16 pkgconfig_2.0.3 gtable_0.3.4 lmtest_0.9-40 XVector_0.42.0
[106] htmltools_0.5.7 carData_3.0-5 profvis_0.3.8 dotCall64_1.1-1 Biobase_2.62.0
[111] png_0.1-8 rstudioapi_0.15.0 tzdb_0.4.0 nlme_3.1-163 curl_5.2.0
[116] proxy_0.4-27 cachem_1.0.8 zoo_1.8-12 KernSmooth_2.23-22 parallel_4.3.2
[121] miniUI_0.1.1.1 desc_1.4.2 pillar_1.9.0 grid_4.3.2 vctrs_0.6.5
[126] RANN_2.6.1 urlchecker_1.0.1 promises_1.2.1 car_3.1-2 xtable_1.8-4
[131] cluster_2.1.4 Rsamtools_2.18.0 cli_3.6.2 compiler_4.3.2 rlang_1.1.3
[136] crayon_1.5.2 future.apply_1.11.1 ggsignif_0.6.4 labeling_0.4.3 classInt_0.4-10
[141] ps_1.7.5 fs_1.6.3 stringi_1.8.3 BiocParallel_1.36.0 viridisLite_0.4.2
[146] deldir_2.0-2 Biostrings_2.70.2 munsell_0.5.0 lazyeval_0.2.2 devtools_2.4.5
[151] spatstat.geom_3.2-7 Matrix_1.6-5 RcppHNSW_0.5.0 hms_1.1.3 sparseMatrixStats_1.12.2
[156] bit64_4.0.5 future_1.33.1 shiny_1.8.0 SummarizedExperiment_1.32.0 ROCR_1.0-11
[161] igraph_1.6.0 broom_1.0.5 memoise_2.0.1 fastmatch_1.1-4 bit_4.0.5

alikhuseynov commented 8 months ago

mols.qv.threshold this arg only filters the molecule (transcript) coordinates not the count matrix. https://github.com/satijalab/seurat/blob/develop/R/preprocessing.R#L2215 The count matrix is already filtered by default (QC > 20) in Xenium pipeline. So your count matrices should be identical -> identical(raw_counts1, raw_counts2) # should return TRUE

now, if you want to filter out some genes given selected mols.qv.threshold value, then use the gene variable from molecule coord dataframe and subset the count matrix or object.

# say this is the mol coords dataframe
test.read35$microns %>% str
features35 <- test.read35$microns$gene %>% unique()

# check that it includes only targeted probes/genes (if that's what you are focusing), ie no other NegativeCode etc..
features35 %>% str
features35 <- grep("NegControlProbe|Blank|Code", features35, invert = TRUE, value = TRUE)

# use this to filter matrix 
mat35 <- test.read35$matrix$`Gene Expression`[features35, ]
# or on object
obj %<>% subset(features = features35)

hope this works.