satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.27k stars 910 forks source link

V5 Assay Variable Features are not ordered #7829

Closed samuel-marsh closed 1 year ago

samuel-marsh commented 1 year ago

Hi Seurat Team,

Not sure I'd have time currently to create PR or exactly how you all want to handle it now that the information is being stored slightly differently so just posting this as issue.

When using V5 assay the variable features are not stored ordered in terms of their standardized variance. This subsequently can cause issues when trying to quickly ascertain top variable features or to label VariableFeaturePlot without first sorting the information.

Reproducible example here:

library(Seurat)

# Std Assay object
pbmc <- pbmc3k.SeuratData::pbmc3k
pbmc <- UpdateSeuratObject(pbmc)

# V5 Assay Object
pbmc_mat <- pbmc@assays$RNA$counts
options(Seurat.object.assay.version = "v5")
pbmc_V5 <- CreateSeuratObject(counts = pbmc_mat)

# Normalize and find features
pbmc <- NormalizeData(pbmc)
pbmc_V5 <- NormalizeData(pbmc_V5)

pbmc <- FindVariableFeatures(pbmc)
pbmc_V5 <- FindVariableFeatures(pbmc_V5)

# Extract top features
top10 <- head(VariableFeatures(pbmc), 10)
top10_V5 <- head(VariableFeatures(pbmc_V5), 10)

# plot
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2

plot3 <- VariableFeaturePlot(pbmc_V5)
plot4 <- LabelPoints(plot = plot3, points = top10_V5, repel = TRUE)
plot3 + plot4

image image

sessionInfo() output ```r > sessionInfo() R version 4.3.1 Patched (2023-09-04 r85065) Platform: x86_64-apple-darwin20 (64-bit) Running under: macOS Monterey 12.6.8 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 time zone: America/New_York tzcode source: internal attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] scCustomize_1.9.9.9014 Seurat_4.9.9.9060 SeuratObject_4.9.9.9091 sp_2.0-0 loaded via a namespace (and not attached): [1] RcppAnnoy_0.0.21 splines_4.3.1 later_1.3.1 bitops_1.0-7 R.oo_1.25.0 tibble_3.2.1 [7] polyclip_1.10-4 janitor_2.2.0 fastDummies_1.7.3 lifecycle_1.0.3 rstatix_0.7.2 globals_0.16.2 [13] lattice_0.21-8 MASS_7.3-60 backports_1.4.1 magrittr_2.0.3 plotly_4.10.2 remotes_2.4.2.1 [19] httpuv_1.6.11 sctransform_0.3.5 flexmix_2.3-19 spam_2.9-1 spatstat.sparse_3.0-2 reticulate_1.32.0 [25] cowplot_1.1.1 pbapply_1.7-2 RColorBrewer_1.1-3 lubridate_1.9.2 zlibbioc_1.46.0 abind_1.4-5 [31] Rtsne_0.16 GenomicRanges_1.52.0 R.utils_2.12.2 purrr_1.0.2 BiocGenerics_0.46.0 RCurl_1.98-1.12 [37] nnet_7.3-19 circlize_0.4.15 GenomeInfoDbData_1.2.10 IRanges_2.34.1 S4Vectors_0.38.1 ggrepel_0.9.3 [43] irlba_2.3.5.1 listenv_0.9.0 spatstat.utils_3.0-3 SeuratWrappers_0.3.19 goftest_1.2-3 RSpectra_0.16-1 [49] spatstat.random_3.1-6 fitdistrplus_1.1-11 parallelly_1.36.0 pbmc3k.SeuratData_3.1.4 DelayedMatrixStats_1.22.6 leiden_0.4.3 [55] codetools_0.2-19 DelayedArray_0.26.7 scuttle_1.10.2 RApiSerialize_0.1.2 tidyselect_1.2.0 shape_1.4.6 [61] farver_2.1.1 matrixStats_1.0.0 stats4_4.3.1 spatstat.explore_3.2-3 jsonlite_1.8.7 ellipsis_0.3.2 [67] progressr_0.14.0 ggridges_0.5.4 survival_3.5-7 tools_4.3.1 ica_1.0-3 Rcpp_1.0.11 [73] glue_1.6.2 gridExtra_2.3 MatrixGenerics_1.12.3 GenomeInfoDb_1.36.3 dplyr_1.1.3 withr_2.5.0 [79] BiocManager_1.30.22 fastmap_1.1.1 fansi_1.0.4 rsvd_1.0.5 digest_0.6.33 timechange_0.2.0 [85] R6_2.5.1 mime_0.12 ggprism_1.0.4 colorspace_2.1-0 scattermore_1.2 tensor_1.5 [91] spatstat.data_3.0-1 R.methodsS3_1.8.2 utf8_1.2.3 tidyr_1.3.0 generics_0.1.3 data.table_1.14.8 [97] httr_1.4.7 htmlwidgets_1.6.2 S4Arrays_1.0.6 uwot_0.1.16 pkgconfig_2.0.3 gtable_0.3.4 [103] modeltools_0.2-23 lmtest_0.9-40 SingleCellExperiment_1.22.0 XVector_0.40.0 htmltools_0.5.6 carData_3.0-5 [109] dotCall64_1.0-2 Biobase_2.60.0 scales_1.2.1 png_0.1-8 snakecase_0.11.1 rstudioapi_0.15.0 [115] reshape2_1.4.4 nlme_3.1-163 zoo_1.8-12 GlobalOptions_0.1.2 stringr_1.5.0 KernSmooth_2.23-22 [121] parallel_4.3.1 miniUI_0.1.1.1 vipor_0.4.5 ggrastr_1.0.2 pillar_1.9.0 grid_4.3.1 [127] vctrs_0.6.3 RANN_2.6.1 promises_1.2.1 ggpubr_0.6.0 car_3.1-2 stringfish_0.15.8 [133] beachmat_2.16.0 xtable_1.8-4 cluster_2.1.4 beeswarm_0.4.0 paletteer_1.5.0 cli_3.6.1 [139] compiler_4.3.1 rlang_1.1.1 crayon_1.5.2 future.apply_1.11.0 ggsignif_0.6.4 labeling_0.4.3 [145] rematch2_2.1.2 plyr_1.8.8 forcats_1.0.0 ggbeeswarm_0.7.2 stringi_1.7.12 viridisLite_0.4.2 [151] deldir_1.0-9 BiocParallel_1.34.2 munsell_0.5.0 lazyeval_0.2.2 spatstat.geom_3.2-5 Matrix_1.6-1 [157] RcppHNSW_0.4.1 patchwork_1.1.3 sparseMatrixStats_1.12.2 future_1.33.0 ggplot2_3.4.3 shiny_1.7.5 [163] SummarizedExperiment_1.30.2 qs_0.25.5 ROCR_1.0-11 igraph_1.5.1 broom_1.0.5 RcppParallel_5.1.7 ```

Thanks as always for all that you do!!

Best, Sam

samuel-marsh commented 1 year ago

Hi Seurat Team,

I have also discovered a couple of other bugs relating to VariableFeatures and V5 assays. I believe these two are probably related but listing them separately.

Here is the code run that results in both issues:

# MEAN.VAR.PLOT ONLY
pbmc_mat <- pbmc@assays$RNA$counts
options(Seurat.object.assay.version = "v5")
pbmc_V5 <- CreateSeuratObject(counts = pbmc_mat)
pbmc_V5 <- NormalizeData(pbmc_V5)
pbmc_V5 <- FindVariableFeatures(object = pbmc_V5, selection.method = "mean.var.plot", verbose = TRUE)
VariableFeaturePlot(pbmc_V5)
var_meanV5 <- VariableFeatures(pbmc_V5)
pbmc_V5 <- ScaleData(pbmc_V5)
pbmc_V5 <- RunPCA(pbmc_V5)
pbmc_V5 <- FindNeighbors(pbmc_V5, dims = 1:10)
pbmc_V5 <- FindClusters(pbmc_V5)
pbmc_V5 <- RunUMAP(pbmc_V5, dims = 1:10)
p1_mean_var <- DimPlot(pbmc_V5) + ggtitle("mean.var.plot only")

# VST ONLY
pbmc_mat <- pbmc@assays$RNA$counts
options(Seurat.object.assay.version = "v5")
pbmc_V5 <- CreateSeuratObject(counts = pbmc_mat)
pbmc_V5 <- FindVariableFeatures(pbmc_V5, nfeatures = 500, selection.method = "vst")
VariableFeaturePlot(pbmc_V5)
var_vstV5 <- VariableFeatures(pbmc_V5)
pbmc_V5 <- ScaleData(pbmc_V5)
pbmc_V5 <- RunPCA(pbmc_V5)
pbmc_V5 <- FindNeighbors(pbmc_V5, dims = 1:10)
pbmc_V5 <- FindClusters(pbmc_V5)
pbmc_V5 <- RunUMAP(pbmc_V5, dims = 1:10)
p1_vst <- DimPlot(pbmc_V5) + ggtitle("vst only")

# BOTH
pbmc_mat <- pbmc@assays$RNA$counts
options(Seurat.object.assay.version = "v5")
pbmc_V5 <- CreateSeuratObject(counts = pbmc_mat)
pbmc_V5 <- NormalizeData(pbmc_V5)
pbmc_V5 <- FindVariableFeatures(object = pbmc_V5, selection.method = "mean.var.plot", verbose = TRUE)
pbmc_V5 <- FindVariableFeatures(pbmc_V5, nfeatures = 500, selection.method = "vst")
var_both <- VariableFeatures(pbmc_V5)
pbmc_V5 <- ScaleData(pbmc_V5)
pbmc_V5 <- RunPCA(pbmc_V5)
pbmc_V5 <- FindNeighbors(pbmc_V5, dims = 1:10)
pbmc_V5 <- FindClusters(pbmc_V5)
pbmc_V5 <- RunUMAP(pbmc_V5, dims = 1:10)
p1_meanvar_vst <- DimPlot(pbmc_V5) + ggtitle("both")

1. VariableFeaturePlot does not update to most recently run variable feature method if two methods are run. After running the "BOTH" scenario the VariableFeaturePlot fails to update to correct axes and number of genes:

VariableFeaturePlot(object = pbmc_V5)

image

Even if you manually supply the selection method that appears to be ignored:

VariableFeaturePlot(object = pbmc_V5, selection.method = "vst")

image

2a. While plot reports incorrect number of variable features, VariableFeatures() reports correct number.

> length(var_both)
[1] 500

2b. But the features reported are not the same as when run alone.

> all(var_vst %in% var_both)
[1] FALSE

3. Downstream processes do not use only the last iteration of FindVariableFeatures. The resulting UMAP plots look like this: image

So not only are the UMAPs different (although overall fairly similar) but clustering is different too.

So it appears that downstream functions are not properly utilizing the most recent information from FindVariableFeatures.

Best, Sam

samuel-marsh commented 1 year ago

Hi,

This also has potential to cause issues upon object subset and running new pipeline if variable feature method is changed since OBJECT@assays$RNA@meta.data is maintained (even after DietSeurat). I understand why it's maintained (since you can store other info there) but since variable feature selection is not being wholly respected it causes issues.

Best, Sam

Gesmira commented 1 year ago

Hi Sam, Thank you so much for bringing this to our attention and the thorough reproducible example. We've implemented a fix which should be pushed to the public seurat5 branches soon at which point I'll respond here. Best, Gesi

Gesmira commented 1 year ago

Hi Sam, We just pushed a few updates to the "seurat5" branch where this issue is fixed. Let me know if you run into any further issues with it as we refactored a lot of the VariableFeatures code. Hopefully should be all good now though!

erzakiev commented 5 months ago

Just to confirm, the vector of variable features, returned by VariableFeatures(seuratObj) is sorted in decreasing order of variance, right?

elifozcelik commented 2 weeks ago

@erzakiev hey were you able to figure out this question's answer?