satijalab / seurat-object

https://satijalab.github.io/seurat-object/
Other
22 stars 26 forks source link

`subset(features = ...)` yields `Error in .subscript.2ary(x, i, j, drop = TRUE) : subscript out of bounds` after SCTransform #208

Open mschilli87 opened 3 months ago

mschilli87 commented 3 months ago
R
R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
library(Seurat)
Loading required package: SeuratObject
Loading required package: sp

Attaching package: ‘SeuratObject’

The following object is masked from ‘package:base’:

    intersect
library(SeuratData)
── Installed datasets ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── SeuratData v0.2.2.9001 ──
✔ pbmc3k 3.1.4

─────────────────────────────────────────────────────────────────────────────────────── Key ───────────────────────────────────────────────────────────────────────────────────────
✔ Dataset loaded successfully
❯ Dataset built with a newer version of Seurat than installed
❓ Unknown version of Seurat installed
#InstallData("pbmc3k")

sobj <- LoadData("pbmc3k")
Validating object structure
Updating object slots
Ensuring keys are in the proper structure
Warning: Assay RNA changing from Assay to Assay
Ensuring keys are in the proper structure
Ensuring feature names don't have underscores or pipes
Updating slots in RNA
Validating object structure for Assay ‘RNA’
Object representation is consistent with the most current Seurat version
Warning: Assay RNA changing from Assay to Assay5
subset(sobj, features=rownames(sobj))
An object of class Seurat
13714 features across 2700 samples within 1 assay
Active assay: RNA (13714 features, 0 variable features)
 2 layers present: counts, data
subset(sobj, features=head(rownames(sobj)))
An object of class Seurat
6 features across 2700 samples within 1 assay
Active assay: RNA (6 features, 0 variable features)
 2 layers present: counts, data
sobj <- SCTransform(sobj, vars.to.regress = "nCount_RNA", verbose = FALSE)
subset(sobj, features=rownames(sobj))
An object of class Seurat
25144 features across 2700 samples within 2 assays
Active assay: SCT (12572 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
> subset(sobj, features=head(rownames(sobj)))
subset(sobj, features=head(rownames(sobj)))
Error in .subscript.2ary(x, i, j, drop = TRUE) : subscript out of bounds
sobj <- SCTransform(sobj, vars.to.regress = "nCount_RNA", verbose = FALSE, return.only.var.genes = FALSE)
subset(sobj, features=rownames(sobj))
An object of class Seurat
25144 features across 2700 samples within 2 assays
Active assay: SCT (12572 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
subset(sobj, features=head(rownames(sobj)))
Error in .subscript.2ary(x, i, j, drop = TRUE) : subscript out of bounds

~So for some reason, SCT is dropping features that are still returned as rownames even thought SCT is the active assay. On top, subset fails with a cryptic error message instead of handling this sitatuion gracefully (e.g. by returning a Seurat object with all requested features that are present in the SCT alongside a warning why (and which) features had to be dropped and what to do to fix it). Moreover,~ I ran into this when trying to extract normalized counts for genes that I found differentially expressed in a downstream analysis, so those are not just some weird all-zero count edge cases or such.

sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.04

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=C.UTF-8           LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Madrid
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pbmc3k.SeuratData_3.1.4 SeuratData_0.2.2.9001   Seurat_5.0.3
[4] SeuratObject_5.0.1      sp_2.1-3

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3          jsonlite_1.8.8
  [3] magrittr_2.0.3              spatstat.utils_3.0-4
  [5] zlibbioc_1.48.0             vctrs_0.6.5
  [7] ROCR_1.0-11                 DelayedMatrixStats_1.24.0
  [9] spatstat.explore_3.2-6      RCurl_1.98-1.14
 [11] S4Arrays_1.2.0              htmltools_0.5.7
 [13] SparseArray_1.2.4           sctransform_0.4.1
 [15] parallelly_1.37.1           KernSmooth_2.23-22
 [17] htmlwidgets_1.6.4           ica_1.0-3
 [19] plyr_1.8.9                  plotly_4.10.4
 [21] zoo_1.8-12                  igraph_2.0.3
 [23] mime_0.12                   lifecycle_1.0.4
 [25] pkgconfig_2.0.3             Matrix_1.6-5
 [27] R6_2.5.1                    fastmap_1.1.1
 [29] GenomeInfoDbData_1.2.11     MatrixGenerics_1.14.0
 [31] fitdistrplus_1.1-11         future_1.33.1
 [33] shiny_1.8.0                 digest_0.6.34
 [35] colorspace_2.1-0            patchwork_1.2.0
 [37] S4Vectors_0.40.2            tensor_1.5
 [39] RSpectra_0.16-1             irlba_2.3.5.1
 [41] GenomicRanges_1.54.1        progressr_0.14.0
 [43] fansi_1.0.6                 spatstat.sparse_3.0-3
 [45] httr_1.4.7                  polyclip_1.10-6
 [47] abind_1.4-5                 compiler_4.3.3
 [49] fastDummies_1.7.3           MASS_7.3-60.0.1
 [51] DelayedArray_0.28.0         rappdirs_0.3.3
 [53] tools_4.3.3                 lmtest_0.9-40
 [55] httpuv_1.6.14               future.apply_1.11.1
 [57] goftest_1.2-3               glmGamPoi_1.14.3
 [59] glue_1.7.0                  nlme_3.1-164
 [61] promises_1.2.1              grid_4.3.3
 [63] Rtsne_0.17                  cluster_2.1.6
 [65] reshape2_1.4.4              generics_0.1.3
 [67] gtable_0.3.4                spatstat.data_3.0-4
 [69] tidyr_1.3.1                 data.table_1.15.2
 [71] XVector_0.42.0              utf8_1.2.4
 [73] BiocGenerics_0.48.1         spatstat.geom_3.2-9
 [75] RcppAnnoy_0.0.22            ggrepel_0.9.5
 [77] RANN_2.6.1                  pillar_1.9.0
 [79] stringr_1.5.1               spam_2.10-0
 [81] RcppHNSW_0.6.0              later_1.3.2
 [83] splines_4.3.3               dplyr_1.1.4
 [85] lattice_0.22-5              survival_3.5-8
 [87] deldir_2.0-4                tidyselect_1.2.0
 [89] miniUI_0.1.1.1              pbapply_1.7-2
 [91] gridExtra_2.3               IRanges_2.36.0
 [93] SummarizedExperiment_1.32.0 scattermore_1.2
 [95] stats4_4.3.3                Biobase_2.62.0
 [97] matrixStats_1.2.0           stringi_1.8.3
 [99] lazyeval_0.2.2              codetools_0.2-19
[101] tibble_3.2.1                cli_3.6.2
[103] uwot_0.1.16                 xtable_1.8-4
[105] reticulate_1.35.0           munsell_0.5.0
[107] Rcpp_1.0.12                 GenomeInfoDb_1.38.6
[109] globals_0.16.2              spatstat.random_3.2-3
[111] png_0.1-8                   parallel_4.3.3
[113] ellipsis_0.3.2              ggplot2_3.5.0
[115] dotCall64_1.1-1             sparseMatrixStats_1.14.0
[117] bitops_1.0-7                listenv_0.9.1
[119] viridisLite_0.4.2           scales_1.3.0
[121] ggridges_0.5.6              leiden_0.4.3.1
[123] purrr_1.0.2                 crayon_1.5.2
[125] rlang_1.1.3                 cowplot_1.1.3
mschilli87 commented 3 months ago

It gets even weirder:

nrow(sobj)
12572
nrow(GetAssayData(sobj)) 
12572
subset(sobj, features=rownames(GetAssayData(sobj)))
An object of class Seurat
25144 features across 2700 samples within 2 assays
Active assay: SCT (12572 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
subset(sobj, features=head(rownames(GetAssayData(sobj))))
Error in .subscript.2ary(x, i, j, drop = TRUE) : subscript out of bounds

So the SCT assay does include all features but still the subset function cannot handle subsetting the Seurat object by features.

mschilli87 commented 3 months ago
DefaultAssay(sobj) <- "RNA"
sobj$SCT <- NULL
subset(sobj, features=rownames(sobj))
An object of class Seurat
13714 features across 2700 samples within 1 assay
Active assay: RNA (13714 features, 0 variable features)
 2 layers present: counts, data
subset(sobj, features=head(rownames(sobj)))
An object of class Seurat
6 features across 2700 samples within 1 assay
Active assay: RNA (6 features, 0 variable features)
 2 layers present: counts, data
sobj$copy <- sobj$RNA
Warning message:
Key ‘rna_’ taken, using ‘copy_’ instead
subset(sobj, features=rownames(sobj))
An object of class Seurat
27428 features across 2700 samples within 2 assays
Active assay: RNA (13714 features, 0 variable features)
 2 layers present: counts, data
 1 other assay present: copy
subset(sobj, features=head(rownames(sobj)))
12 features across 2700 samples within 2 assays
Active assay: RNA (6 features, 0 variable features)
 2 layers present: counts, data
 1 other assay present: copy

So this is not just about the presence of any other second assay, but something specific about the SCT one.

mschilli87 commented 3 months ago

@satijalab, @mojaveazure: I just noticed I posted this on the seurat-data repo. Would you mind transferring it over to seurat, so I don't need to close it here and re-open it there? Thank you!

mschilli87 commented 3 months ago

cc @Alexis-Varin, as you had some helpful insights for a similar issue.

Alexis-Varin commented 3 months ago

What happens if, while default assay is SCT, you do subset(sobj, features=head(rownames(sobj[["SCT"]]$data))) #pretty sure SCTransform puts the normalized matrix in data but it could be in counts

and if subset(sobj[["SCT"]], features=head(rownames(sobj))) subset(sobj[["SCT"]], features=head(rownames(sobj[["SCT"]]$data))) subset(sobj[["RNA"]], features=head(rownames(sobj))) subset(sobj[["RNA"]], features=head(rownames(sobj[["RNA"]]$data))) works

mschilli87 commented 3 months ago
subset(sobj, features=head(rownames(sobj[["SCT"]]$data)))
Error in .subscript.2ary(x, i, j, drop = TRUE) : subscript out of bounds
subset(sobj, features=head(rownames(sobj[["SCT"]]$counts)))
Error in .subscript.2ary(x, i, j, drop = TRUE) : subscript out of bounds
subset(sobj[["SCT"]], features=head(rownames(sobj)))
SCTAssay data with 6 features for 2700 cells, and 1 SCTModel(s)
Top 1 variable feature:
 NOC2L
subset(sobj[["SCT"]], features=head(rownames(sobj[["SCT"]]$data)))
SCTAssay data with 6 features for 2700 cells, and 1 SCTModel(s)
Top 1 variable feature:
 NOC2L
subset(sobj[["RNA"]], features=head(rownames(sobj)))
Warning: Different features in new layer data than already exists for counts
Warning: Different features in new layer data than already exists for data
Assay (v5) data with 6 features for 2700 cells
First 6 features:
 AL627309.1, RP11-206L10.2, LINC00115, NOC2L, KLHL17, PLEKHN1
Layers:
 counts, data
subset(sobj[["RNA"]], features=head(rownames(sobj[["RNA"]]$data)))
Warning: Different features in new layer data than already exists for counts
Warning: Different features in new layer data than already exists for data
Assay (v5) data with 6 features for 2700 cells
First 6 features:
 AL627309.1, AP006222.2, RP11-206L10.2, RP11-206L10.9, LINC00115, NOC2L
Layers:
 counts, data
mschilli87 commented 3 months ago

Is there anything I can do to track this down? Is SCT simply not supported/maintained/tested in Seurat anymore? If so, maybe the corresponding vignette should be updated/archived to reflect that new reality?

mschilli87 commented 3 months ago

In case it helps: I tracked it down to this line, which passes when assay is "RNA" but fails when it is "SCT" for the example data above.


update: I can also trigger it here with the above SCT assay and layer defaulting to 'counts'. update: This line triggers the issue with cells being set to colnames(object) and features to Features(object, layer) (layer still defaulting to "counts").

update: I checked and it turns out that Features(object, 'counts') returns all features in the Assay pre-filtering (just like Features(sobj[["SCT"]], 'counts')), but rownames(object@counts) misses most of them as it only contains the filtered data at this point.

mschilli87 commented 3 months ago

Here is how far I got:

subset.Seurat(features = ...) starts by subsetting the Assay data and then updates the n.counts via CalcN which uses the LayerData of the (subset!) SCT Assay for the 'counts' layer. While LayerData does support limiting the query to a specific set of features, the corresponding features parameter defauls to NULL and is not changed by the call in CalcN (neither explicitly, nor via passing on the ...). Thus, LayerData queries all features via the Features function which returns the subset list of features when using the (default) 'data' layer, but still returns the full list of (unfiltered!) features when setting layer = 'counts' as done by LayerData.

I don't know if this is the expected behaviour and the fix would be to pass the filtered feature to the LayerData call or if this is a bug in Feature that ought be fixed instead.

@yuhanH, @mojaveazure: Any idea what is going wrong? AFAICT after a quick git blame you worked on the corresponding code section last.

mschilli87 commented 3 months ago

I can confirm that modifying CaclN to pass the layer parameter to override the 'counts' default value 'fixes' this issue. However, I have no clue if it results in the correct results and/or if I this breaking anything else. I can provide a PR if this eases the review but I'd like to get some sort of feedback before putting in more work into this.