satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.27k stars 910 forks source link

Issue with the merge function of Seurat handling objects with different sets of genes #8760

Open leonfodoulian opened 6 months ago

leonfodoulian commented 6 months ago

Hi,

I believe there is an issue with the merge function of Seurat. The problem lies in the way Seurat handles the feature.attributes and scale.data slots of two objects with different sets of expressed genes (though with a high overlap) on which Seurat::SCTransform() was computed. Running the code in two different ways (but essentially identical in terms of the expected outcome) results in very different results:

  1. Merge the two Seurat objects then run Seurat::SCTransform() (referred to as the "standard Seurat version 5 workflow" in the code);
  2. Run Seurat::SCTransform() then merge the objects (referred to as the "lapply approach" in the code).

For reproducibility, please check the code below:

# Required packages
require(Seurat)

# Load Broad Institute PBMC Systematic Comparative Analysis dataset
obj <- SeuratData::LoadData("pbmcsca")

# Subset two datasets and prepare data for processing
obj.l <- lapply(
  X = c("smartseq2" = "Smart-seq2",
        "celseq2" = "CEL-Seq2"),
  FUN = function(method,
                 obj) {
    # Retrieve counts data for dataset from each method
    counts.mat <- GetAssayData(object = subset(x = obj,
                                               subset = Method == method),
                               assay = "RNA",
                               layer = "counts")

    # Remove all genes not expressed in at least 3 cells
    genes.to.keep <- rowSums(x = counts.mat > 0) >= 3 
    counts.mat <- counts.mat[genes.to.keep,]

    # Create a new Seurat object
    obj <- CreateSeuratObject(
      counts = counts.mat,
      assay = "RNA",
      min.cells = 0,
      min.features = 0,
    )

    # Add method to new Seurat object
    obj@meta.data$Method <- method

    # Return object
    return(obj)
  },
  obj = obj)

# Get maximum number of cells and genes
ncells <- Reduce(f = max, x = lapply(X = obj.l, FUN = ncol))
ngenes <- Reduce(f = max, x = lapply(X = obj.l, FUN = nrow))

# Prepare data for standard Seurat version 5 workflow
obj.merged.v5 <-  merge(x = obj.l$smartseq2,
                        y = obj.l$celseq2,
                        add.cell.ids = NULL,
                        collapse = FALSE,
                        merge.data = TRUE,
                        merge.dr = FALSE)

# Run sctransform following standard Seurat version 5 workflow
obj.merged.v5 <- SCTransform(object = obj.merged.v5,
                             ncells = ncells,
                             variable.features.n = ngenes,
                             clip.range = c(-sqrt(x = ncells / 30),
                                            sqrt(x = ncells / 30)), # same clipping range for the two approaches
                             n_genes = ngenes)

# Run sctransform using lapply
obj.l <- lapply(X = obj.l,
                FUN = SCTransform,
                ncells = ncells,
                variable.features.n = ngenes,
                clip.range = c(-sqrt(x = ncells / 30),
                               sqrt(x = ncells / 30)), # same clipping range for the two approaches
                n_genes = ngenes)

# Merge Seurat objects
obj.merged.l5 <- merge(x = obj.l$smartseq2,
                       y = obj.l$celseq2,
                       add.cell.ids = NULL,
                       collapse = FALSE,
                       merge.data = TRUE,
                       merge.dr = FALSE)

# The feature.attributes slots do not match between the two approaches
mapply(v5 = obj.merged.v5@assays$SCT@SCTModel.list,
       l5 = obj.merged.l5@assays$SCT@SCTModel.list,
       FUN = function(v5,
                      l5) {
         table(v5@feature.attributes == l5@feature.attributes[rownames(x = v5@feature.attributes),]) # rownames don't match either
       },
       SIMPLIFY = FALSE,
       USE.NAMES = TRUE) # returns TRUE for the first dataset and TRUE and FALSE for the second

# The scale.data slots do not match either between the two approaches
lapply(X = 1:10, # check the first 10 genes
       FUN = function(i,
                      v5,
                      l5) {
         table(v5[i,] == l5[i,colnames(x = v5)])
       },
       v5 = obj.merged.v5@assays$SCT@scale.data,
       l5 = obj.merged.l5@assays$SCT@scale.data)

# The feature.attributes slots match for the lapply approach before and after merging
mapply(l = obj.l,
       l5 = obj.merged.l5@assays$SCT@SCTModel.list,
       FUN = function(l,
                      l5) {
         table(l@assays$SCT@SCTModel.list$counts@feature.attributes == l5@feature.attributes)
       },
       SIMPLIFY = FALSE,
       USE.NAMES = TRUE) # returns TRUE for all datasets

# The scale.data slots match for the lapply approach before and after merging
lapply(X = obj.l,
       FUN = function(l,
                      i,
                      l5) {
         lapply(X = i,
                FUN = function(i,
                               l,
                               l5) {
                  table(l@assays$SCT@scale.data[i,] == l5[i,colnames(x = l@assays$SCT@scale.data)])
                },
                l = l,
                l5 = l5)
       },
       i = 1:10, # check the first 10 genes
       l5 = obj.merged.l5@assays$SCT@scale.data) # returns TRUE for all genes

Here is also my sessionInfo():

> devtools::session_info()
─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Monterey 12.7
 system   x86_64, darwin20
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Zurich
 date     2024-04-11
 rstudio  2023.09.0+463 Desert Sunflower (desktop)
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package          * version    date (UTC) lib source
 abind              1.4-5      2016-07-21 [1] CRAN (R 4.3.0)
 backports          1.4.1      2021-12-13 [1] CRAN (R 4.3.0)
 base64enc          0.1-3      2015-07-28 [1] CRAN (R 4.3.0)
 bbmle              1.0.25.1   2023-12-09 [1] CRAN (R 4.3.0)
 bdsmatrix          1.3-7      2024-03-02 [1] CRAN (R 4.3.2)
 bitops             1.0-7      2021-04-24 [1] CRAN (R 4.3.0)
 cachem             1.0.8      2023-05-01 [1] CRAN (R 4.3.0)
 caTools            1.18.2     2021-03-28 [1] CRAN (R 4.3.0)
 checkmate          2.3.1      2023-12-04 [1] CRAN (R 4.3.0)
 cli                3.6.2      2023-12-11 [1] CRAN (R 4.3.0)
 cluster            2.1.6      2023-12-01 [1] CRAN (R 4.3.0)
 clustree           0.5.1      2023-11-05 [1] CRAN (R 4.3.0)
 codetools          0.2-20     2024-03-31 [1] CRAN (R 4.3.2)
 colorspace         2.1-0      2023-01-23 [1] CRAN (R 4.3.0)
 cowplot            1.1.3      2024-01-22 [1] CRAN (R 4.3.2)
 curl               5.2.1      2024-03-01 [1] CRAN (R 4.3.2)
 data.table         1.15.4     2024-03-30 [1] CRAN (R 4.3.2)
 deldir             2.0-4      2024-02-28 [1] CRAN (R 4.3.2)
 densEstBayes       1.0-2.2    2023-03-31 [1] CRAN (R 4.3.0)
 devtools           2.4.5      2022-10-11 [1] CRAN (R 4.3.0)
 digest             0.6.35     2024-03-11 [1] CRAN (R 4.3.2)
 dotCall64          1.1-1      2023-11-28 [1] CRAN (R 4.3.0)
 dplyr              1.1.4      2023-11-17 [1] CRAN (R 4.3.0)
 ellipsis           0.3.2      2021-04-29 [1] CRAN (R 4.3.0)
 evaluate           0.23       2023-11-01 [1] CRAN (R 4.3.0)
 fansi              1.0.6      2023-12-08 [1] CRAN (R 4.3.0)
 farver             2.1.1      2022-07-06 [1] CRAN (R 4.3.0)
 fastDummies        1.7.3      2023-07-06 [1] CRAN (R 4.3.0)
 fastmap            1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
 fitdistrplus       1.1-11     2023-04-25 [1] CRAN (R 4.3.0)
 foreign            0.8-86     2023-11-28 [1] CRAN (R 4.3.0)
 Formula            1.2-5      2023-02-24 [1] CRAN (R 4.3.0)
 fs                 1.6.3      2023-07-20 [1] CRAN (R 4.3.0)
 future             1.33.2     2024-03-26 [1] CRAN (R 4.3.2)
 future.apply       1.11.2     2024-03-28 [1] CRAN (R 4.3.2)
 generics           0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
 ggforce            0.4.2      2024-02-19 [1] CRAN (R 4.3.2)
 ggplot2            3.5.0      2024-02-23 [1] CRAN (R 4.3.2)
 ggraph             2.2.1      2024-03-07 [1] CRAN (R 4.3.2)
 ggrepel            0.9.5      2024-01-10 [1] CRAN (R 4.3.0)
 ggridges           0.5.6      2024-01-23 [1] CRAN (R 4.3.2)
 globals            0.16.3     2024-03-08 [1] CRAN (R 4.3.2)
 glue               1.7.0      2024-01-09 [1] CRAN (R 4.3.0)
 goftest            1.2-3      2021-10-07 [1] CRAN (R 4.3.0)
 gplots             3.1.3.1    2024-02-02 [1] CRAN (R 4.3.2)
 graphlayouts       1.1.1      2024-03-09 [1] CRAN (R 4.3.2)
 gridExtra          2.3        2017-09-09 [1] CRAN (R 4.3.0)
 gtable             0.3.4      2023-08-21 [1] CRAN (R 4.3.0)
 gtools             3.9.5      2023-11-20 [1] CRAN (R 4.3.0)
 Hmisc              5.1-2      2024-03-11 [1] CRAN (R 4.3.2)
 htmlTable          2.4.2      2023-10-29 [1] CRAN (R 4.3.0)
 htmltools          0.5.8.1    2024-04-04 [1] CRAN (R 4.3.1)
 htmlwidgets        1.6.4      2023-12-06 [1] CRAN (R 4.3.0)
 httpuv             1.6.15     2024-03-26 [1] CRAN (R 4.3.2)
 httr               1.4.7      2023-08-15 [1] CRAN (R 4.3.0)
 ica                1.0-3      2022-07-08 [1] CRAN (R 4.3.0)
 igraph             2.0.3      2024-03-13 [1] CRAN (R 4.3.2)
 inline             0.3.19     2021-05-31 [1] CRAN (R 4.3.0)
 irlba              2.3.5.1    2022-10-03 [1] CRAN (R 4.3.0)
 jsonlite           1.8.8      2023-12-04 [1] CRAN (R 4.3.0)
 KernSmooth         2.23-22    2023-07-10 [1] CRAN (R 4.3.0)
 knitr              1.45       2023-10-30 [1] CRAN (R 4.3.0)
 later              1.3.2      2023-12-06 [1] CRAN (R 4.3.0)
 lattice            0.22-6     2024-03-20 [1] CRAN (R 4.3.2)
 lazyeval           0.2.2      2019-03-15 [1] CRAN (R 4.3.0)
 leiden             0.4.3.1    2023-11-17 [1] CRAN (R 4.3.0)
 lifecycle          1.0.4      2023-11-07 [1] CRAN (R 4.3.0)
 listenv            0.9.1      2024-01-29 [1] CRAN (R 4.3.2)
 lmtest             0.9-40     2022-03-21 [1] CRAN (R 4.3.0)
 loo                2.7.0      2024-02-24 [1] CRAN (R 4.3.2)
 M3Drop             3.10.6     2023-10-05 [1] Github (tallulandrews/M3Drop@9128921)
 magrittr           2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
 MASS               7.3-60.0.1 2024-01-13 [1] CRAN (R 4.3.0)
 Matrix             1.6-5      2024-01-11 [1] CRAN (R 4.3.0)
 matrixStats        1.2.0      2023-12-11 [1] CRAN (R 4.3.0)
 memoise            2.0.1      2021-11-26 [1] CRAN (R 4.3.0)
 mgcv               1.9-1      2023-12-21 [1] CRAN (R 4.3.0)
 mime               0.12       2021-09-28 [1] CRAN (R 4.3.0)
 miniUI             0.1.1.1    2018-05-18 [1] CRAN (R 4.3.0)
 munsell            0.5.1      2024-04-01 [1] CRAN (R 4.3.2)
 mvtnorm            1.2-4      2023-11-27 [1] CRAN (R 4.3.0)
 nlme               3.1-164    2023-11-27 [1] CRAN (R 4.3.0)
 nnet               7.3-19     2023-05-03 [1] CRAN (R 4.3.1)
 numDeriv           2016.8-1.1 2019-06-06 [1] CRAN (R 4.3.0)
 parallelly         1.37.1     2024-02-29 [1] CRAN (R 4.3.2)
 patchwork          1.2.0      2024-01-08 [1] CRAN (R 4.3.0)
 pbapply            1.7-2      2023-06-27 [1] CRAN (R 4.3.0)
 pillar             1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
 pkgbuild           1.4.4      2024-03-17 [1] CRAN (R 4.3.2)
 pkgconfig          2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
 pkgload            1.3.4      2024-01-16 [1] CRAN (R 4.3.0)
 plotly             4.10.4     2024-01-13 [1] CRAN (R 4.3.0)
 plyr               1.8.9      2023-10-02 [1] CRAN (R 4.3.0)
 png                0.1-8      2022-11-29 [1] CRAN (R 4.3.0)
 polyclip           1.10-6     2023-09-27 [1] CRAN (R 4.3.0)
 profvis            0.3.8      2023-05-02 [1] CRAN (R 4.3.0)
 progressr          0.14.0     2023-08-10 [1] CRAN (R 4.3.0)
 promises           1.2.1      2023-08-10 [1] CRAN (R 4.3.0)
 purrr              1.0.2      2023-08-10 [1] CRAN (R 4.3.0)
 QuickJSR           1.1.3      2024-01-31 [1] CRAN (R 4.3.2)
 R6                 2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
 RANN               2.6.1      2019-01-08 [1] CRAN (R 4.3.0)
 RColorBrewer       1.1-3      2022-04-03 [1] CRAN (R 4.3.0)
 Rcpp               1.0.12     2024-01-09 [1] CRAN (R 4.3.0)
 RcppAnnoy          0.0.22     2024-01-23 [1] CRAN (R 4.3.2)
 RcppHNSW           0.6.0      2024-02-04 [1] CRAN (R 4.3.2)
 RcppParallel       5.1.7      2023-02-27 [1] CRAN (R 4.3.0)
 reldist            1.7-2      2023-02-17 [1] CRAN (R 4.3.0)
 remotes            2.5.0      2024-03-17 [1] CRAN (R 4.3.2)
 reshape2           1.4.4      2020-04-09 [1] CRAN (R 4.3.0)
 reticulate         1.35.0     2024-01-31 [1] CRAN (R 4.3.2)
 rlang              1.1.3      2024-01-10 [1] CRAN (R 4.3.0)
 rmarkdown          2.26       2024-03-05 [1] CRAN (R 4.3.2)
 ROCR               1.0-11     2020-05-02 [1] CRAN (R 4.3.0)
 rpart              4.1.23     2023-12-05 [1] CRAN (R 4.3.0)
 RSpectra           0.16-1     2022-04-24 [1] CRAN (R 4.3.0)
 rstan              2.32.6     2024-03-05 [1] CRAN (R 4.3.2)
 rstantools         2.4.0      2024-01-31 [1] CRAN (R 4.3.2)
 rstudioapi         0.16.0     2024-03-24 [1] CRAN (R 4.3.2)
 Rtsne              0.17       2023-12-07 [1] CRAN (R 4.3.0)
 scales             1.3.0      2023-11-28 [1] CRAN (R 4.3.0)
 scattermore        1.2        2023-06-12 [1] CRAN (R 4.3.0)
 sctransform        0.4.1      2023-10-19 [1] CRAN (R 4.3.0)
 sessioninfo        1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
 Seurat           * 5.0.3      2024-03-18 [1] CRAN (R 4.3.2)
 SeuratObject     * 5.0.1      2023-11-17 [1] CRAN (R 4.3.0)
 shiny              1.8.1.1    2024-04-02 [1] CRAN (R 4.3.1)
 sp               * 2.1-3      2024-01-30 [1] CRAN (R 4.3.2)
 spam               2.10-0     2023-10-23 [1] CRAN (R 4.3.0)
 spatstat.data      3.0-4      2024-01-15 [1] CRAN (R 4.3.0)
 spatstat.explore   3.2-7      2024-03-21 [1] CRAN (R 4.3.2)
 spatstat.geom      3.2-9      2024-02-28 [1] CRAN (R 4.3.2)
 spatstat.random    3.2-3      2024-02-29 [1] CRAN (R 4.3.2)
 spatstat.sparse    3.0-3      2023-10-24 [1] CRAN (R 4.3.0)
 spatstat.utils     3.0-4      2023-10-24 [1] CRAN (R 4.3.0)
 StanHeaders        2.32.6     2024-03-01 [1] CRAN (R 4.3.2)
 statmod            1.5.0      2023-01-06 [1] CRAN (R 4.3.0)
 stringi            1.8.3      2023-12-11 [1] CRAN (R 4.3.0)
 stringr            1.5.1      2023-11-14 [1] CRAN (R 4.3.0)
 survival           3.5-8      2024-02-14 [1] CRAN (R 4.3.2)
 tensor             1.5        2012-05-05 [1] CRAN (R 4.3.0)
 tibble             3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
 tidygraph          1.3.1      2024-01-30 [1] CRAN (R 4.3.2)
 tidyr              1.3.1      2024-01-24 [1] CRAN (R 4.3.2)
 tidyselect         1.2.1      2024-03-11 [1] CRAN (R 4.3.2)
 tweenr             2.0.3      2024-02-26 [1] CRAN (R 4.3.2)
 urlchecker         1.0.1      2021-11-30 [1] CRAN (R 4.3.0)
 usethis            2.2.3      2024-02-19 [1] CRAN (R 4.3.2)
 utf8               1.2.4      2023-10-22 [1] CRAN (R 4.3.0)
 uwot               0.1.16     2023-06-29 [1] CRAN (R 4.3.0)
 V8                 4.4.2      2024-02-15 [1] CRAN (R 4.3.2)
 vctrs              0.6.5      2023-12-01 [1] CRAN (R 4.3.0)
 viridis            0.6.5      2024-01-29 [1] CRAN (R 4.3.2)
 viridisLite        0.4.2      2023-05-02 [1] CRAN (R 4.3.0)
 withr              3.0.0      2024-01-16 [1] CRAN (R 4.3.0)
 xfun               0.43       2024-03-25 [1] CRAN (R 4.3.2)
 xtable             1.8-4      2019-04-21 [1] CRAN (R 4.3.0)
 zoo                1.8-12     2023-04-13 [1] CRAN (R 4.3.0)

 [1] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library

Best, Leon

NeuralBind commented 6 months ago

Well, if you SCTransform each individual dataset and then merge it shouldnt have the same output as if you SCT all your datasets together. You have different data input when you conduct the normilization. Except if you merge but still individualy SCT each layer which should have the same output as before merging each one.

leonfodoulian commented 6 months ago

Hi @NeuralBind,

The code I provided above splits the merged object before computing Seurat::SCTransform(). Therefore, Seurat::SCTransform() should run on each dataset individually. These two approaches should yield the same results. And as a matter of fact, if the datasets contain the same genes, the results match between the two approaches. This issue is due to the merge function of Seurat mishandling data with different sets of genes.

Best, Leon

NeuralBind commented 6 months ago

Well i first tried it couple day ago, i merged two SCT transformed datasets (it kept the each single layer SCT$count1, SCT$count2 etc.), but by the time you run SCT again it transforms the whole dataset (and not individualy) and overwrites with a single layer for the whole set. Which is something you want in order to visualize the variations of similar groups for different conditions etc. You can try to play a bit with the Join layers function (for joining or spliting) and look what SCT does to the merged set. In the vignettes this part was a bit trivial so i just trial an error to see the function. For the different gene set it shouldnt be a problem as it will add more genes but they will be no expressed in a part of your dataset, so they will be transformed based on that.

leonfodoulian commented 6 months ago

Hi @NeuralBind,

I think you misunderstood the problem. Let us wait for an answer from the Seurat team regarding this issue.

Best, Leon