Integration loses specificity when 1 dataset is a subset of another.

I've been having trouble integrating such datasets:

dataset 1: contains m cell types
dataset 2: contains n out of these m cell types. I would expect that the n cell types in dataset 2 to match the same cell types in dataset 1, and the rest m-n cell types unique to dataset 1 will only have cells from dataset 1. This issue can be largely ignored when all cell types are distinct (scenario 1), but becomes a hard problem when dealing with a developmental trajectory (scenario 2).

Scenario 1: the ifnb dataset from Seurat. first, integration using full datasets.

LoadData("ifnb")
ifnb.list <- SplitObject(ifnb, split.by = "stim")
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
  x <- ScaleData(x)
  x <- RunPCA(x)
  x <- RunUMAP(object = x, dims = 1:10)
  x <- FindNeighbors(x, dims = 1:10)
  x <- FindClusters(x, resolution = 0.5)
  return(x)
})

features <- SelectIntegrationFeatures(object.list = ifnb.list)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)
immune.combined <- IntegrateData(anchorset = immune.anchors)

DefaultAssay(immune.combined) <- "integrated"

immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindClusters(immune.combined, resolution = 0.5)

cells.sub <- colnames(immune.combined)[immune.combined$seurat_clusters %in% c("0", "3", "8")]

immune.combined %>% DimPlot(cells.highlight = cells.sub, split.by = "stim") +
  ggsave("int_debug/plots/ifnb_post_highlight.png", units = "in", dpi = 150, width = 10, height = 5, device = "png")

The resulting plot: ifnb_post_highlight

Next, we subset the second seurat object in ifnb.list (stimulated one), while keeping the entire first object

ifnb.list.partial <- ifnb.list
ifnb.list.partial$STIM <- ifnb.list.partial$STIM[, cells.sub]

ifnb.list.partial <- lapply(X = ifnb.list.partial, FUN = function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
  return(x)
})

features <- SelectIntegrationFeatures(object.list = ifnb.list.partial)
immune.partial.anchors <- FindIntegrationAnchors(object.list = ifnb.list.partial, anchor.features = features)
# this command creates an 'integrated' data assay
immune.partial.combined <- IntegrateData(anchorset = immune.partial.anchors)

# specify that we will perform downstream analysis on the corrected data note that the original
# unmodified data still resides in the 'RNA' assay
DefaultAssay(immune.partial.combined) <- "integrated"

# Run the standard workflow for visualization and clustering
immune.partial.combined <- ScaleData(immune.partial.combined, verbose = FALSE)
immune.partial.combined <- RunPCA(immune.partial.combined, npcs = 30, verbose = FALSE)
immune.partial.combined <- RunUMAP(immune.partial.combined, reduction = "pca", dims = 1:30)
immune.partial.combined <- FindNeighbors(immune.partial.combined, reduction = "pca", dims = 1:30)
immune.partial.combined <- FindClusters(immune.partial.combined, resolution = 0.5)

immune.partial.combined %>% DimPlot(cells.highlight = cells.sub, split.by = "stim", pt.size = 0.05) +
  ggsave("int_debug/plots/ifnb_post_partial_highlight.png", units = "in", dpi = 150, width = 10, height = 5, device = "png")

The resulting plot: ifnb_post_partial_highlight

As you can see, Seurat did "mis-assign" some of the cells, but the majority of the cells are correctly matched.

Scenario 2 This is my own dataset. In this developmental system, the stem cells will differentiate into 3 different trajectories, as annotated in the figure below (for sample 1). The stem cell populations in these 2 samples:

sol[[1]]$stem.cells <- NA
sol[[1]]$stem.cells[sol[[1]]$seurat_clusters %in% c("1", "17")] <- "stem.cells"
p.full.pre.1 <- DimPlot(sol[[1]], cells.highlight = colnames(sol[[1]])[sol[[1]]$stem.cells == "stem.cells"]) + 
  ggtitle("sample.1. full. pre-integration")

sol[[2]]$stem.cells <- NA
sol[[2]]$stem.cells[sol[[2]]$seurat_clusters %in% c("0", "13")] <- "stem.cells"
p.full.pre.2 <- DimPlot(sol[[2]], cells.highlight = colnames(sol[[2]])[sol[[2]]$stem.cells == "stem.cells"]) + 
  ggtitle("sample.2. full. pre-integration")

I integrated my 2 samples:

# sol: a list of seurat objects. (2 samples)
int.features <- SelectIntegrationFeatures(object.list = sol, 
                                            assay = c("SCT", "SCT"))
sol <- PrepSCTIntegration(object.list = sol, anchor.features = int.features, assay = c("SCT", "SCT"), verbose = T)

anchors <- FindIntegrationAnchors(object.list = sol, normalization.method = "SCT",
                                    assay = c("SCT", "SCT"),
                                    anchor.features = int.features, verbose = T)
soi <- IntegrateData(anchorset = anchors, normalization.method = "SCT",
                       verbose = T)
soi <- RunPCA(soi, npcs = 30, assay =  "integrated")
soi <- RunUMAP(soi, reduction = "pca", dims = 1:30, assay =  "integrated")

The stem cell populations matched very well through integration:

p.full.post <- DimPlot(soi, cells.highlight = colnames(soi)[soi$stem.cells == "stem.cells"], split.by = "sample") + 
  ggtitle("full. post-integration")

Now, I subset the sample.2 to only contain stem cells, and repeat integration:

so2.partial <- sol[[2]][, sol[[2]]$stem.cells == "stem.cells"]
so2.partial <- SCTransform(so2.partial, assay = "RNA", do.correct.umi = T,
                           verbose = T)

l <- list(sample.1.full = sol[[1]], sample.2.partial = so2.partial)

t.int.features <- SelectIntegrationFeatures(object.list = l, 
                                          assay = c("SCT", "SCT"))
l <- PrepSCTIntegration(object.list = l, anchor.features = t.int.features, assay = c("SCT", "SCT"), verbose = T)

t.anchors <- FindIntegrationAnchors(object.list = l, normalization.method = "SCT",
                                  assay = c("SCT", "SCT"),
                                  anchor.features = t.int.features, verbose = T)
soi.2 <- IntegrateData(anchorset = t.anchors, normalization.method = "SCT",
                       verbose = T)
soi.2 <- RunPCA(soi.2, npcs = 30, assay =  "integrated")
soi.2 <- RunUMAP(soi.2, reduction = "pca", dims = 1:30, assay =  "integrated")
soi.2 <- FindNeighbors(soi.2, reduction = "pca", dims = 1:30)
soi.2 <- FindClusters(soi.2, resolution = 1.2)

p.partial.post <- DimPlot(soi.2, cells.highlight = colnames(soi.2)[soi.2$stem.cells == "stem.cells"], split.by = "sample") + 
  ggtitle("partial. post-integration")

As you can see, the stem cell population is now mostly scattered across all populations. This is also shown in the plot below. sample.2 doesn't even concentrate at cluster 3 (where stem cells are supposed to be at)

I would highly appreciate it if you could offer help with your expertise. What are the parameters I need to tune to make the integration better? We have more datasets where only the stem cells were sorted and sequenced. I hope to integrate these samples into this full developmental landscape.

Best regards.

satijalab / seurat

Integration loses specificity when 1 dataset is a subset of another. #4094