satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.27k stars 910 forks source link

Integration loses specificity when 1 dataset is a subset of another. #4094

Closed fanc-WU closed 3 years ago

fanc-WU commented 3 years ago

I've been having trouble integrating such datasets:

Scenario 1: the ifnb dataset from Seurat. first, integration using full datasets.

LoadData("ifnb")
ifnb.list <- SplitObject(ifnb, split.by = "stim")
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
  x <- ScaleData(x)
  x <- RunPCA(x)
  x <- RunUMAP(object = x, dims = 1:10)
  x <- FindNeighbors(x, dims = 1:10)
  x <- FindClusters(x, resolution = 0.5)
  return(x)
})

features <- SelectIntegrationFeatures(object.list = ifnb.list)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)
immune.combined <- IntegrateData(anchorset = immune.anchors)

DefaultAssay(immune.combined) <- "integrated"

immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindClusters(immune.combined, resolution = 0.5)

cells.sub <- colnames(immune.combined)[immune.combined$seurat_clusters %in% c("0", "3", "8")]

immune.combined %>% DimPlot(cells.highlight = cells.sub, split.by = "stim") +
  ggsave("int_debug/plots/ifnb_post_highlight.png", units = "in", dpi = 150, width = 10, height = 5, device = "png")

The resulting plot: ifnb_post_highlight

Next, we subset the second seurat object in ifnb.list (stimulated one), while keeping the entire first object

ifnb.list.partial <- ifnb.list
ifnb.list.partial$STIM <- ifnb.list.partial$STIM[, cells.sub]

ifnb.list.partial <- lapply(X = ifnb.list.partial, FUN = function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
  return(x)
})

features <- SelectIntegrationFeatures(object.list = ifnb.list.partial)
immune.partial.anchors <- FindIntegrationAnchors(object.list = ifnb.list.partial, anchor.features = features)
# this command creates an 'integrated' data assay
immune.partial.combined <- IntegrateData(anchorset = immune.partial.anchors)

# specify that we will perform downstream analysis on the corrected data note that the original
# unmodified data still resides in the 'RNA' assay
DefaultAssay(immune.partial.combined) <- "integrated"

# Run the standard workflow for visualization and clustering
immune.partial.combined <- ScaleData(immune.partial.combined, verbose = FALSE)
immune.partial.combined <- RunPCA(immune.partial.combined, npcs = 30, verbose = FALSE)
immune.partial.combined <- RunUMAP(immune.partial.combined, reduction = "pca", dims = 1:30)
immune.partial.combined <- FindNeighbors(immune.partial.combined, reduction = "pca", dims = 1:30)
immune.partial.combined <- FindClusters(immune.partial.combined, resolution = 0.5)

immune.partial.combined %>% DimPlot(cells.highlight = cells.sub, split.by = "stim", pt.size = 0.05) +
  ggsave("int_debug/plots/ifnb_post_partial_highlight.png", units = "in", dpi = 150, width = 10, height = 5, device = "png")

The resulting plot: ifnb_post_partial_highlight

As you can see, Seurat did "mis-assign" some of the cells, but the majority of the cells are correctly matched.

Scenario 2 This is my own dataset. In this developmental system, the stem cells will differentiate into 3 different trajectories, as annotated in the figure below (for sample 1). image The stem cell populations in these 2 samples:

sol[[1]]$stem.cells <- NA
sol[[1]]$stem.cells[sol[[1]]$seurat_clusters %in% c("1", "17")] <- "stem.cells"
p.full.pre.1 <- DimPlot(sol[[1]], cells.highlight = colnames(sol[[1]])[sol[[1]]$stem.cells == "stem.cells"]) + 
  ggtitle("sample.1. full. pre-integration")

sol[[2]]$stem.cells <- NA
sol[[2]]$stem.cells[sol[[2]]$seurat_clusters %in% c("0", "13")] <- "stem.cells"
p.full.pre.2 <- DimPlot(sol[[2]], cells.highlight = colnames(sol[[2]])[sol[[2]]$stem.cells == "stem.cells"]) + 
  ggtitle("sample.2. full. pre-integration")

image

I integrated my 2 samples:

# sol: a list of seurat objects. (2 samples)
int.features <- SelectIntegrationFeatures(object.list = sol, 
                                            assay = c("SCT", "SCT"))
sol <- PrepSCTIntegration(object.list = sol, anchor.features = int.features, assay = c("SCT", "SCT"), verbose = T)

anchors <- FindIntegrationAnchors(object.list = sol, normalization.method = "SCT",
                                    assay = c("SCT", "SCT"),
                                    anchor.features = int.features, verbose = T)
soi <- IntegrateData(anchorset = anchors, normalization.method = "SCT",
                       verbose = T)
soi <- RunPCA(soi, npcs = 30, assay =  "integrated")
soi <- RunUMAP(soi, reduction = "pca", dims = 1:30, assay =  "integrated")

The stem cell populations matched very well through integration:

p.full.post <- DimPlot(soi, cells.highlight = colnames(soi)[soi$stem.cells == "stem.cells"], split.by = "sample") + 
  ggtitle("full. post-integration")

image Now, I subset the sample.2 to only contain stem cells, and repeat integration:

so2.partial <- sol[[2]][, sol[[2]]$stem.cells == "stem.cells"]
so2.partial <- SCTransform(so2.partial, assay = "RNA", do.correct.umi = T,
                           verbose = T)

l <- list(sample.1.full = sol[[1]], sample.2.partial = so2.partial)

t.int.features <- SelectIntegrationFeatures(object.list = l, 
                                          assay = c("SCT", "SCT"))
l <- PrepSCTIntegration(object.list = l, anchor.features = t.int.features, assay = c("SCT", "SCT"), verbose = T)

t.anchors <- FindIntegrationAnchors(object.list = l, normalization.method = "SCT",
                                  assay = c("SCT", "SCT"),
                                  anchor.features = t.int.features, verbose = T)
soi.2 <- IntegrateData(anchorset = t.anchors, normalization.method = "SCT",
                       verbose = T)
soi.2 <- RunPCA(soi.2, npcs = 30, assay =  "integrated")
soi.2 <- RunUMAP(soi.2, reduction = "pca", dims = 1:30, assay =  "integrated")
soi.2 <- FindNeighbors(soi.2, reduction = "pca", dims = 1:30)
soi.2 <- FindClusters(soi.2, resolution = 1.2)

p.partial.post <- DimPlot(soi.2, cells.highlight = colnames(soi.2)[soi.2$stem.cells == "stem.cells"], split.by = "sample") + 
  ggtitle("partial. post-integration")

image

As you can see, the stem cell population is now mostly scattered across all populations. This is also shown in the plot below. sample.2 doesn't even concentrate at cluster 3 (where stem cells are supposed to be at) image

I would highly appreciate it if you could offer help with your expertise. What are the parameters I need to tune to make the integration better? We have more datasets where only the stem cells were sorted and sequenced. I hope to integrate these samples into this full developmental landscape.

Best regards.

jaisonj708 commented 3 years ago

You may try using RPCA rather than CCA as it is a more conservative integration.

Even then, the integration methods assume that there is sufficient similarity between your samples. In your case, a majority of the axes of variation in one of your samples (all cells) will not be present in your other sample (stem cells only). This can result in somewhat poorer integration though I have yet to see a case where the difference is as drastic as this.

Feel free to send your dataset to seuratpackage@gmail.com for further guidance and assistance.