satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.29k stars 915 forks source link

Challenges with Label Transfer from Batch-Corrected Reference (FastMNN) : Prevalence of Dominant Cell Type #7734

Closed ShineYourL1ght closed 4 months ago

ShineYourL1ght commented 1 year ago

I'm encountering difficulties while attempting to execute the 'FindTransferAnchors' and 'TransferData' functions from the Seurat package.

I performed data integration and batch correction on a dataset containing 11 snRNA-snATAC multiome samples labeled as Vm0 to Vm10 (with only the RNA assay selected). The goal was to transfer labels onto my snRNA-seq dataset, which is assigned as D. After preprocessing and eliminating doublets, I conducted integration using Seurat and performed batch correction using FastMNN, assigned as V.

However, I'm facing an issue where a specific cell type(Sertoli) appears to have the highest score in over 99% of the cells. I'm uncertain about the correctness of my code, especially considering that most available vignettes do not utilize MNN for reduction. I'm wondering if downsampling could potentially resolve the current issue.

Additionally, I'm curious if there's a different protocol or method to follow when using batch-corrected samples as a reference for label transfer. I would greatly appreciate your assistance in addressing these matters.

For context, D_test$new_Ident represents the initial cell type annotation for my samples. Idents(V) represents the initial cell type annotation for my the FastMNN-corrected reference.

table(D_test$new_Ident)

      Sertoli Peritubular Myoid         Unknown 1         Unknown 2            Leydig         Unknown 3       Endothelial        Macrophage            Immune 
         8268              1512              1140               852               418               217               130                77                41

table(Idents(V))

Sertoli Granulosa 1         FGC      Leydig Granulosa 2        Meso  Macrophage        Endo    Endo/Epi     Unknown     Myeloid 
   5833        7706       11955        8296        4076        2222         673         577         539         370         133 

table(D_test$predicted.id)

   FGC Macrophage       Meso    Sertoli 
    45         46          1      12563 

Would you revise the following codes?

Label transfer after manual annotation

new.cluster.ids <- c("Sertoli", "Granulosa 1", "FGC", "Leydig", "Granulosa 2", "FGC", "Leydig", "FGC", "FGC","Meso","Granulosa 1","Granulosa 1","FGC","Granulosa 2","Leydig","Leydig","Macrophage","Endo", "Meso", "Endo", "Leydig", "FGC", "Unknown", "Sertoli", "Sertoli", "FGC", "Myeloid", "Macrophage")
names(new.cluster.ids) <- levels(V)
V <- RenameIdents(V, new.cluster.ids)
DefaultAssay(V) <- "mnn.reconstructed"

anchors_30  <- FindTransferAnchors(
  reference = V,
  query = D,
  dims = 1:30,
  reference.reduction = "mnn"
)

predict_mnn <- TransferData(anchorset = anchors_30,
                            refdata = V@active.ident, )
D_test <- D
D_test <- AddMetaData(D_test, metadata = predict_mnn)
p0 <- DimPlot(D_test, group.by = "predicted.id", label = T, label.size = 4)

Code for pre-processing, doublet-removal, FastMNN is below.

Load data

data_dir_0 = "../..."
n0 = "..."
Vm0.data <- Read10X(data.dir = data_dir_0)
Vm0.rna <- Vm0.data$`Gene Expression`
Vm0 <- CreateSeuratObject(counts = Vm0.rna, project = "Vm0",  min.cells = 3, min.features = 200)

text1 <- "Repeat for Vm1, ..., Vm10"

Pre-processing

.ub <- function(x, p) {
  quantile(x[["nFeature_RNA"]]$nFeature_RNA, probs = p)
}

.pp <- function(s, p=0.97, d = 25, n = 3000, r = 1.0){
  s[["percent.mt"]] <- PercentageFeatureSet(s, pattern = "^MT-")
  s<- subset(s, subset = nFeature_RNA > 200 & nFeature_RNA < .ub(s, p)  & percent.mt < 10)
  DefaultAssay(s) <- "RNA"
  s <- NormalizeData(s)
  s <- FindVariableFeatures(s, selection.method = "vst", nfeatures = n)
  s <- ScaleData(s)
  s <- RunPCA(s)
  s <- FindNeighbors(s, dims = 1:d)
  s <- FindClusters(s, resolution = r, verbose = FALSE)
  s <- RunUMAP(s, dims = 1:d)
  return(s)
}

variable_names <- paste0("Vm", 0:10)
for (var_name in variable_names) {
  assign(var_name, .pp(get(var_name)))
}

DoubletFinder()

path1 = "..."

drm_labels <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
cell_ids<- paste0('Vm', drm_labels)
start_time <- system.time({
  for (i in seq_along(cell_ids)) {
    print(paste0("This is ", i, "th object Doublet Removal"))
    objd <- get(cell_ids[i])
    table(Idents(objd))
    objd$seurat_clusters0 <- Idents(objd)
    annotation <- objd@meta.data$seurat_clusters0
    homotypic.prop <- modelHomotypic(annotation)
    homotypic.prop
    nExp_poi <- round(0.107069171*length(objd@active.ident))
    nExp_poi.adj <- round(nExp_poi*(1-homotypic.prop))
    print(i)
    print(nExp_poi)
    print(nExp_poi.adj)  

    sweep.res.list <- paramSweep_v3(objd, PCs = 1:10, sct = FALSE)
    sweep.stats <- summarizeSweep(sweep.res.list, GT = FALSE)
    bcmvn <- find.pK(sweep.stats)
    bcmvn.max <- bcmvn[which.max(bcmvn$BCmetric),]
    optimal.pk <- bcmvn.max$pK
    optimal.pk <- as.numeric(levels(optimal.pk))[optimal.pk]
    print(i)
    print(optimal.pk)

    objd <- doubletFinder_v3(objd, PCs = 1:10, pN = 0.25, pK = optimal.pk, nExp = nExp_poi, reuse.pANN = FALSE, sct = FALSE)
    pANN_reuse <- paste0("pANN_0.25_", optimal.pk, "_", nExp_poi)
    gb <- paste0("DF.classifications_0.25_", optimal.pk, "_", nExp_poi.adj)
    objd <- doubletFinder_v3(objd, PCs = 1:10, pN = 0.25, pK = optimal.pk, nExp = nExp_poi.adj, reuse.pANN = pANN_reuse, sct = FALSE)

    Vm_var_name <- paste0("Vm_r_10_", drm_labels[i])
    print(Vm_var_name)
    Vm_before_removal_var_name <- paste0("Vm_br_10_", drm_labels[i])
    print(Vm_before_removal_var_name)
    assign(Vm_before_removal_var_name, objd)
    assign(Vm_var_name, objd[, objd@meta.data[[gb]] == 'Singlet'])
    print(paste0(i," th object Doublet Removal has been completed."))
    filepath = paste0(path1, "/", cell_ids[i], ".rds")
    saveRDS(get(Vm_var_name), file = filepath)
    print(paste0("Saving", i, "th object after Doublet Removal has been completed."))
  }
})
print(paste("Total runtime:", format(start_time["elapsed"], digits=2), "seconds"))
print("Task has done.")

path0 = "..." # path for .rds file (which is fully pre-processed & doublet removed)
for (var_name in variable_names) {
  assign(var_name, readRDS(paste0(path0, "/", var_name, ".rds")))
}

Merge & FastMNN


Vmc <- merge(
  x = Vm0,
  y = c(Vm1, Vm2, Vm3, Vm4, Vm5, Vm6, Vm7, Vm8, Vm9, Vm10),
  add.cell.ids = cell_ids,
  project = "Vm"
)
.pp2 <- funciton(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 3000)
  x <- ScaleData(x)
  x <- RunPCA(x)
  x <- FindNeighbors(x, dims = 1:25)
  x <- FindClusters(x, resolution = 1.0, verbose = FALSE)
  x <- RunUMAP(x, dims = 1:25)
  return(x)
}
Vmc_25_1.0 <- .pp2(Vmc)

object_list <- list(Vm0,Vm1,Vm2,Vm3,Vm4,Vm5,Vm6,Vm7,Vm8,Vm9,Vm10)
id.list <- list("0","1","2","3","4","5","6","7","8","9","10")

Vc <- Merge_Seurat_List(
  list_seurat = object_list,
  add.cell.ids = id.list,
  merge.data = TRUE,
  project = "SeuratProject"
)

Vcm <- readRDS("~/scRNA-ATAC multiome/RDS/Vcm.rds")

Vcm@assays$RNA@scale.data <- as.matrix(0)

Vcm.list <- SplitObject(Vcm,split.by="sample_id")

for (i in 1:length(Vcm.list)) {
  Vcm.list[[i]] <- NormalizeData(Vcm.list[[i]],assay="RNA")
  Vcm.list[[i]] <- FindVariableFeatures(Vcm.list[[i]], selection.method="vst", nfeatures=2000)
}

.mnn <- function(list){
  y <- RunFastMNN(object.list = list)
  y <- RunUMAP(y, reduction = "mnn", dims = 1:30)
  y <- FindNeighbors(y, reduction = "mnn", dims = 1:30)
  y <- FindClusters(y)
  return(y)
}

V <- .mnn(Vcm.list)

#> The current automatic grid maker returned an error when called on 'x'. Please use setAutoGridMaker() to set a valid automatic grid maker.
#> Warning: Keys should be one or more alphanumeric characters followed by an underscore, setting key from mnn.reconstructed_ to mnnreconstructed_
dcollins15 commented 4 months ago

Thanks for using Seurat!

It appears that this issue has gone stale. In an effort to keep our Issues board from getting more unruly than it already is, we’re going to begin closing out issues that haven’t had any activity since the release of v4.4.0.

If this issue is still relevant we strongly encourage you to reopen or repost it, especially if you didn’t initially receive a response from us.