satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.28k stars 912 forks source link

joinlayers does not join scale.data #9172

Closed asmlgkj closed 1 month ago

asmlgkj commented 3 months ago

Thanks a lot. I am learning seurat v5.

rm(list = ls())
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
library(Azimuth)
library(ggplot2)
library(patchwork)
options(future.globals.maxSize = 1e9)

# load in the pbmc systematic comparative analysis dataset
obj <- LoadData("pbmcsca")
obj <- subset(obj, nFeature_RNA > 1000)
obj <- RunAzimuth(obj, reference = "pbmcref")
# currently, the object has two layers in the RNA assay: counts, and data
obj

obj[["RNA"]] <- split(obj[["RNA"]], f = obj$Method)
obj

obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)

obj <- FindNeighbors(obj, dims = 1:30, reduction = "pca")
obj <- FindClusters(obj, resolution = 2, cluster.name = "unintegrated_clusters")

obj <- RunUMAP(obj, dims = 1:30, reduction = "pca", reduction.name = "umap.unintegrated")
# visualize by batch and cell type annotation
# cell type annotations were previously added by Azimuth
DimPlot(obj, reduction = "umap.unintegrated", group.by = c("Method", "predicted.celltype.l2"))

obj <- IntegrateLayers(
  object = obj, method = HarmonyIntegration,
  orig.reduction = "pca", new.reduction = "harmony",
  verbose = FALSE
)

obj2 <- JoinLayers(obj)

The above code show layers as follwong. I am wondering why scale.data was not for each sample image

when I used the similar code to analyze my onwn data, I found as follwing. It has many scale.data, including for each sample, where I am wrong

72c717e14d451012536582983f785f80

samuel-marsh commented 3 months ago

Hi @asmlgkj,

Can you post the full code that you are running with your samples?

The data should look like it does in the example data. The reason is because of what ScaleData is doing and why you run it (creating scaled expression across all cells). It would not be valid to combine scaled data from multiple samples into a new matrix just by combining them because they need to be scaled together. I'm not sure how you ended up with multiple layers like that as following the vignette code I cannot replicate what you are seeing with your own data.

Best, Sam

asmlgkj commented 2 months ago

thanks a lot, this is the code

rm(list = ls()) Sys.setenv(R_MAX_NUM_DLLS=999) gc() base_dir <- '/home/huyu' log_file <- "log.txt" sink(log_file, append = TRUE, split = TRUE)

cat(paste0("\n", Sys.time(), " - Starting analysis\n")) suppressPackageStartupMessages({ library(Seurat) library(tidyselect) library(tidyverse) library(dplyr) library(ggpubr) library(DoubletFinder) library(crayon) })

base_dir <- '/home/huyu' dir1_1 <- base_dir qc_dir <- file.path(base_dir, "1_QC") dir.create(qc_dir, showWarnings = FALSE, recursive = TRUE) setwd(qc_dir)

1.1 Run QC

sample_name1 <- c("sample1", "sample2", "sample3", "sample4", "sample6")

scRNAlist <- list() qc_before <- list() qc_after <- list()

basic_qc <- function(input_sce, sampleid) {

Mitochondrial genes

mito_genes <- rownames(input_sce)[grep("^MT-", rownames(input_sce), ignore.case = TRUE)]

print(mito_genes)

input_sce <- PercentageFeatureSet(input_sce, features = mito_genes, col.name = "percent_mito") fivenum(input_sce@meta.data$percent_mito)

Ribosomal genes

ribo_genes <- rownames(input_sce)[grep("^Rp[sl]", rownames(input_sce), ignore.case = TRUE)]

print(ribo_genes)

input_sce <- PercentageFeatureSet(input_sce, features = ribo_genes, col.name = "percent_ribo") fivenum(input_sce@meta.data$percent_ribo)

Hemoglobin genes

Hb_genes <- rownames(input_sce)[grep("^Hb[^(p)]", rownames(input_sce), ignore.case = TRUE)]

print(Hb_genes)

input_sce <- PercentageFeatureSet(input_sce, features = Hb_genes, col.name = "percent_hb") fivenum(input_sce@meta.data$percent_hb)

Plot nFeature and nCount

feats <- c("nFeature_RNA", "nCount_RNA") p1 <- VlnPlot(input_sce, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 2) + NoLegend() p1 w <- length(unique(input_sce$orig.ident)) / 3 + 5 ggsave(filename = paste0("Vlnplot_pre_qc_nFeaturenCount", sampleid, ".pdf"), plot = p1, width = w, height = 5)

Plot percentages

feats <- c("percent_mito", "percent_ribo", "percent_hb") p2 <- VlnPlot(input_sce, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 3, same.y.lims = TRUE) + scale_y_continuous(breaks = seq(0, 100, 5)) + NoLegend() p2
w <- length(unique(input_sce$orig.ident)) / 2 + 5 ggsave(filename = paste0("Vlnplot_pre_qc_mito_ribohb", sampleid, ".pdf"), plot = p2, width = w, height = 5)

Scatter plot

p3 <- FeatureScatter(input_sce, "nCount_RNA", "nFeatureRNA", group.by = "orig.ident", pt.size = 0.5) ggsave(filename = paste0("Scatterplot", sampleid, ".pdf"), plot = p3)

input_sce.filt <- input_sce

Top 50 most expressed genes

C <- subset(input_sce.filt, downsample = 100)@assays$RNA$counts C <- Matrix::t(Matrix::t(C) / Matrix::colSums(C)) * 100 most_expressed <- order(apply(C, 1, median), decreasing = TRUE)[50:1] pdf(paste0("TOP50_most_expressedgene", sampleid, ".pdf"), width = 14) boxplot(as.matrix(Matrix::t(C[most_expressed, ])), cex = 0.1, las = 1, xlab = "% total count per cell", col = (scales::hue_pal())(50)[50:1], horizontal = TRUE) dev.off() rm(C)

Cell selection based on criteria

selected_mito <- WhichCells(input_sce.filt, expression = percent_mito < 25) selected_ribo <- WhichCells(input_sce.filt, expression = percent_ribo > 3) selected_hb <- WhichCells(input_sce.filt, expression = percent_hb < 1) selected_exp <- WhichCells(input_sce.filt, expression = nFeature_RNA > 200 & nCount_RNA > 500)

Subsetting

input_sce.filt <- subset(input_sce.filt, cells = selected_mito) input_sce.filt <- subset(input_sce.filt, cells = selected_ribo) input_sce.filt <- subset(input_sce.filt, cells = selected_hb) input_sce.filt <- subset(input_sce.filt, cells = selected_exp)

Conditional block (currently not executed)

if (FALSE) { selected_f <- rownames(input_sce)[Matrix::rowSums(input_sce@assays$RNA$counts > 0) > 3] input_sce.filt <- subset(input_sce, cells = selected_c) dim(input_sce) dim(input_sce.filt) }

Final plots

feats <- c("nFeature_RNA", "nCount_RNA") p1_filtered <- VlnPlot(input_sce.filt, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 2) + NoLegend() w <- length(unique(input_sce.filt$orig.ident)) / 3 + 5 ggsave(filename = paste0("Vlnplot_after_qc_nFeaturenCount", sampleid, ".pdf"), plot = p1_filtered, width = w, height = 5)

feats <- c("percent_mito", "percent_ribo", "percent_hb") p2_filtered <- VlnPlot(input_sce.filt, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 3) + NoLegend() w <- length(unique(input_sce.filt$orig.ident)) / 2 + 5 ggsave(filename = paste0("Vlnplot_after_qc_mito_ribohb", sampleid, ".pdf"), plot = p2_filtered, width = w, height = 5)

return(input_sce.filt) }

for (i in 1:length(sample_name1)) { cat("Processing subset", i, "of", sample_name1[i], "\n") current_dir <- file.path(dir1_1, sample_name1[i]) counts <- Read10X(data.dir = current_dir) scRNAlist[[i]] <- CreateSeuratObject(counts, project = sample_name1[i], min.cells = 3, min.features = 200) scRNAlist[[i]][["percent.mt"]] <- PercentageFeatureSet(scRNAlist[[i]], pattern = "^mt-") scRNAlist[[i]]$log10GenesPerUMI <- log10(scRNAlist[[i]]$nFeature_RNA)/log10(scRNAlist[[i]]$nCount_RNA) qc_before[[i]] <- scRNAlist[[i]] scRNAlist[[i]] <- basic_qc(scRNAlist[[i]], sample_name1[i]) qc_after[[i]] <- scRNAlist[[i]] scRNAlist[[i]] <- NormalizeData(scRNAlist[[i]]) scRNAlist[[i]] <- ScaleData(scRNAlist[[i]], verbose = FALSE) scRNAlist[[i]] <- FindVariableFeatures(scRNAlist[[i]], verbose = FALSE) scRNAlist[[i]] <- RunPCA(scRNAlist[[i]], npcs = 30, verbose = FALSE) scRNAlist[[i]] <- RunUMAP(scRNAlist[[i]], reduction = "pca", dims = 1:30) scRNAlist[[i]] <- FindNeighbors(scRNAlist[[i]], reduction = "pca", dims = 1:30) scRNAlist[[i]] <- FindClusters(scRNAlist[[i]], resolution = 0.8) print('layers after qc-----------------') print(names(scRNAlist[[i]][["RNA"]]@layers)) }

save(scRNAlist, file = './1_QC_scRNAlist') save(qc_before, file = './qc_before_scRNAlist') save(qc_after, file = './qc_after_scRNAlist')

load('1_QC_scRNAlist') load('qc_before_scRNAlist') load('qc_after_scRNAlist')

qc_before <- merge(qc_before[[1]], y = qc_before[-1]) qc_after <- merge(qc_after[[1]], y = qc_after[-1]) scRNA_qc <- merge(scRNAlist[[1]], y = scRNAlist[-1])

saveRDS(qc_before, "qc_before.rds") saveRDS(qc_after, "qc_after.rds") saveRDS(scRNA_qc, "scRNA_qc.rds")

qc_before <- readRDS("./qc_before.rds") qc_after <- readRDS("./qc_after.rds") scRNA_qc <- readRDS("./scRNA_qc.rds")

all_samples_qc_before <- unique(qc_before$orig.ident) if (!all(all_samples_qc_before %in% sample_name1)) { cat(red("Some samples in the qc_before data are not included in sample_name1\n")) }

qc_before$orig.ident <- factor(x = qc_before$orig.ident, levels = sample_name1) pdf("qc_before_VlnPlot.pdf", width = 12, height = 4) VlnPlot(qc_before, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), group.by = "orig.ident", pt.size = 0, ncol = 3) dev.off()

pdf("qc_before_Genes_count_VlnPlot.pdf", width = 12, height = 4) VlnPlot(qc_before, features = "nFeature_RNA", group.by = "orig.ident", pt.size = 0) + xlab("Sample") + ylab("Genes count") + ggtitle("") dev.off()

all_samples_qc_after <- unique(qc_after$orig.ident) if (!all(all_samples_qc_after %in% sample_name1)) { cat(red("Some samples in the qc_after data are not included in sample_name1\n")) }

qc_after$orig.ident <- factor(x = qc_after$orig.ident, levels = sample_name1) pdf("qc_after_VlnPlot.pdf", width = 12, height = 4) VlnPlot(qc_after, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), group.by = "orig.ident", pt.size = 0, ncol = 3) dev.off()

pdf("qc_after_Genes_count_VlnPlot.pdf", width = 12, height = 4) VlnPlot(qc_after, features = "nFeature_RNA", group.by = "orig.ident", pt.size = 0) + xlab("Sample") + ylab("Genes count") + ggtitle("") dev.off()

#############################################################################

二、doubletFinder----------------------------------------------------------------------------------------------------------------------------------------------

Doublet_dir <- file.path(base_dir, "2_Doublet") dir.create(Doublet_dir, showWarnings = FALSE, recursive = TRUE) setwd(paste(base_dir, "2_Doublet", sep = "/")) dir.create("doublet_counts", showWarnings = FALSE) Doublet_subset_before <- list() for(i in 1:length(sample_name1)){ cat("Processing subset", i, "of", sample_name1[i], "\n") print('layers before doubletFinder-----------------') print(names(scRNAlist[[i]][["RNA"]]@layers)) table(scRNAlist[[i]]$orig.ident) Doubletrate = ncol(scRNAlist[[i]])81e-6 sweep.res.list <- paramSweep(scRNAlist[[i]], PCs = 1:30, sct = FALSE) sweep.stats <- summarizeSweep(sweep.res.list, GT = FALSE) bcmvn <- find.pK(sweep.stats) mpK<-as.numeric(as.vector(bcmvn$pK[which.max(bcmvn$BCmetric)])) annotations <- scRNAlist[[i]]@meta.data$seurat_clusters homotypic.prop <- modelHomotypic(annotations)
nExp_poi <- round(Doubletratenrow(scRNAlist[[i]]@meta.data)) nExp_poi.adj <- round(nExp_poi(1-homotypic.prop)) scRNAlist[[i]] <- doubletFinder(scRNAlist[[i]], PCs = 1:30, pN = 0.25, pK = mpK, nExp = nExp_poi.adj, reuse.pANN = FALSE, sct = FALSE) scRNAlist[[i]]$doubFind_res = scRNAlist[[i]]@meta.data %>% dplyr::select(contains('DF.classifications')) scRNAlist[[i]]$doubFind_score = scRNAlist[[i]]@meta.data %>% dplyr::select(contains('pANN')) Doublet_subset_before[[i]] <- scRNAlist[[i]] doublet_count <- sum(Doublet_subset_before[[i]]$doubFind_res == "Doublet") singlet_count <- sum(Doublet_subset_before[[i]]$doubFind_res == "Singlet") result_df_doublet <- data.frame( Sample = sample_name1[i], Doublets = doublet_count, Singlets = singlet_count, Total = doublet_count + singlet_count ) current_sample <- sample_name1[i] write.csv(result_df_doublet, file = paste0("doublet_counts/", current_sample, "_doublet_counts.csv"), row.names = FALSE) cat("Processed sample", current_sample, "\n") scRNAlist[[i]] = subset(scRNAlist[[i]], doubFind_res == "Singlet") print('layers after doubletFinder-----------------') print(names(scRNAlist[[i]][["RNA"]]@layers)) print(table(scRNAlist[[i]]$orig.ident)) }

all_results <- do.call(rbind, lapply(list.files("doublet_counts", full.names = TRUE, pattern = "*.csv"), read.csv)) write.csv(all_results, file = "doublet_counts/all_samples_doublet_counts.csv", row.names = FALSE) cat("All results have been saved in the 'doublet_counts' directory.\n") save(scRNAlist, file = '2_Doublet_scRNAlist') save(Doublet_subset_before, file = 'Doublet_subset_before_scRNAlist')

load('2_Doublet_scRNAlist') load('Doublet_subset_before_scRNAlist')

Doublet_subset_before_merge <- merge(Doublet_subset_before[[1]], y = Doublet_subset_before[-1]) scRNA_Doublet <- merge(scRNAlist[[1]], y = scRNAlist[-1])

saveRDS(Doublet_subset_before_merge, "Doublet_subset_before_merge.rds") saveRDS(scRNA_Doublet, "scRNA_Doublet.rds")

Doublet_subset_before <- readRDS("Doublet_subset_before_merge.rds") scRNA_Doublet <- readRDS("scRNA_Doublet.rds")

sc_Merge <- scRNA_Doublet sc_Merge <- NormalizeData(sc_Merge) sc_Merge <- FindVariableFeatures(sc_Merge) sc_Merge <- ScaleData(sc_Merge) sc_Merge <- RunPCA(sc_Merge, npcs = 30, verbose = TRUE) sc_Merge <- RunUMAP(sc_Merge, dims = 1:30, reduction = "pca", reduction.name = "umap.unintegrated") print(names(sc_Merge[["RNA"]]@layers)) pdf('umap.unintegrated.pdf', width = 18, height = 12) umap_plot <- DimPlot(sc_Merge, reduction = "umap.unintegrated", group.by = c("orig.ident", 'seurat_clusters'), label = TRUE) print(umap_plot) dev.off()

sc_Merge_Integrated <- IntegrateLayers( object = sc_Merge, method = HarmonyIntegration, orig.reduction = "pca", new.reduction = "harmony", verbose = T ) print(names(sc_Merge_Integrated[["RNA"]]@layers)) sc_Merge_Integrated_joined <- JoinLayers(sc_Merge_Integrated) print(names(sc_Merge_Integrated_joined[["RNA"]]@layers)) sc_Merge_Integrated_joined <- FindNeighbors(sc_Merge_Integrated_joined, reduction = "harmony", dims = 1:30)

resolutions <- c(0.2, 0.5, 0.6, 0.8, 1.0, 1.2, 1.5, 1.8, 2) cell_type_markers <- c( 'CDKN2A', 'EPCAM', 'CD24', 'CDH1', # Epithelial cells 'PECAM1', 'CDH5', 'ENG', # Endothelial cells 'COL1A2', 'DCN', 'APOD', # Fibroblasts 'ACTA2', 'ACTG2', # Smooth muscle cells 'CD3E', 'CD3D', 'CD2', # Lymphocytes 'CD68', 'CD163', 'LYZ', # Macrophage 'CSF3R' # Neutrophils )

generate_plots <- function(seurat_obj, resolution) { seurat_obj <- FindClusters(seurat_obj, resolution = resolution, cluster.name = paste0("harmony_clusters_res", resolution)) seurat_obj <- RunUMAP(seurat_obj, reduction = "harmony", dims = 1:30, reduction.name = paste0("umap.harmony_res", resolution))

p_umap <- DimPlot( seurat_obj, reduction = paste0("umap.harmony_res", resolution), group.by = c("orig.ident", paste0("harmony_clusters_res", resolution)), combine = TRUE, label.size = 2, label = TRUE )

ggsave(paste0("harmony_integrated_res", resolution, ".pdf"), plot = p_umap, width = 18, height = 12)

dot_plot <- DotPlot(seurat_obj, features = cell_type_markers, group.by = paste0("harmony_clusters_res", resolution)) + coord_flip() + ggtitle(paste("Resolution:", resolution))

ggsave(paste0("DotPlot_harmony_res", resolution, ".pdf"), plot = dot_plot, width = 12, height = 10)

return(seurat_obj) }

for (res in resolutions) { cat("Processing resolution:", res, "\n") sc_Merge_Integrated_joined <- generate_plots(sc_Merge_Integrated_joined, res) }

saveRDS(sc_Merge_Integrated_joined, "sc_Merge_Integrated_joined_multi_res.rds")

sc_Merge_Integrated_joined <- FindClusters(sc_Merge_Integrated_joined, resolution = 0.8, cluster.name = "harmony_clusters")

sc_Merge_Integrated_joined <- RunUMAP(sc_Merge_Integrated_joined, reduction = "harmony", dims = 1:30, reduction.name = "umap.harmony")

p2_umap <- DimPlot(

sc_Merge_Integrated_joined,

reduction = "umap.harmony",

group.by = c("orig.ident", "harmony_clusters"),

combine = T, label.size = 2, label = TRUE

)

pdf('harmony.integrated.pdf', width = 18, height = 12)

print(p2_umap)

dev.off()

feats <- c("nFeature_RNA", "nCount_RNA") p1_1=VlnPlot(sc_Merge_Integrated_joined , features = feats, pt.size = 0, ncol = 2) + NoLegend() pdf('qc_nFeature_count_harmony.integrated.pdf', width = 18, height = 12) print(p1_1) dev.off() feats <- c("percent_mito", "percent_ribo", "percent_hb") p1_2=VlnPlot(sc_Merge_Integrated_joined , features = feats, pt.size = 0, ncol = 3) + NoLegend() pdf('qc_mt_harmony.integrated.pdf', width = 18, height = 12) print(p1_2) dev.off()

feats <- c("nFeature_RNA", "nCount_RNA") p1_3=VlnPlot(sc_Merge_Integrated_joined , features = feats, pt.size = 0, ncol = 2, group.by = "orig.ident") + NoLegend() pdf('qc_nFeature_count_harmony.integrated_sample.pdf', width = 18, height = 12) print(p1_3) dev.off() feats <- c("percent_mito", "percent_ribo", "percent_hb") p1_4=VlnPlot(sc_Merge_Integrated_joined , features = feats, pt.size = 0, ncol = 3, group.by = "orig.ident") + NoLegend() pdf('qc_mt_harmony.integrated_sample.pdf', width = 18, height = 12) print(p1_4) dev.off()

cat(paste0("\n", Sys.time(), " - Analysis completed\n")) sink()

asmlgkj commented 2 months ago

@samuel-marsh thanks a lot, you are really very kind

asmlgkj commented 2 months ago

I used the public data E-MTAB-11948, thanksa lot

i <- 1 print(names(scRNAlist[[i]]@assays$RNA@layers)) [1] "counts" "data" "scale.data" i <- 2 print(names(scRNAlist[[i]]@assays$RNA@layers)) [1] "counts" "data" "scale.data"

scRNA_qc <- merge(scRNAlist[[1]], y = scRNAlist[-1]) Warning: Some cell names are duplicated across objects provided. Renaming to enforce unique cell names. print(names(scRNA_qc@assays$RNA@layers)) [1] "counts.sample1" "counts.sample2" "data.sample1"
[4] "scale.data.sample1" "data.sample2" "scale.data.sample2" 4318ea8501461e7135a4409284f987ac

image

image

image

After scaledata again, the "scale.data.sample1", "scale.data.sample2" ,"scale.data" come out. this is a common question puzzling me, what steps can be done once by once, but what can not

asmlgkj commented 2 months ago

I used the public data E-MTAB-11948, thanksa lot

i <- 1 print(names(scRNAlist[[i]]@assays$RNA@layers)) [1] "counts" "data" "scale.data" i <- 2 print(names(scRNAlist[[i]]@assays$RNA@layers)) [1] "counts" "data" "scale.data"

scRNA_qc <- merge(scRNAlist[[1]], y = scRNAlist[-1]) Warning: Some cell names are duplicated across objects provided. Renaming to enforce unique cell names. print(names(scRNA_qc@assays$RNA@layers)) [1] "counts.sample1" "counts.sample2" "data.sample1"
[4] "scale.data.sample1" "data.sample2" "scale.data.sample2" 4318ea8501461e7135a4409284f987ac

image

image

image

After scaledata again, the "scale.data.sample1", "scale.data.sample2" ,"scale.data" come out. this is a common question puzzling me, what steps can be done once by once, but what can not

asmlgkj commented 2 months ago

One key question i"My thinking has been deeply guided by this sentence from the tutorial: 'Once integrative analysis is complete, you can rejoin the layers.' So I have two questions:

Is JoinLayers only usable after IntegrateLayers? Can JoinLayers be used repeatedly?

Could you please clarify these points?"s whether

samuel-marsh commented 2 months ago

So yes you can join layers right after integrate layers but can also do it later it just depends on whether what you are doing requires all of the data separate or not.

Once you run JoinLayers the layers are merged so running repeatedly after that will not do anything unless you have split the layers again.

So you got scale.data layers because you merged objects. I would suggest joining layers to condense the data and then can split again to start your analysis.

Best, Sam

asmlgkj commented 2 months ago

Thanks a lot. I am still confused. I do qc, Doublet seperately, then merge and normalize, scaledata, integrate , joinlayers. and get the result I showed. is there anything wrong, when you said ' you got scale.data layers because you merged objects.' do you mean merge here is not needed?

I am confused about which step must be analyzed in separately for each sample, and which should be merge or merged and joined, then to analyze furtherly, Can you provide me with further guidance

the seurat tutorial mentioned joinlayers should be done before differential gene analysis , and I not found other more information

asmlgkj commented 2 months ago

Thanks a lot in https://satijalab.org/seurat/articles/seurat5_integration, it split seurat, but does not use merge, so in my analysis code, do I need merge or not, question like this also make me puzzled

samuel-marsh commented 1 month ago

So you can still use merge at first but then you should run joinlayers before starting analysis. This will remove the multiple scale.data layers