Closed asmlgkj closed 1 month ago
Hi @asmlgkj,
Can you post the full code that you are running with your samples?
The data should look like it does in the example data. The reason is because of what ScaleData
is doing and why you run it (creating scaled expression across all cells). It would not be valid to combine scaled data from multiple samples into a new matrix just by combining them because they need to be scaled together. I'm not sure how you ended up with multiple layers like that as following the vignette code I cannot replicate what you are seeing with your own data.
Best, Sam
rm(list = ls()) Sys.setenv(R_MAX_NUM_DLLS=999) gc() base_dir <- '/home/huyu' log_file <- "log.txt" sink(log_file, append = TRUE, split = TRUE)
cat(paste0("\n", Sys.time(), " - Starting analysis\n")) suppressPackageStartupMessages({ library(Seurat) library(tidyselect) library(tidyverse) library(dplyr) library(ggpubr) library(DoubletFinder) library(crayon) })
base_dir <- '/home/huyu' dir1_1 <- base_dir qc_dir <- file.path(base_dir, "1_QC") dir.create(qc_dir, showWarnings = FALSE, recursive = TRUE) setwd(qc_dir)
sample_name1 <- c("sample1", "sample2", "sample3", "sample4", "sample6")
scRNAlist <- list() qc_before <- list() qc_after <- list()
basic_qc <- function(input_sce, sampleid) {
mito_genes <- rownames(input_sce)[grep("^MT-", rownames(input_sce), ignore.case = TRUE)]
input_sce <- PercentageFeatureSet(input_sce, features = mito_genes, col.name = "percent_mito") fivenum(input_sce@meta.data$percent_mito)
ribo_genes <- rownames(input_sce)[grep("^Rp[sl]", rownames(input_sce), ignore.case = TRUE)]
input_sce <- PercentageFeatureSet(input_sce, features = ribo_genes, col.name = "percent_ribo") fivenum(input_sce@meta.data$percent_ribo)
Hb_genes <- rownames(input_sce)[grep("^Hb[^(p)]", rownames(input_sce), ignore.case = TRUE)]
input_sce <- PercentageFeatureSet(input_sce, features = Hb_genes, col.name = "percent_hb") fivenum(input_sce@meta.data$percent_hb)
feats <- c("nFeature_RNA", "nCount_RNA") p1 <- VlnPlot(input_sce, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 2) + NoLegend() p1 w <- length(unique(input_sce$orig.ident)) / 3 + 5 ggsave(filename = paste0("Vlnplot_pre_qc_nFeaturenCount", sampleid, ".pdf"), plot = p1, width = w, height = 5)
feats <- c("percent_mito", "percent_ribo", "percent_hb")
p2 <- VlnPlot(input_sce, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 3, same.y.lims = TRUE) +
scale_y_continuous(breaks = seq(0, 100, 5)) +
NoLegend()
p2
w <- length(unique(input_sce$orig.ident)) / 2 + 5
ggsave(filename = paste0("Vlnplot_pre_qc_mito_ribohb", sampleid, ".pdf"), plot = p2, width = w, height = 5)
p3 <- FeatureScatter(input_sce, "nCount_RNA", "nFeatureRNA", group.by = "orig.ident", pt.size = 0.5) ggsave(filename = paste0("Scatterplot", sampleid, ".pdf"), plot = p3)
input_sce.filt <- input_sce
C <- subset(input_sce.filt, downsample = 100)@assays$RNA$counts C <- Matrix::t(Matrix::t(C) / Matrix::colSums(C)) * 100 most_expressed <- order(apply(C, 1, median), decreasing = TRUE)[50:1] pdf(paste0("TOP50_most_expressedgene", sampleid, ".pdf"), width = 14) boxplot(as.matrix(Matrix::t(C[most_expressed, ])), cex = 0.1, las = 1, xlab = "% total count per cell", col = (scales::hue_pal())(50)[50:1], horizontal = TRUE) dev.off() rm(C)
selected_mito <- WhichCells(input_sce.filt, expression = percent_mito < 25) selected_ribo <- WhichCells(input_sce.filt, expression = percent_ribo > 3) selected_hb <- WhichCells(input_sce.filt, expression = percent_hb < 1) selected_exp <- WhichCells(input_sce.filt, expression = nFeature_RNA > 200 & nCount_RNA > 500)
input_sce.filt <- subset(input_sce.filt, cells = selected_mito) input_sce.filt <- subset(input_sce.filt, cells = selected_ribo) input_sce.filt <- subset(input_sce.filt, cells = selected_hb) input_sce.filt <- subset(input_sce.filt, cells = selected_exp)
if (FALSE) { selected_f <- rownames(input_sce)[Matrix::rowSums(input_sce@assays$RNA$counts > 0) > 3] input_sce.filt <- subset(input_sce, cells = selected_c) dim(input_sce) dim(input_sce.filt) }
feats <- c("nFeature_RNA", "nCount_RNA") p1_filtered <- VlnPlot(input_sce.filt, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 2) + NoLegend() w <- length(unique(input_sce.filt$orig.ident)) / 3 + 5 ggsave(filename = paste0("Vlnplot_after_qc_nFeaturenCount", sampleid, ".pdf"), plot = p1_filtered, width = w, height = 5)
feats <- c("percent_mito", "percent_ribo", "percent_hb") p2_filtered <- VlnPlot(input_sce.filt, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 3) + NoLegend() w <- length(unique(input_sce.filt$orig.ident)) / 2 + 5 ggsave(filename = paste0("Vlnplot_after_qc_mito_ribohb", sampleid, ".pdf"), plot = p2_filtered, width = w, height = 5)
return(input_sce.filt) }
for (i in 1:length(sample_name1)) { cat("Processing subset", i, "of", sample_name1[i], "\n") current_dir <- file.path(dir1_1, sample_name1[i]) counts <- Read10X(data.dir = current_dir) scRNAlist[[i]] <- CreateSeuratObject(counts, project = sample_name1[i], min.cells = 3, min.features = 200) scRNAlist[[i]][["percent.mt"]] <- PercentageFeatureSet(scRNAlist[[i]], pattern = "^mt-") scRNAlist[[i]]$log10GenesPerUMI <- log10(scRNAlist[[i]]$nFeature_RNA)/log10(scRNAlist[[i]]$nCount_RNA) qc_before[[i]] <- scRNAlist[[i]] scRNAlist[[i]] <- basic_qc(scRNAlist[[i]], sample_name1[i]) qc_after[[i]] <- scRNAlist[[i]] scRNAlist[[i]] <- NormalizeData(scRNAlist[[i]]) scRNAlist[[i]] <- ScaleData(scRNAlist[[i]], verbose = FALSE) scRNAlist[[i]] <- FindVariableFeatures(scRNAlist[[i]], verbose = FALSE) scRNAlist[[i]] <- RunPCA(scRNAlist[[i]], npcs = 30, verbose = FALSE) scRNAlist[[i]] <- RunUMAP(scRNAlist[[i]], reduction = "pca", dims = 1:30) scRNAlist[[i]] <- FindNeighbors(scRNAlist[[i]], reduction = "pca", dims = 1:30) scRNAlist[[i]] <- FindClusters(scRNAlist[[i]], resolution = 0.8) print('layers after qc-----------------') print(names(scRNAlist[[i]][["RNA"]]@layers)) }
save(scRNAlist, file = './1_QC_scRNAlist') save(qc_before, file = './qc_before_scRNAlist') save(qc_after, file = './qc_after_scRNAlist')
load('1_QC_scRNAlist') load('qc_before_scRNAlist') load('qc_after_scRNAlist')
qc_before <- merge(qc_before[[1]], y = qc_before[-1]) qc_after <- merge(qc_after[[1]], y = qc_after[-1]) scRNA_qc <- merge(scRNAlist[[1]], y = scRNAlist[-1])
saveRDS(qc_before, "qc_before.rds") saveRDS(qc_after, "qc_after.rds") saveRDS(scRNA_qc, "scRNA_qc.rds")
qc_before <- readRDS("./qc_before.rds") qc_after <- readRDS("./qc_after.rds") scRNA_qc <- readRDS("./scRNA_qc.rds")
all_samples_qc_before <- unique(qc_before$orig.ident) if (!all(all_samples_qc_before %in% sample_name1)) { cat(red("Some samples in the qc_before data are not included in sample_name1\n")) }
qc_before$orig.ident <- factor(x = qc_before$orig.ident, levels = sample_name1) pdf("qc_before_VlnPlot.pdf", width = 12, height = 4) VlnPlot(qc_before, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), group.by = "orig.ident", pt.size = 0, ncol = 3) dev.off()
pdf("qc_before_Genes_count_VlnPlot.pdf", width = 12, height = 4) VlnPlot(qc_before, features = "nFeature_RNA", group.by = "orig.ident", pt.size = 0) + xlab("Sample") + ylab("Genes count") + ggtitle("") dev.off()
all_samples_qc_after <- unique(qc_after$orig.ident) if (!all(all_samples_qc_after %in% sample_name1)) { cat(red("Some samples in the qc_after data are not included in sample_name1\n")) }
qc_after$orig.ident <- factor(x = qc_after$orig.ident, levels = sample_name1) pdf("qc_after_VlnPlot.pdf", width = 12, height = 4) VlnPlot(qc_after, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), group.by = "orig.ident", pt.size = 0, ncol = 3) dev.off()
pdf("qc_after_Genes_count_VlnPlot.pdf", width = 12, height = 4) VlnPlot(qc_after, features = "nFeature_RNA", group.by = "orig.ident", pt.size = 0) + xlab("Sample") + ylab("Genes count") + ggtitle("") dev.off()
#############################################################################
Doublet_dir <- file.path(base_dir, "2_Doublet")
dir.create(Doublet_dir, showWarnings = FALSE, recursive = TRUE)
setwd(paste(base_dir, "2_Doublet", sep = "/"))
dir.create("doublet_counts", showWarnings = FALSE)
Doublet_subset_before <- list()
for(i in 1:length(sample_name1)){
cat("Processing subset", i, "of", sample_name1[i], "\n")
print('layers before doubletFinder-----------------')
print(names(scRNAlist[[i]][["RNA"]]@layers))
table(scRNAlist[[i]]$orig.ident)
Doubletrate = ncol(scRNAlist[[i]])81e-6
sweep.res.list <- paramSweep(scRNAlist[[i]], PCs = 1:30, sct = FALSE)
sweep.stats <- summarizeSweep(sweep.res.list, GT = FALSE)
bcmvn <- find.pK(sweep.stats)
mpK<-as.numeric(as.vector(bcmvn$pK[which.max(bcmvn$BCmetric)]))
annotations <- scRNAlist[[i]]@meta.data$seurat_clusters
homotypic.prop <- modelHomotypic(annotations)
nExp_poi <- round(Doubletratenrow(scRNAlist[[i]]@meta.data))
nExp_poi.adj <- round(nExp_poi(1-homotypic.prop))
scRNAlist[[i]] <- doubletFinder(scRNAlist[[i]], PCs = 1:30, pN = 0.25, pK = mpK, nExp = nExp_poi.adj, reuse.pANN = FALSE, sct = FALSE)
scRNAlist[[i]]$doubFind_res = scRNAlist[[i]]@meta.data %>% dplyr::select(contains('DF.classifications'))
scRNAlist[[i]]$doubFind_score = scRNAlist[[i]]@meta.data %>% dplyr::select(contains('pANN'))
Doublet_subset_before[[i]] <- scRNAlist[[i]]
doublet_count <- sum(Doublet_subset_before[[i]]$doubFind_res == "Doublet")
singlet_count <- sum(Doublet_subset_before[[i]]$doubFind_res == "Singlet")
result_df_doublet <- data.frame(
Sample = sample_name1[i],
Doublets = doublet_count,
Singlets = singlet_count,
Total = doublet_count + singlet_count
)
current_sample <- sample_name1[i]
write.csv(result_df_doublet, file = paste0("doublet_counts/", current_sample, "_doublet_counts.csv"), row.names = FALSE)
cat("Processed sample", current_sample, "\n")
scRNAlist[[i]] = subset(scRNAlist[[i]], doubFind_res == "Singlet")
print('layers after doubletFinder-----------------')
print(names(scRNAlist[[i]][["RNA"]]@layers))
print(table(scRNAlist[[i]]$orig.ident))
}
all_results <- do.call(rbind, lapply(list.files("doublet_counts", full.names = TRUE, pattern = "*.csv"), read.csv)) write.csv(all_results, file = "doublet_counts/all_samples_doublet_counts.csv", row.names = FALSE) cat("All results have been saved in the 'doublet_counts' directory.\n") save(scRNAlist, file = '2_Doublet_scRNAlist') save(Doublet_subset_before, file = 'Doublet_subset_before_scRNAlist')
load('2_Doublet_scRNAlist') load('Doublet_subset_before_scRNAlist')
Doublet_subset_before_merge <- merge(Doublet_subset_before[[1]], y = Doublet_subset_before[-1]) scRNA_Doublet <- merge(scRNAlist[[1]], y = scRNAlist[-1])
saveRDS(Doublet_subset_before_merge, "Doublet_subset_before_merge.rds") saveRDS(scRNA_Doublet, "scRNA_Doublet.rds")
Doublet_subset_before <- readRDS("Doublet_subset_before_merge.rds") scRNA_Doublet <- readRDS("scRNA_Doublet.rds")
sc_Merge <- scRNA_Doublet sc_Merge <- NormalizeData(sc_Merge) sc_Merge <- FindVariableFeatures(sc_Merge) sc_Merge <- ScaleData(sc_Merge) sc_Merge <- RunPCA(sc_Merge, npcs = 30, verbose = TRUE) sc_Merge <- RunUMAP(sc_Merge, dims = 1:30, reduction = "pca", reduction.name = "umap.unintegrated") print(names(sc_Merge[["RNA"]]@layers)) pdf('umap.unintegrated.pdf', width = 18, height = 12) umap_plot <- DimPlot(sc_Merge, reduction = "umap.unintegrated", group.by = c("orig.ident", 'seurat_clusters'), label = TRUE) print(umap_plot) dev.off()
sc_Merge_Integrated <- IntegrateLayers( object = sc_Merge, method = HarmonyIntegration, orig.reduction = "pca", new.reduction = "harmony", verbose = T ) print(names(sc_Merge_Integrated[["RNA"]]@layers)) sc_Merge_Integrated_joined <- JoinLayers(sc_Merge_Integrated) print(names(sc_Merge_Integrated_joined[["RNA"]]@layers)) sc_Merge_Integrated_joined <- FindNeighbors(sc_Merge_Integrated_joined, reduction = "harmony", dims = 1:30)
resolutions <- c(0.2, 0.5, 0.6, 0.8, 1.0, 1.2, 1.5, 1.8, 2) cell_type_markers <- c( 'CDKN2A', 'EPCAM', 'CD24', 'CDH1', # Epithelial cells 'PECAM1', 'CDH5', 'ENG', # Endothelial cells 'COL1A2', 'DCN', 'APOD', # Fibroblasts 'ACTA2', 'ACTG2', # Smooth muscle cells 'CD3E', 'CD3D', 'CD2', # Lymphocytes 'CD68', 'CD163', 'LYZ', # Macrophage 'CSF3R' # Neutrophils )
generate_plots <- function(seurat_obj, resolution) { seurat_obj <- FindClusters(seurat_obj, resolution = resolution, cluster.name = paste0("harmony_clusters_res", resolution)) seurat_obj <- RunUMAP(seurat_obj, reduction = "harmony", dims = 1:30, reduction.name = paste0("umap.harmony_res", resolution))
p_umap <- DimPlot( seurat_obj, reduction = paste0("umap.harmony_res", resolution), group.by = c("orig.ident", paste0("harmony_clusters_res", resolution)), combine = TRUE, label.size = 2, label = TRUE )
ggsave(paste0("harmony_integrated_res", resolution, ".pdf"), plot = p_umap, width = 18, height = 12)
dot_plot <- DotPlot(seurat_obj, features = cell_type_markers, group.by = paste0("harmony_clusters_res", resolution)) + coord_flip() + ggtitle(paste("Resolution:", resolution))
ggsave(paste0("DotPlot_harmony_res", resolution, ".pdf"), plot = dot_plot, width = 12, height = 10)
return(seurat_obj) }
for (res in resolutions) { cat("Processing resolution:", res, "\n") sc_Merge_Integrated_joined <- generate_plots(sc_Merge_Integrated_joined, res) }
feats <- c("nFeature_RNA", "nCount_RNA") p1_1=VlnPlot(sc_Merge_Integrated_joined , features = feats, pt.size = 0, ncol = 2) + NoLegend() pdf('qc_nFeature_count_harmony.integrated.pdf', width = 18, height = 12) print(p1_1) dev.off() feats <- c("percent_mito", "percent_ribo", "percent_hb") p1_2=VlnPlot(sc_Merge_Integrated_joined , features = feats, pt.size = 0, ncol = 3) + NoLegend() pdf('qc_mt_harmony.integrated.pdf', width = 18, height = 12) print(p1_2) dev.off()
feats <- c("nFeature_RNA", "nCount_RNA") p1_3=VlnPlot(sc_Merge_Integrated_joined , features = feats, pt.size = 0, ncol = 2, group.by = "orig.ident") + NoLegend() pdf('qc_nFeature_count_harmony.integrated_sample.pdf', width = 18, height = 12) print(p1_3) dev.off() feats <- c("percent_mito", "percent_ribo", "percent_hb") p1_4=VlnPlot(sc_Merge_Integrated_joined , features = feats, pt.size = 0, ncol = 3, group.by = "orig.ident") + NoLegend() pdf('qc_mt_harmony.integrated_sample.pdf', width = 18, height = 12) print(p1_4) dev.off()
cat(paste0("\n", Sys.time(), " - Analysis completed\n")) sink()
@samuel-marsh thanks a lot, you are really very kind
I used the public data E-MTAB-11948, thanksa lot
i <- 1 print(names(scRNAlist[[i]]@assays$RNA@layers)) [1] "counts" "data" "scale.data" i <- 2 print(names(scRNAlist[[i]]@assays$RNA@layers)) [1] "counts" "data" "scale.data"
scRNA_qc <- merge(scRNAlist[[1]], y = scRNAlist[-1]) Warning: Some cell names are duplicated across objects provided. Renaming to enforce unique cell names. print(names(scRNA_qc@assays$RNA@layers)) [1] "counts.sample1" "counts.sample2" "data.sample1"
[4] "scale.data.sample1" "data.sample2" "scale.data.sample2"
After scaledata again, the "scale.data.sample1", "scale.data.sample2" ,"scale.data" come out. this is a common question puzzling me, what steps can be done once by once, but what can not
I used the public data E-MTAB-11948, thanksa lot
i <- 1 print(names(scRNAlist[[i]]@assays$RNA@layers)) [1] "counts" "data" "scale.data" i <- 2 print(names(scRNAlist[[i]]@assays$RNA@layers)) [1] "counts" "data" "scale.data"
scRNA_qc <- merge(scRNAlist[[1]], y = scRNAlist[-1]) Warning: Some cell names are duplicated across objects provided. Renaming to enforce unique cell names. print(names(scRNA_qc@assays$RNA@layers)) [1] "counts.sample1" "counts.sample2" "data.sample1"
[4] "scale.data.sample1" "data.sample2" "scale.data.sample2"
After scaledata again, the "scale.data.sample1", "scale.data.sample2" ,"scale.data" come out. this is a common question puzzling me, what steps can be done once by once, but what can not
One key question i"My thinking has been deeply guided by this sentence from the tutorial: 'Once integrative analysis is complete, you can rejoin the layers.' So I have two questions:
Is JoinLayers only usable after IntegrateLayers? Can JoinLayers be used repeatedly?
Could you please clarify these points?"s whether
So yes you can join layers right after integrate layers but can also do it later it just depends on whether what you are doing requires all of the data separate or not.
Once you run JoinLayers the layers are merged so running repeatedly after that will not do anything unless you have split the layers again.
So you got scale.data layers because you merged objects. I would suggest joining layers to condense the data and then can split again to start your analysis.
Best, Sam
Thanks a lot. I am still confused. I do qc, Doublet seperately, then merge and normalize, scaledata, integrate , joinlayers. and get the result I showed. is there anything wrong, when you said ' you got scale.data layers because you merged objects.' do you mean merge here is not needed?
I am confused about which step must be analyzed in separately for each sample, and which should be merge or merged and joined, then to analyze furtherly, Can you provide me with further guidance
the seurat tutorial mentioned joinlayers should be done before differential gene analysis , and I not found other more information
Thanks a lot in https://satijalab.org/seurat/articles/seurat5_integration, it split seurat, but does not use merge, so in my analysis code, do I need merge or not, question like this also make me puzzled
So you can still use merge at first but then you should run joinlayers before starting analysis. This will remove the multiple scale.data layers
Thanks a lot. I am learning seurat v5.
The above code show layers as follwong. I am wondering why scale.data was not for each sample
when I used the similar code to analyze my onwn data, I found as follwing. It has many scale.data, including for each sample, where I am wrong