超越代码4：数据整合与降维聚类

超越代码4：数据整合与降维聚类 by 生信会客厅

这次分享的教程源自我多年优化的实战代码，其中有很多提高效率和优化结果的自定义函数封装于scPRIT包。由于scPRIT包依赖特定的R和Python环境，我特意将它集成在docker镜像kinesin/rstudio:2.0.0中。为了确保教程代码的顺畅运行，推荐使用该docker镜像提供的分析环境。

恢复分析环境容器

Docker容器第一次启动时需要配置一些参数，但是停止运行后重新恢复则非常简单。下面演示恢复上一节停止的rstudio容器：

Windows系统

找到昨天的容器（我把它命名为rstudio），按下启动按钮就可以了

Linux系统

# 查看当前所有容器(可以不运行)docker ps -a# 启动rstudio的容器docker start rstudio

数据预处理

初始化R环境

library(Seurat)library(tidyverse)library(patchwork)library(parallel)library(future)library(scPRIT)
setwd("~/project/BeyondCode/CellType")options(scprithost = "xxx.xxx.xxx.xxx", authcode = "xxxxxxxxxxxx")options(future.globals.maxSize = 30 * 1024^3)

Seurat4对象中任何表达矩阵都是所有样本合并的，Seurat5在很多情况下保持样本表达矩阵是独立性。Seurat5对象的表达矩阵保存在layers中，最开始每个样本一个counts矩阵，经过NormalizeData会为每个样本生成data矩阵，运行ScaleData之后会添加一个合并所有样本的scale.data矩阵。

数据预处理

scRNA <- readRDS("~/project/BeyondCode/Data/sco.sub.qc.rds")scRNA <- NormalizeData(scRNA)scRNA <- FindVariableFeatures(scRNA)scRNA <- ScaleData(scRNA)scRNA <- RunPCA(scRNA, verbose = F)ElbowPlot(scRNA, ndims = 50)saveRDS(scRNA, file = "sco.log_pp.rds")

常规降维聚类

数据整合工具虽然可以在一定程度上校正批次效应，但是也会引入新的误差。因此我们拿到数据后不要盲目地运行数据整合流程，最好先看看数据本身的情况再做决定。

pcs <- 1:20scRNA <- FindNeighbors(scRNA, reduction = "pca", dims = pcs)scRNA <- FindClusters(scRNA, resolution = 1, cluster.name = "raw_cluster")scRNA <- RunUMAP(scRNA, reduction = "pca", dims = pcs)scRNA <- RunTSNE(scRNA, reduction = "pca", dims = pcs)save(scRNA, file = "sco.log_dr.rda")

查看聚类结果与SingleR预测的细胞类型

p1 <- DimPlot(scRNA, reduction = "umap", group.by = "seurat_clusters", label = T) + NoLegend()p2 <- DimPlot(scRNA, reduction = "umap", group.by = "SingleR", label = T) + NoLegend()p <- p1 + p2ggsave("unInt_SingleR_Cluster.pdf", p, width = 15, height = 7)

查看doublets预测结果

p1 <- FeaturePlot(scRNA, reduction = "umap", features = "DF.scores")p2 <- DimPlot(scRNA, reduction = "umap", group.by = "DF.doublets") +   scale_color_manual(values = c("red","gray"))p3 <- FeaturePlot(scRNA, reduction = "umap", features = "scr_score")p4 <- DimPlot(scRNA, reduction = "umap", group.by = "scr_pred") +   scale_color_manual(values = c("gray","red"))p <- (p1|p2)/(p3|p4)ggsave("unInt_Doublets.pdf", p, width = 12, height = 9)

查看样本分组

 p <- DimPlot(scRNA, reduction = "umap", group.by = c("SampleID","Group"), ncol = 2)ggsave("unInt_Groups.pdf", p, width = 15, height = 7)

数据整合后降维聚类

下面介绍的三种整合方法CCA、RPCA和Harmony，基本可以解决绝大部分的分析场景。CCA一般用于跨物种、平台、组织的数据整合，一般的实验批次效应校正用RPCA和Harmony就可以了。CCA和RPCA使用k.anchor参数调整整合力度，Harmony使用lambda参数调整整合力度。

# CCA整合plan(multisession, workers = 16) scRNA <- IntegrateLayers(  scRNA, method = CCAIntegration, k.anchor = 5,  orig.reduction = "pca", new.reduction = "cca",  verbose = FALSE)plan(sequential)
# RPCA整合plan(multisession, workers = 16) scRNA <- IntegrateLayers(  scRNA, method = RPCAIntegration, k.anchor = 5,  orig.reduction = "pca", new.reduction = "rpca",  verbose = FALSE)plan(sequential)
# Harmony整合 scRNA <- IntegrateLayers(  scRNA, method = HarmonyIntegration, lambda = 1,  orig.reduction = "pca", new.reduction = "harmony",  verbose = FALSE)  # CCA整合结果降维聚类scRNA <- FindNeighbors(scRNA, reduction = "cca", dims = 1:30)scRNA <- FindClusters(scRNA, resolution = 0.5, cluster.name = "cca_cluster")scRNA <- RunUMAP(scRNA, reduction = "cca", dims = 1:30, reduction.name = "cca.umap")
# RPCA整合结果降维聚类scRNA <- FindNeighbors(scRNA, reduction = "rpca", dims = 1:30)scRNA <- FindClusters(scRNA, resolution = 0.5, cluster.name = "rpca_cluster")scRNA <- RunUMAP(scRNA, reduction = "rpca", dims = 1:30, reduction.name = "rpca.umap")
# Harmony整合结果降维聚类scRNA <- FindNeighbors(scRNA, reduction = "harmony", dims = 1:20)scRNA <- FindClusters(scRNA, resolution = 0.5, cluster.name = "hny_cluster")scRNA <- RunUMAP(scRNA, reduction = "harmony", dims = 1:20, reduction.name = "hny.umap")
# 保存结果saveRDS(scRNA, file = "~/project/BeyondCode/Data/sco.int_dr.rds")

对比整合结果

# 画图展示p1 <- DimPlot(scRNA, reduction = "umap", group.by = "raw_cluster", label = T) + NoLegend()p2 <- DimPlot(scRNA, reduction = "cca.umap", group.by = "cca_cluster", label = T) + NoLegend()p3 <- DimPlot(scRNA, reduction = "rpca.umap", group.by = "rpca_cluster", label = T) + NoLegend()p4 <- DimPlot(scRNA, reduction = "hny.umap", group.by = "hny_cluster", label = T) + NoLegend()p <- (p1+p2)/(p3+p4)ggsave("Cluster_comparision.pdf", p, width = 12, height = 10)
p1 <- DimPlot(scRNA, reduction = "umap", group.by = "SingleR", label = T) + NoLegend()p2 <- DimPlot(scRNA, reduction = "cca.umap", group.by = "SingleR", label = T) + NoLegend()p3 <- DimPlot(scRNA, reduction = "rpca.umap", group.by = "SingleR", label = T) + NoLegend()p4 <- DimPlot(scRNA, reduction = "hny.umap", group.by = "SingleR", label = T) + NoLegend()p <- (p1+p2)/(p3+p4)ggsave("SingleR_comparision.pdf", p, width = 12, height = 10)

聚类结果对比

细胞分布对比

SCTransform分析流程

传统的CPM方法校正测序深度对基因表达值的影响时，在某些情况下可能无法达到最佳效果。Jan Lause等研究人员通过采用皮尔逊残差分析来校正测序深度，展现出了更优的性能。这一方法可以通过Seurat的SCTransform函数实现。接下来将提供基于此方法的一个分析流程，供对此感兴趣的朋友参考。

scRNA <- readRDS("~/project/BeyondCode/Data/sco.sub.qc.rds")scRNA <- SCTransform(scRNA)scRNA <- RunPCA(scRNA, npcs = 30, verbose = F)scRNA <- IntegrateLayers(  object = scRNA, k.anchor = 5,  method = RPCAIntegration,  normalization.method = "SCT",  verbose = F)scRNA <- FindNeighbors(scRNA, dims = 1:30, reduction = "integrated.dr")scRNA <- FindClusters(scRNA, resolution = 0.5)scRNA <- RunUMAP(scRNA, dims = 1:30, reduction = "integrated.dr")save(scRNA, file = "sco.sct_rpca.rda")

分析完成后记得停止容器的运行！

数据与资源

本节分析数据与结果百度云下载链接https://pan.baidu.com/s/1xHQQ6ygaqTT5MjKRZTh6TA?pwd=eqjw

提取码：eqjw

kinesin/rstudio:2.0.0镜像和scPRIT包的开发、测试非常耗费时间和精力，后续我希望可以持续提供维护和更新服务，因此计划采用收费授权的模式，具体标准如下：

200元，授权使用90天
600元，授权使用一年

授权期内如有更新免费升级，有意者添加Kinesin微信购买。

超越代码实战训练班

很多第一次分析单细胞数据的朋友，会把70%的时间耗费在代码和生信基础问题上，真正用来思考生物学问题的时间非常有限，因此产出高水平的文章相对比较困难。近几年，我深度参与了几十个单细胞、空转项目的分析，目前已有十几篇合作的署名/致谢文章发表。如果有我这么一个过来人提供分析环境和实战流程，并参与您的项目讨论，或许可以起到事半功倍的效果。如果您对此有兴趣，请阅读《超越代码》和《单细胞分析实战训练班的课程规划》了解详情！

ixxmu / mp_duty

超越代码4：数据整合与降维聚类 #5713

超越代码4：数据整合与降维聚类 by 生信会客厅