Closed ixxmu closed 1 year ago
春风得意马蹄疾,一日看尽长安花
作为快速开始,本章将简单展示SCP三个模块(前处理、下游分析、可视化)的大致功能,各模块各函数的使用细节将在后续教程中详细说明。
目录: | |
---|---|
1. 数据探索 | 9. RNA速率分析 |
2. 细胞质控 | 10. 差异表达分析 |
3. 标准的单细胞处理流程 | 11. 富集分析(ORA) |
4. 批次矫正的单细胞整合处理流程 | 12. 富集分析(GSEA) |
5. 单细胞数据集间的映射 | 13. 轨迹推断 |
6. 使用Bulk RNA-seq数据进行细胞注释 | 14. 动态特征鉴定 |
7. 使用scRNA-seq数据进行细胞注释 | 15. 交互式的单细胞数据查询网页(SCExplorer) |
8. PAGA分析 | 16. 一些函数的可视化示例 |
本章主要使用下采样后的小鼠胚胎E15.5天的胰腺上皮单细胞数据进行分析,可以通过在R中运行?pancreas_sub
来查看该示例数据相关信息。
该数据中的细胞类型已经被注释,其中pancreas_sub[["CellType"]]
为粗分细胞类型,分为胰腺导管细胞(Ductal),内分泌前体细胞(Ngn3 low EP,Ngn3 high EP),前内分泌细胞(Pre-endocrine)和内分泌细胞(Endocrine)五种细胞类型。pancreas_sub[["SubCellType"]]
则进一步细分了终末分化的内分泌细胞类型,包括Beta,Alpha,Delta,Epsilon。
另外该数据已含有PCA
和UMAP
坐标信息,以及spliced
和unspliced
矩阵(可用于RNA速率分析)。
library(SCP)
library(BiocParallel)
register(MulticoreParam(workers = 8, progressbar = TRUE))
data("pancreas_sub")
print(pancreas_sub)
#> An object of class Seurat
#> 47874 features across 1000 samples within 3 assays
#> Active assay: RNA (15958 features, 3467 variable features)
#> 2 other assays present: spliced, unspliced
#> 2 dimensional reductions calculated: PCA, UMAP
注意,SCP的多线程策略与Seurat不同,使用BiocParallel[1]而不是future[2]。建议使用SCP时不要启用future,否则在一些函数运行上会变慢。
这里使用BiocParallel设置了线程数为8,也可以在运行相关函数时手动设置,SCP支持多线程的函数有:
然后可以使用一些函数对数据进行探索性的可视化:
CellDimPlot(
srt = pancreas_sub, group.by = c("CellType", "SubCellType"),
reduction = "UMAP", theme_use = "theme_blank"
)
CellDimPlot(
srt = pancreas_sub, group.by = "SubCellType", stat.by = "Phase",
reduction = "UMAP", theme_use = "theme_blank"
)
FeatureDimPlot(
srt = pancreas_sub, features = c("Sox9", "Neurog3", "Fev", "Rbp4"),
reduction = "UMAP", theme_use = "theme_blank"
)
FeatureDimPlot(
srt = pancreas_sub, features = c("Ins1", "Gcg", "Sst", "Ghrl"),
compare_features = TRUE, label = TRUE, label_insitu = TRUE,
reduction = "UMAP", theme_use = "theme_blank"
)
ht <- GroupHeatmap(
srt = pancreas_sub,
features = c(
"Sox9", "Anxa2", # Ductal
"Neurog3", "Hes6", # EPs
"Fev", "Neurod1", # Pre-endocrine
"Rbp4", "Pyy", # Endocrine
"Ins1", "Gcg", "Sst", "Ghrl" # Beta, Alpha, Delta, Epsilon
),
group.by = c("CellType", "SubCellType"),
heatmap_palette = "YlOrRd",
cell_annotation = c("Phase", "G2M_score", "Cdh2"),
cell_annotation_palette = c("Dark2", "Paired", "Paired"),
show_row_names = TRUE, row_names_side = "left",
add_dot = TRUE, add_reticle = TRUE
)
print(ht$plot)
RunCellQC
将会采用默认质控策略对各细胞进行质控指标的计算和分类。
因为示例数据已无需质控,这里仅作演示:
pancreas_sub <- RunCellQC(srt = pancreas_sub)
CellDimPlot(srt = pancreas_sub, group.by = "CellQC", reduction = "UMAP")
CellStatPlot(srt = pancreas_sub, stat.by = "CellQC", group.by = "CellType", label = TRUE)
CellStatPlot(
srt = pancreas_sub,
stat.by = c(
"db_qc", "outlier_qc", "umi_qc", "gene_qc",
"mito_qc", "ribo_qc", "ribo_mito_ratio_qc", "species_qc"
),
plot_type = "upset", stat_level = "Fail"
)
对于单个单细胞样本,可以直接运行Standard_SCP
来进行normalization,feature selection, dimension reduction,clustering等标准的单细胞数据处理流程,最终获得细胞分群,降维后坐标等:
pancreas_sub <- Standard_SCP(srt = pancreas_sub)
CellDimPlot(
srt = pancreas_sub, group.by = c("CellType", "SubCellType"),
reduction = "StandardUMAP2D", theme_use = "theme_blank"
)
Standard_SCP
默认还会计算UMAP的三维空间嵌入坐标,可进行3D可视化:
CellDimPlot3D(srt = pancreas_sub, group.by = "SubCellType")
对于具有批次效应的多批次数据,可以使用Integration_SCP
进行批次矫正。
Integration_SCP
需要两个设置额外参数:
batch
Seurat对象中用于区分批次信息的meta.data属性integration_method
用于矫正的算法,支持Seurat, scVI, MNN, fastMNN,Harmony, Scanorama, BBKNN, CSS, LIGER, Conos,Combat,不作矫正的话可以设置为Uncorrected接下以一个下采样后的8个成人胰腺单细胞数据集为例,进行不同算法的批次效应矫正,可以通过在R中运行?panc8_sub
来查看该示例数据相关信息:
data("panc8_sub")
panc8_sub <- Integration_SCP(srtMerge = panc8_sub, batch = "tech", integration_method = "Seurat")
CellDimPlot(
srt = panc8_sub, group.by = c("celltype", "tech"), reduction = "SeuratUMAP2D",
title = "Seurat", theme_use = "theme_blank"
)
不矫正以及其余批次矫正算法的结果:
到这里已经有了两个胰腺数据集,虽然是分别小鼠和人两个物种,且一个是胚胎胰腺另一个是成人胰腺,但是许多细胞类型是类似的,理论上相同的细胞类型会有近邻关系。
所以可以对这两个数据集进行细胞映射,映射前首先需要将小鼠基因名称转换成人基因名称(这里为了简单起见直接首字母大写,但推荐使用GeneConvert
进行物种间的基因的ID转换):
panc8_rename <- RenameFeatures(srt = panc8_sub, newnames = make.unique(capitalize(rownames(panc8_sub), force_tolower = TRUE)), assays = "RNA")
srt_query <- RunKNNMap(srt_query = pancreas_sub, srt_ref = panc8_rename, ref_umap = "SeuratUMAP2D")
ProjectionPlot(
srt_query = srt_query, srt_ref = panc8_rename,
query_group = "SubCellType", ref_group = "celltype"
)
当然,相同物种、相同组织的数据集作细胞映射是最佳的:
srt_ref <- panc8_sub[, panc8_sub$tech != "smartseq2"]
srt_query <- panc8_sub[, panc8_sub$tech == "smartseq2"]
srt_query <- RunKNNMap(srt_query = srt_query, srt_ref = srt_ref, ref_umap = "SeuratUMAP2D")
ProjectionPlot(
srt_query = srt_query, srt_ref = srt_ref,
query_group = "celltype", ref_group = "celltype"
)
显然映射结果基本上可以匹配正确的细胞类型,因此很多时候可以利用相同组织已注释的数据集,映射手头的数据,来直观的进行细胞注释。
细胞注释和细胞映射在背后的计算上是类似的,区别在于前者的目标是明确细胞类型(或其他属性)间的对应关系,而不是空间坐标的邻近关系。
SCP准备了人和鼠的参考集scHCL[3]和scMCA[4](行为基因,列为各细胞类型)。我们将scMCA[5]其视作一个Bulkrna-seq来进行细胞类型注释,并过滤掉小于20个细胞的细胞类型并归为unreliable:
data("ref_scMCA")
dim(ref_scMCA)
#> [1] 3028 894
pancreas_sub <- RunKNNPredict(srt_query = pancreas_sub, bulk_ref = ref_scMCA, filter_lowfreq = 20)
CellDimPlot(srt = pancreas_sub, group.by = "KNNPredict_classification", reduction = "UMAP", label = TRUE)
利用之前成人胰腺的scRNA-seq数据也可以直接进行注释,注释可在单个细胞层面或整个细胞类型层面进行:
pancreas_sub <- RunKNNPredict(
srt_query = pancreas_sub, srt_ref = panc8_rename,
ref_group = "celltype", filter_lowfreq = 20
)
CellDimPlot(srt = pancreas_sub, group.by = "KNNPredict_classification", reduction = "UMAP", label = TRUE)
pancreas_sub <- RunKNNPredict(
srt_query = pancreas_sub, srt_ref = panc8_rename,
query_group = "SubCellType", ref_group = "celltype",
return_full_distance_matrix = TRUE
)
CellDimPlot(srt = pancreas_sub, group.by = "KNNPredict_classification", reduction = "UMAP", label = TRUE)
在细胞类型注释时,RunKNNPredict
会使用距离或相似性指标进行两个数据集的细胞间比较或细胞类型间比较,最终确定细胞类型,这里也可以使用CellCorHeatmap
来直接观察细胞类型间的距离或相似性,并标记出前三个最高的相似性值:
ht <- CellCorHeatmap(
srt_query = pancreas_sub, srt_ref = panc8_rename,
query_group = "SubCellType", ref_group = "celltype",
nlabel = 3, label_by = "row",
show_row_names = TRUE, show_column_names = TRUE
)
print(ht$plot)
PAGA分析可以推断出细胞群之间的邻近关系或轨迹,建立好SCP的python环境后,直接运行:
pancreas_sub <- RunPAGA(
srt = pancreas_sub, group_by = "SubCellType",
linear_reduction = "PCA", nonlinear_reduction = "UMAP"
)
PAGAPlot(srt = pancreas_sub, reduction = "UMAP", label = TRUE, label_insitu = TRUE, label_repel = TRUE)
注意,PAGA的结果存储在了pancreas_sub@misc$paga
中。
RNA速率的估计需要Seurat对象含有spliced
和unspliced
两个assay,可以通过names(pancreas_sub@assays)
查看,如果不存在需要使用velocyto[6],bustools[7]或者alevin[8]来生成矩阵并存入Seurat对象中。
在运行RunPAGA
或RunSCVELO
的过程中,如果使用的是Rstudio,则可以在其plot窗口看到相关的输出结果,运行结束后相关的结果将保存在Seurat对象中。
pancreas_sub <- RunSCVELO(
srt = pancreas_sub, group_by = "SubCellType",
linear_reduction = "PCA", nonlinear_reduction = "UMAP"
)
VelocityPlot(srt = pancreas_sub, reduction = "UMAP", group_by = "SubCellType")
VelocityPlot(srt = pancreas_sub, reduction = "UMAP", plot_type = "stream")
SCP中的差异分析使用的是RunDEtest
,其中主要调用了Seurat的FindMarkers
函数,并使用了多线程来完成计算:
pancreas_sub <- RunDEtest(srt = pancreas_sub, group_by = "CellType", fc.threshold = 1, only.pos = FALSE)
VolcanoPlot(srt = pancreas_sub, group_by = "CellType")
所有的差异分析结果存储在pancreas_sub@tools
中,可以进行进一步的筛选出差异具有显著性的基因,并对这些基因注释上转录因子和表面蛋白,最后可视化:
DEGs <- pancreas_sub@tools$DEtest_CellType$AllMarkers_wilcox
DEGs <- DEGs[with(DEGs, avg_log2FC > 1 & p_val_adj < 0.05), ]
# Annotate features with transcription factors and surface proteins
pancreas_sub <- AnnotateFeatures(pancreas_sub, species = "Mus_musculus", db = c("TF", "CSPA"))
ht <- FeatureHeatmap(
srt = pancreas_sub, group.by = "CellType", features = DEGs$gene, feature_split = DEGs$group1,
species = "Mus_musculus", db = c("GO_BP", "KEGG", "WikiPathway"), anno_terms = TRUE,
feature_annotation = c("TF", "CSPA"), feature_annotation_palcolor = list(c("gold", "steelblue"), c("forestgreen")),
height = 5, width = 4
)
print(ht$plot)
上面的热图中同时也做了富集分析,也可以使用RunEnrichment
单独进行计算,这里我们筛选出1.5倍foldchange且p_val_adj<0.05的差异基因做富集和不同类型的可视化(默认只显示p.adjust<0.05的显著性富集的事件):
pancreas_sub <- RunEnrichment(
srt = pancreas_sub, group_by = "CellType", db = "GO_BP", species = "Mus_musculus",
DE_threshold = "avg_log2FC > log2(1.5) & p_val_adj < 0.05"
)
EnrichmentPlot(
srt = pancreas_sub, group_by = "CellType", group_use = c("Ductal", "Endocrine"),
plot_type = "bar"
)
EnrichmentPlot(
srt = pancreas_sub, group_by = "CellType", group_use = c("Ductal", "Endocrine"),
plot_type = "wordcloud"
)
EnrichmentPlot(
srt = pancreas_sub, group_by = "CellType", group_use = c("Ductal", "Endocrine"),
plot_type = "wordcloud", word_type = "feature"
)
EnrichmentPlot(
srt = pancreas_sub, group_by = "CellType", group_use = "Ductal",
plot_type = "network"
)
注意:需要调整plot窗口,来显示出所有的事件标签
EnrichmentPlot(
srt = pancreas_sub, group_by = "CellType", group_use = "Ductal",
plot_type = "enrichmap"
)
EnrichmentPlot(srt = pancreas_sub, group_by = "CellType", plot_type = "comparison")
之前运行RunDEtest时fc.threshold为1,也就是foldchange不作为基因的筛选标准,只使用其余标准进行筛选,例如min.pct>0.1。差异检验后可以将p_val_adj< 0.05的所有基因取出,作为候选基因集进行GSEA分析:
pancreas_sub <- RunGSEA(
srt = pancreas_sub, group_by = "CellType", db = "GO_BP", species = "Mus_musculus",
DE_threshold = "p_val_adj < 0.05"
)
GSEAPlot(srt = pancreas_sub, group_by = "CellType", group_use = "Endocrine", id_use = "GO:0007186")
GSEAPlot(srt = pancreas_sub, group_by = "CellType", plot_type = "comparison")
ORA富集分析和GSEA富集分析的结果也均存于pancreas_sub@tools
中
SCP支持Slingshot, PAGA, scVelo, Palantir, Monocle2, Monocle3,WOT等轨迹推断方法,不过不论是什么方法都需要基于生物学知识库对算法进行合理的限制,对结果进行合理的裁剪。
以slingshot为例进行轨迹推断:
pancreas_sub <- RunSlingshot(srt = pancreas_sub, group.by = "SubCellType", reduction = "UMAP")
在这里Slingshot推断出了三条发育轨迹/谱系,起点均为Ductal/Ngn3 lowEP细胞群,终点分别指向了三种内分泌细胞群Alpha,Beta以及Delta。分别查看三条轨迹的pseudotime:
FeatureDimPlot(pancreas_sub,
features = paste0("Lineage", 1:3),
reduction = "UMAP", theme_use = "theme_blank"
)
灰色的细胞表示不在对应的发育轨迹,其对应轨迹的pesudotime=NA
。
轨迹的可视化可以根据这些pesudotime调整:
CellDimPlot(pancreas_sub,
group.by = "SubCellType", reduction = "UMAP",
lineages = paste0("Lineage", 1:3), lineages_span = 0.1
)
对于任意算法得到的轨迹pseudotime,都可以使用RunDynamicFeatures
计算出与之相关的动态基因(驱动细胞沿轨迹分化发育的基因)。
对于多条轨迹也可以同时计算动态基因并比较:
pancreas_sub <- RunDynamicFeatures(srt = pancreas_sub, lineages = c("Lineage1", "Lineage2"), n_candidates = 200)
ht <- DynamicHeatmap(
srt = pancreas_sub, lineages = c("Lineage1", "Lineage2"),
use_fitted = TRUE, n_split = 6, reverse_ht = "Lineage1",
species = "Mus_musculus", db = "GO_BP", anno_terms = TRUE, anno_keys = TRUE, anno_features = TRUE,
heatmap_palette = "viridis", cell_annotation = "SubCellType",
separate_annotation = list("SubCellType", c("Nnat", "Irx1")), separate_annotation_palette = c("Paired", "Set1"),
feature_annotation = c("TF", "CSPA"), feature_annotation_palcolor = list(c("gold", "steelblue"), c("forestgreen")),
pseudotime_label = 25, pseudotime_label_color = "red",
height = 5, width = 2
)
print(ht$plot)
或逐一查询基因在轨迹中的动态变化:
DynamicPlot(
srt = pancreas_sub, lineages = c("Lineage1", "Lineage2"), group.by = "SubCellType",
features = c("Plk1", "Hes1", "Neurod2", "Ghrl", "Gcg", "Ins2"),
compare_lineages = TRUE, compare_features = FALSE
)
FeatureStatPlot(
srt = pancreas_sub, group.by = "SubCellType", bg.by = "CellType",
stat.by = c("Sox9", "Neurod2", "Isl1", "Rbp4"), add_box = TRUE,
comparisons = list(
c("Ductal", "Ngn3 low EP"),
c("Ngn3 high EP", "Pre-endocrine"),
c("Alpha", "Beta")
)
)
最后通过SCExplorer,可以快速的共享单细胞数据到云端,实现交互式的数据查询:
PrepareSCExplorer(list(mouse_pancreas = pancreas_sub, human_pancreas = panc8_sub), base_dir = "./SCExplorer")
app <- RunSCExplorer(base_dir = "./SCExplorer")
list.files("./SCExplorer") # This directory can be used as site directory for Shiny Server.
if (interactive()) {
shiny::runApp(app)
}
最后分享几个常用SCP的可视化方法及其示例:
CellDimPlot[9]
CellStatPlot[10]
FeatureStatPlot[11]
GroupHeatmap[12]
BiocParallel: https://bioconductor.org/packages/release/bioc/html/BiocParallel.html
[2]future: https://github.com/HenrikBengtsson/future
[3]scHCL: https://github.com/ggjlab/scHCL
[4]scMCA: https://github.com/ggjlab/scMCA
[5]scMCA: https://github.com/ggjlab/scMCA
[6]velocyto: http://velocyto.org/velocyto.py/index.html
[7]bustools: https://bustools.github.io/BUS_notebooks_R/velocity.html
[8]alevin: https://combine-lab.github.io/alevin-fry-tutorials/2021/alevin-fry-velocity/
[9]CellDimPlot: https://zhanghao-njmu.github.io/SCP/reference/CellDimPlot.html
[10]CellStatPlot: https://zhanghao-njmu.github.io/SCP/reference/CellStatPlot.html
[11]FeatureStatPlot: https://zhanghao-njmu.github.io/SCP/reference/FeatureStatPlot.html
[12]GroupHeatmap: https://zhanghao-njmu.github.io/SCP/reference/GroupHeatmap.html
如果你对单细胞转录组研究感兴趣,但又不知道如何入门,也许你可以关注一下下面的课程
看完记得顺手点个“在看”哦!
长按扫码可关注
https://mp.weixin.qq.com/s/2IbHV-UB4bJtQnnTobZhzw