Closed ixxmu closed 5 months ago
jimmy老师在他的生信游民系列提出了很多学徒任务,仔细去看每一篇文章的动机,都是科研过程中经常会经历的。
今天我们分享的任务是来自:生物学功能注释三板斧。 本次任务我计划分为四个部分来写,今天写第一部分。
1.DoRothEA--单细胞转录因子评分
2.PROGENy--单细胞通路活性评分
3.DoRothEA--转录组转录因子评分
4.PROGENy--转录组通路活性评分
正文部分
通常我们都是利用基因集合去做单细胞评分,那么是否可以通过有权重的基因集来计算细胞的转录活性?这就是DoRothEA开发出来的意义。
CollecTRI 来源的调控子总结了从 12 种不同 转录因子(TF)-目标基因相互作用的数据库。该集合扩大了转录因子的覆盖范围,并与其他已知的 GRN 进行了比较。结果表明,在使用 knockTF 数据集根据基因表达数据识别受干扰的 TF 方面,该集合具有卓越的性能。
简单来说,CollecTRI 是第二代DoRothEA,对DoRothEA做了更新。
load("~/gzh/pbmc3k_final_v4.rds")
#1 加载数据集合----
## We read Dorothea Regulons for Human:
dorothea_regulon_human <- get(data("dorothea_hs", package = "dorothea"))
##如果是小鼠,就用
##dorothea_regulon_mouse <- get(data("dorothea_mm", package = "dorothea"))
#1 过滤数据集的置信度----
## We obtain the regulons based on interactions with confidence level A, B and C
regulon <- dorothea_regulon_human %>%
dplyr::filter(confidence %in% c("A","B","C"))
#类似结果,作者现在更推荐collectri的结果!
net <- decoupleR::get_collectri(split_complexes = T,organism='human' )
head(net)
regulon
作者现在更推荐使用collectri的数据
#2计算细胞的TF活性----
## We compute Viper Scores
pbmc <- run_viper(pbmc, regulon,
options = list(method = "scale", minsize = 4,
eset.filter = FALSE, cores = 4,
verbose = FALSE))
pbmc@assays$dorothea
计算出得分之后,后续就是考验个人的可视化功底了!
可视化方案一
#3 对dorothea矩阵进行降维聚类分群-----
## We compute the Nearest Neighbours to perform cluster
DefaultAssay(object = pbmc) <- "dorothea"
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc, features = rownames(pbmc), verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:10, umap.method = "uwot", metric = "cosine")
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25,
logfc.threshold = 0.25, verbose = FALSE)
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(10, avg_log2FC);top10
pbmc@assays$dorothea@data[1:4,1:4]
top10 =top10[top10$avg_log2FC > 0.3,]
DoHeatmap(pbmc ,top10$gene,size=3,slot = 'scale.data')
从热图可以看出,其实还是有一大部分TF在不同的细胞类型中特异性表达的。
如果想把热图画好看一点,可以参考:优雅地展示20w单细胞热图|非Doheatmap
d.all=pbmc
head(d.all@meta.data)
Idents(d.all)=d.all$cell.type
a=AverageExpression(d.all,return.seurat = TRUE)
#可以用 a=AggregateExpression(d.all,return.seurat = TRUE)试试看哦
a$orig.ident=rownames(a@meta.data)
head(a@meta.data);a
markers=pbmc.markers ; head(markers)
markers$cluster=factor(markers$cluster,levels = unique(markers$cluster) )
DoHeatmap(a,draw.lines = FALSE, slot = 'scale.data',assay = 'dorothea',
features = markers %>%group_by(cluster) %>%dplyr::slice_max(avg_log2FC,n = 5) %>% .$gene )
上面进行可视化的TF都是通过findmarkers找到的,我们还可以根据std值找TF
制备画图所有数据
#4获取Viper得分矩阵,评价细胞群的TF活性----
## We transform Viper scores, scaled by seurat, into a data frame to better
## handling the results
viper_scores_df <- GetAssayData(pbmc, slot = "scale.data",
assay = "dorothea") %>%
data.frame(check.names = F) %>%
t();viper_scores_df[1:4,1:4]
## We create a data frame containing the cells and their clusters
CellsClusters <- data.frame(cell = names(Idents(pbmc)),
cell_type = as.character(Idents(pbmc)),
check.names = F) #也可以使用其他的分类信息
CellsClusters[1:4, ]
## We create a data frame with the Viper score per cell and its clusters
viper_scores_clusters <- viper_scores_df %>%
data.frame() %>%
rownames_to_column("cell") %>%
gather(tf, activity, -cell) %>%
inner_join(CellsClusters);viper_scores_clusters[1:4,]
## We summarize the Viper scores by cellpopulation
summarized_viper_scores <- viper_scores_clusters %>%
group_by(tf, cell_type) %>%
summarise(avg = mean(activity),
std = sd(activity));summarized_viper_scores
var(1,3 )
5#5细胞群间变化最大的20个TFs进行可视化----
## We select the 20 most variable TFs. (20*9 populations = 180) 9个细胞亚群
highly_variable_tfs <- summarized_viper_scores %>%
group_by(tf) %>%
mutate(var = var(avg)) %>% # 计算变异度 calculates the variance of the "avg" column. So, for each transcription factor, it computes the variance of its average score.
ungroup() %>%
top_n(180, var) %>%
distinct(tf);highly_variable_tfs
## We prepare the data for the plot
summarized_viper_scores_df <- summarized_viper_scores %>%
semi_join(highly_variable_tfs, by = "tf") %>%
dplyr::select(-std) %>%
spread(tf, avg) %>%
data.frame(row.names = 1, check.names = FALSE) ;summarized_viper_scores_df
palette_length = 100
my_color = colorRampPalette(c("Darkblue", "white","red"))(palette_length)
my_breaks <- c(seq(min(summarized_viper_scores_df), 0,
length.out=ceiling(palette_length/2) + 1),
seq(max(summarized_viper_scores_df)/palette_length,
max(summarized_viper_scores_df),
length.out=floor(palette_length/2)))
viper_hmap <- pheatmap(t(summarized_viper_scores_df),fontsize=14,
fontsize_row = 10,
color=my_color, breaks = my_breaks,
main = "DoRothEA (ABC)", angle_col = 45,
treeheight_col = 0, border_color = NA)
遗憾的是在这种可视化方案中,效果似乎不太好。
这也可以理解,毕竟这是转录因子活性分析,看的是转录因子下游基因的表达情况,而不是转录因子本身。所以对转录因子本身进行可视化似乎不是一个好的选择。
Idents(pbmc)='RNA'
pbmc@meta.data$GATA1= pbmc@assays$dorothea@scale.data['GATA1',]
FeaturePlot(pbmc,features = "GATA1",reduction = "umap" )| DimPlot(pbmc,label=T,group.by = 'cell.type')
FeaturePlot(pbmc,features = "PBX2",reduction = "umap" )| DimPlot(pbmc,label=T,group.by = 'cell.type')
参考:
https://www.jianshu.com/p/49617173a8ad
https://bioconductor.org/packages/release/data/experiment/vignettes/dorothea/inst/doc/dorothea.html
https://github.com/saezlab/CollecTRI
https://github.com/saezlab/CollecTRI/blob/main/scripts/casestudy/02.2_pbmc_TFactivity.R
https://mp.weixin.qq.com/s/XjVqpSfR-NBN3NkI_cEOGw