ixxmu / mp_duty

抓取网络文章到github issues保存
https://archives.duty-machine.now.sh/
116 stars 30 forks source link

通路反应基因活性推断工具:PROGENy(二) #3031

Closed ixxmu closed 1 year ago

ixxmu commented 1 year ago

https://mp.weixin.qq.com/s/5QSOaHvz__xoMBRuljKuSg

ixxmu commented 1 year ago

通路反应基因活性推断工具:PROGENy(二) by 生信菜鸟团

接上篇,PROGENy: Pathway RespOnsive GENes for activity inference(一)

Title:PROGENy: Pathway RespOnsive GENes for activity inference; 通路反应基因活性推断工具

在上篇中,主要介绍了PROGENy在Bulk RNA seq数据中的应用,以及PROGENy如何联系药物与信号通路。这篇继续探讨PROGENy在单细胞数据中的应用。

4.0  PROGENy在单细胞转录组中的应用

要运行PROGENy的单细胞版本,首先要获得演示数据。10X genomics 存储的pbmc3k的数据可以直接获得。

因此,我们先对该数据进行前期Seurat 处理10x genomics pbmc3k存储的链接是https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz如果在服务器上使用,可以直接用wget下载:

wget -b -c https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

而后进行Seurat流程:

library(Seurat)
library(ggplot2)
library(tidyr)
library(readr)
library(pheatmap)
library(tibble)
pbmc.data <-
Read10X(data.dir = "./filtered_gene_bc_matrices/hg19/")
## Initialize the Seurat object with the raw (non-normalized data).
pbmc <-
CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3,
min.features = 200)
## Identification of mithocondrial genes
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

## Filtering cells following standard QC criteria.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 &
percent.mt < 5)

## Normalizing the data
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize",
scale.factor = 10000)

pbmc <- NormalizeData(pbmc)

## Identify the 2000 most highly variable genes
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

## In addition we scale the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc), verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:10, umap.method = 'uwot', metric='cosine')
DimPlot(pbmc, reduction = "umap")
图1  pbmc3k umap

## Finding differentially expressed features (cluster biomarkers)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25,
logfc.threshold = 0.25, verbose = FALSE)

## Assigning cell type identity to clusters
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T",
"FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)

## We create a data frame with the specification of the cells that belong to
## each cluster to match with the Progeny scores.
CellsClusters <- data.frame(Cell = names(Idents(pbmc)),
CellType = as.character(Idents(pbmc)),
stringsAsFactors = FALSE)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

图2  pbmc3k  注释后 umap

在完成了细胞注释之后,可对pbmc3k 的样本进行PROGENy 分析。

## We compute the Progeny activity scores and add them to our Seurat object
## as a new assay called Progeny.
pbmc <- progeny(pbmc, scale=FALSE, organism="Human", top=500, perm=1,
return_assay = TRUE)

## We can now directly apply Seurat functions in our Progeny scores.
## For instance, we scale the pathway activity scores.
pbmc <- Seurat::ScaleData(pbmc, assay = "progeny")

## We transform Progeny scores into a data frame to better handling the results
progeny_scores_df <-
as.data.frame(t(GetAssayData(pbmc, slot = "scale.data",
assay = "progeny"))) %>%
rownames_to_column("Cell") %>%
gather(Pathway, Activity, -Cell)

## We match Progeny scores with the cell clusters.
progeny_scores_df <- inner_join(progeny_scores_df, CellsClusters)

## We summarize the Progeny scores by cellpopulation
summarized_progeny_scores <- progeny_scores_df %>%
group_by(Pathway, CellType) %>%
summarise(avg = mean(Activity), std = sd(Activity))

## We prepare the data for the plot
summarized_progeny_scores_df <- summarized_progeny_scores %>%
dplyr::select(-std) %>%
spread(Pathway, avg) %>%
data.frame(row.names = 1, check.names = FALSE, stringsAsFactors = FALSE)

paletteLength = 100
myColor = colorRampPalette(c("Darkblue", "white","red"))(paletteLength)

progenyBreaks = c(seq(min(summarized_progeny_scores_df), 0,
length.out=ceiling(paletteLength/2) + 1),
seq(max(summarized_progeny_scores_df)/paletteLength,
max(summarized_progeny_scores_df),
length.out=floor(paletteLength/2)))
progeny_hmap = pheatmap(t(summarized_progeny_scores_df[,-1]),fontsize=14,
fontsize_row = 10,
color=myColor, breaks = progenyBreaks,
main = "PROGENy (500)", angle_col = 45,
treeheight_col = 0, border_color = NA)

图3  PROGENy 单细胞结果展示



  5.  一些函数的使用  

在v1.20版本里面,除progeny 函数之外,PROGENy包也存在其他函数,如下:

但我们在使用过程中并没有太多的用到这些函数,progeny函数是这个包的核心,那么其他的函数是用来干嘛的呢?

查看progeny 的源码,我们发现,getModel函数是从作者提供的Human 及Mouse 中取出目标通路的基因集,

progeny = function(expr, scale=TRUE, organism="Human", top = 100, perm = 1, 
verbose = FALSE, z_scores = FALSE, get_nulldist = FALSE, assay_name = "RNA",
return_assay = FALSE, ...) {
UseMethod("progeny")
}
methods("progeny")
#> [1] progeny.default* progeny.ExpressionSet* progeny.matrix* progeny.Seurat* progeny.SingleCellExperiment*
#> see '?methods' for accessing help and source code
getAnywhere(progeny.matrix)
#> A single object matching ‘progeny.matrix’ was found
#> It was found in the following places
#> registered S3 method for progeny from namespace progeny
#> namespace:progeny
#> with value
#>
#> function (expr, scale = TRUE, organism = "Human", top = 100,
#> perm = 1, verbose = FALSE, z_scores = FALSE, get_nulldist = FALSE,
#> ...)
#> {
#> if (!is.logical(scale)) {
#> stop("scale should be a logical value")
#> }
#> if (!(is.numeric(perm)) || perm < 1) {
#> stop("perm should be an integer value")
#> }
#> if (!is.logical(verbose)) {
#> stop("verbose should be a logical value")
#> }
#> if (!is.logical(z_scores)) {
#> stop("z_scores should be a logical value")
#> }
#> if (!is.logical(get_nulldist)) {
#> stop("get_nulldist should be a logical value")
#> }
#> if (perm == 1 && (z_scores || get_nulldist)) {
#> if (verbose) {
#> message("z_scores and get_nulldist are only applicable when the\n number of permutations is larger than 1.")
#> }
#> }
#> model <- getModel(organism, top = top)
#> common_genes <- intersect(rownames(expr), rownames(model))
#> if (verbose) {
#> number_genes <- apply(model, 2, function(x) {
#> sum(rownames(model)[which(x != 0)] %in% unique(rownames(expr)))
#> })
#> message("Number of genes used per pathway to compute progeny scores:")
#> message(paste(names(number_genes), ": ", number_genes,
#> " (", (number_genes/top) * 100, "%)", sep = "", "\n"))
#> }
#> if (perm == 1) {
#> result <- t(expr[common_genes, , drop = FALSE]) %*% as.matrix(model[common_genes,
#> , drop = FALSE])
#> if (scale && nrow(result) > 1) {
#> rn <- rownames(result)
#> result <- apply(result, 2, scale)
#> rownames(result) <- rn
#> }
#> }
#> else if (perm > 1) {
#> expr <- data.frame(names = row.names(expr), row.names = NULL,
#> expr)
#> model <- data.frame(names = row.names(model), row.names = NULL,
#> model)
#> result <- progenyPerm(expr, model, k = perm, z_scores = z_scores,
#> get_nulldist = get_nulldist)
#> }
#> return(result)
#> }
#> <bytecode: 0x5648eb3c2ba0>
#> <environment: namespace:progeny>

其他函数可自行在帮助文档查阅各自功能。

附:当我们找不到说明文档怎么办?

在本软件学习过程中,遇到最明显的bug 就是作者在更新PROGENy之后,以往的帮助文档网页丢失了,这给学习使用R包的我们带来麻烦,这里简要分享我如何寻找这篇文章的帮助文档。

Bioconductor

一般情况下,Bioconductor R包在这个部分会提供说明文档的链接,点开HTML/R_Script 或者查看PDF都可以看到帮助文档,在这里,HTML(https://www.bioconductor.org/packages/release/bioc/vignettes/progeny/inst/doc/progeny.html)并没有太多使用信息,只是介绍了getModel 是怎么用的,而PDF(https://www.bioconductor.org/packages/release/bioc/manuals/progeny/man/progeny.pdf)也只是提供了各个函数的介绍。也就是说,从这里并没有获取到很有价值的信息。

图4. PROGENy Bioconductor 界面

参照Bioconductor的获取帮助文档的方式也失败了。

browseVignettes("progeny")
#> No vignettes found by browseVignettes("progeny")

那其实也可以检索一下旧版本的progeny 有没有配套的使用说明。但如果你也进行了这一步,你会发现,这个链接点开啥也没有,而在其他版本的HTML文件中,使用信息也不是那么全。

Github

常见的说明文档也会提供在Github (https://github.com/saezlab/progeny)的首页,但在这里并没有。

因此,下一步查看他的各个文件夹中里面有没有帮助文档。一般来说vignettes 文件夹是储存帮助文档的,感兴趣的可以试着点开看一下

我这里直接说结果,在docs 的下级目录中找到了这两个文件。如果你在github中打开,要是像我一样小白的话,可能看不懂它网页的源代码格式的数据,因此,就直接把他download下来,用本地浏览器打开,你会发现这就是一份操作说明。

结语

总体而言,PROGENy 是个很优秀的R包,但也有其不足的地方,比如,它只涵盖了14条通路。上篇写完才发现,原来已经有其他公众号上已经推送过PROGENy,还引用了文献的例子来介绍PROGENy的使用,读者可以互相参考借鉴,由于来源一致,大部分内容可能会比较趋同。如果对R包原理感兴趣,还是要移步作者原文仔细详读才行哦。

[1] Schubert M, Klinger B, Klünemann M, Sieber A, Uhlitz F, Sauer S, Garnett MJ, Blüthgen N, Saez-Rodriguez J. 2018. Perturbation-response genes reveal signaling footprints in cancer gene expression. *Nature Communications*: [10.1038/s41467-017-02391-6](https://doi.org/10.1038/s41467-017-02391-6)

[2] Holland CH, Szalai B, Saez-Rodriguez J. Transfer of regulatory knowledge from human to mouse for functional genomics analysis. Biochim Biophys Acta Gene Regul Mech. 2020;1863(6):194431. doi:10.1016/j.bbagrm.2019.194431

[3] Holland, C.H., Tanevski, J., Perales-Patón, J. et al. Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data. Genome Biol 21, 36 (2020). https://doi.org/10.1186/s13059-020-1949-z


- END -