通路反应基因活性推断工具：PROGENy（二） by 生信菜鸟团

接上篇，PROGENy: Pathway RespOnsive GENes for activity inference（一）

Title：PROGENy: Pathway RespOnsive GENes for activity inference; 通路反应基因活性推断工具

在上篇中，主要介绍了PROGENy在Bulk RNA seq数据中的应用，以及PROGENy如何联系药物与信号通路。这篇继续探讨PROGENy在单细胞数据中的应用。

4.0 PROGENy在单细胞转录组中的应用

要运行PROGENy的单细胞版本，首先要获得演示数据。10X genomics 存储的pbmc3k的数据可以直接获得。

因此，我们先对该数据进行前期Seurat 处理10x genomics pbmc3k存储的链接是https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz如果在服务器上使用，可以直接用wget下载：

wget -b -c https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

而后进行Seurat流程：

library(Seurat)
library(ggplot2)
library(tidyr)
library(readr)
library(pheatmap)
library(tibble)
pbmc.data <- 
  Read10X(data.dir = "./filtered_gene_bc_matrices/hg19/")
## Initialize the Seurat object with the raw (non-normalized data).
pbmc <- 
  CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, 
                     min.features = 200)
## Identification of mithocondrial genes
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

## Filtering cells following standard QC criteria.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & 
    percent.mt < 5)

## Normalizing the data
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", 
    scale.factor = 10000)

pbmc <- NormalizeData(pbmc)

## Identify the 2000 most highly variable genes
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

## In addition we scale the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc), verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:10,  umap.method = 'uwot', metric='cosine')
DimPlot(pbmc, reduction = "umap")

## Finding differentially expressed features (cluster biomarkers)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, 
    logfc.threshold = 0.25, verbose = FALSE)

## Assigning cell type identity to clusters
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", 
    "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)

## We create a data frame with the specification of the cells that belong to 
## each cluster to match with the Progeny scores.
CellsClusters <- data.frame(Cell = names(Idents(pbmc)), 
    CellType = as.character(Idents(pbmc)),
    stringsAsFactors = FALSE)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

图2 pbmc3k 注释后 umap

在完成了细胞注释之后，可对pbmc3k 的样本进行PROGENy 分析。

## We compute the Progeny activity scores and add them to our Seurat object
## as a new assay called Progeny. 
pbmc <- progeny(pbmc, scale=FALSE, organism="Human", top=500, perm=1, 
    return_assay = TRUE)

## We can now directly apply Seurat functions in our Progeny scores. 
## For instance, we scale the pathway activity scores. 
pbmc <- Seurat::ScaleData(pbmc, assay = "progeny") 

## We transform Progeny scores into a data frame to better handling the results
progeny_scores_df <- 
    as.data.frame(t(GetAssayData(pbmc, slot = "scale.data", 
        assay = "progeny"))) %>%
    rownames_to_column("Cell") %>%
    gather(Pathway, Activity, -Cell) 

## We match Progeny scores with the cell clusters.
progeny_scores_df <- inner_join(progeny_scores_df, CellsClusters)

## We summarize the Progeny scores by cellpopulation
summarized_progeny_scores <- progeny_scores_df %>% 
    group_by(Pathway, CellType) %>%
    summarise(avg = mean(Activity), std = sd(Activity))

## We prepare the data for the plot
summarized_progeny_scores_df <- summarized_progeny_scores %>%
    dplyr::select(-std) %>%   
    spread(Pathway, avg) %>%
    data.frame(row.names = 1, check.names = FALSE, stringsAsFactors = FALSE) 

paletteLength = 100
myColor = colorRampPalette(c("Darkblue", "white","red"))(paletteLength)

progenyBreaks = c(seq(min(summarized_progeny_scores_df), 0, 
                      length.out=ceiling(paletteLength/2) + 1),
                  seq(max(summarized_progeny_scores_df)/paletteLength, 
                      max(summarized_progeny_scores_df), 
                      length.out=floor(paletteLength/2)))
progeny_hmap = pheatmap(t(summarized_progeny_scores_df[,-1]),fontsize=14, 
                        fontsize_row = 10, 
                        color=myColor, breaks = progenyBreaks, 
                        main = "PROGENy (500)", angle_col = 45,
                        treeheight_col = 0,  border_color = NA)

图3 PROGENy 单细胞结果展示

5. 一些函数的使用

在v1.20版本里面，除progeny 函数之外，PROGENy包也存在其他函数，如下：

但我们在使用过程中并没有太多的用到这些函数，progeny函数是这个包的核心，那么其他的函数是用来干嘛的呢？

查看progeny 的源码，我们发现，getModel函数是从作者提供的Human 及Mouse 中取出目标通路的基因集,

progeny = function(expr, scale=TRUE, organism="Human", top = 100, perm = 1, 
    verbose = FALSE, z_scores = FALSE, get_nulldist = FALSE, assay_name = "RNA", 
    return_assay = FALSE, ...) {
    UseMethod("progeny")
}
methods("progeny")
#> [1] progeny.default*              progeny.ExpressionSet*        progeny.matrix*               progeny.Seurat*               progeny.SingleCellExperiment*
#> see '?methods' for accessing help and source code
getAnywhere(progeny.matrix)
#> A single object matching ‘progeny.matrix’ was found
#> It was found in the following places
#>   registered S3 method for progeny from namespace progeny
#>   namespace:progeny
#> with value
#> 
#> function (expr, scale = TRUE, organism = "Human", top = 100, 
#>     perm = 1, verbose = FALSE, z_scores = FALSE, get_nulldist = FALSE, 
#>     ...) 
#> {
#>     if (!is.logical(scale)) {
#>         stop("scale should be a logical value")
#>    }
#>     if (!(is.numeric(perm)) || perm < 1) {
#>         stop("perm should be an integer value")
#>     }
#>     if (!is.logical(verbose)) {
#>         stop("verbose should be a logical value")
#>     }
#>     if (!is.logical(z_scores)) {
#>         stop("z_scores should be a logical value")
#>     }
#>     if (!is.logical(get_nulldist)) {
#>         stop("get_nulldist should be a logical value")
#>     }
#>     if (perm == 1 && (z_scores || get_nulldist)) {
#>         if (verbose) {
#>             message("z_scores and get_nulldist are only applicable when the\n                number of permutations is larger than 1.")
#>         }
#>     }
#>     model <- getModel(organism, top = top)
#>     common_genes <- intersect(rownames(expr), rownames(model))
#>     if (verbose) {
#>         number_genes <- apply(model, 2, function(x) {
#>             sum(rownames(model)[which(x != 0)] %in% unique(rownames(expr)))
#>         })
#>         message("Number of genes used per pathway to compute progeny scores:")
#>         message(paste(names(number_genes), ": ", number_genes, 
#>             " (", (number_genes/top) * 100, "%)", sep = "", "\n"))
#>     }
#>     if (perm == 1) {
#>         result <- t(expr[common_genes, , drop = FALSE]) %*% as.matrix(model[common_genes, 
#>             , drop = FALSE])
#>         if (scale && nrow(result) > 1) {
#>             rn <- rownames(result)
#>             result <- apply(result, 2, scale)
#>             rownames(result) <- rn
#>         }
#>     }
#>     else if (perm > 1) {
#>         expr <- data.frame(names = row.names(expr), row.names = NULL, 
#>             expr)
#>         model <- data.frame(names = row.names(model), row.names = NULL, 
#>             model)
#>         result <- progenyPerm(expr, model, k = perm, z_scores = z_scores, 
#>             get_nulldist = get_nulldist)
#>     }
#>     return(result)
#> }
#> <bytecode: 0x5648eb3c2ba0>
#> <environment: namespace:progeny>

其他函数可自行在帮助文档查阅各自功能。

附：当我们找不到说明文档怎么办？

在本软件学习过程中，遇到最明显的bug 就是作者在更新PROGENy之后，以往的帮助文档网页丢失了，这给学习使用R包的我们带来麻烦，这里简要分享我如何寻找这篇文章的帮助文档。

Bioconductor

一般情况下，Bioconductor R包在这个部分会提供说明文档的链接，点开HTML/R_Script 或者查看PDF都可以看到帮助文档，在这里，HTML（https://www.bioconductor.org/packages/release/bioc/vignettes/progeny/inst/doc/progeny.html）并没有太多使用信息，只是介绍了getModel 是怎么用的，而PDF（https://www.bioconductor.org/packages/release/bioc/manuals/progeny/man/progeny.pdf）也只是提供了各个函数的介绍。也就是说，从这里并没有获取到很有价值的信息。图4. PROGENy Bioconductor 界面

图4. PROGENy Bioconductor 界面

参照Bioconductor的获取帮助文档的方式也失败了。

browseVignettes("progeny")
#> No vignettes found by browseVignettes("progeny")

那其实也可以检索一下旧版本的progeny 有没有配套的使用说明。但如果你也进行了这一步，你会发现，这个链接点开啥也没有，而在其他版本的HTML文件中，使用信息也不是那么全。

Github

常见的说明文档也会提供在Github （https://github.com/saezlab/progeny）的首页，但在这里并没有。

因此，下一步查看他的各个文件夹中里面有没有帮助文档。一般来说vignettes 文件夹是储存帮助文档的，感兴趣的可以试着点开看一下

我这里直接说结果，在docs 的下级目录中找到了这两个文件。如果你在github中打开，要是像我一样小白的话，可能看不懂它网页的源代码格式的数据，因此，就直接把他download下来，用本地浏览器打开，你会发现这就是一份操作说明。

结语

总体而言，PROGENy 是个很优秀的R包，但也有其不足的地方，比如，它只涵盖了14条通路。上篇写完才发现，原来已经有其他公众号上已经推送过PROGENy，还引用了文献的例子来介绍PROGENy的使用，读者可以互相参考借鉴，由于来源一致，大部分内容可能会比较趋同。如果对R包原理感兴趣，还是要移步作者原文仔细详读才行哦。

[1] Schubert M, Klinger B, Klünemann M, Sieber A, Uhlitz F, Sauer S, Garnett MJ, Blüthgen N, Saez-Rodriguez J. 2018. Perturbation-response genes reveal signaling footprints in cancer gene expression. *Nature Communications*: [10.1038/s41467-017-02391-6](https://doi.org/10.1038/s41467-017-02391-6)

[2] Holland CH, Szalai B, Saez-Rodriguez J. Transfer of regulatory knowledge from human to mouse for functional genomics analysis. Biochim Biophys Acta Gene Regul Mech. 2020;1863(6):194431. doi:10.1016/j.bbagrm.2019.194431 

[3] Holland, C.H., Tanevski, J., Perales-Patón, J. et al. Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data. Genome Biol 21, 36 (2020). https://doi.org/10.1186/s13059-020-1949-z

- END -

ixxmu / mp_duty

通路反应基因活性推断工具：PROGENy（二） #3031

通路反应基因活性推断工具：PROGENy（二） by 生信菜鸟团

4.0 PROGENy在单细胞转录组中的应用

附：当我们找不到说明文档怎么办？

Bioconductor

Github

结语