R包BioEnricher：一键式富集分析和可视化

R包BioEnricher：一键式富集分析和可视化 by 生信菜鸟团

BioEnricher包是我的好朋友【医学僧公众号】的博主花了很大功夫写的一个R包，现在无偿的分享给大家~

BioEnricher的关键在于解决两个问题：

首先，BioEnricher提供了富集分析的无缝集成，涵盖了各种富集分析基因集数据库，如GO、KEGG、WikiPathways、Reactome、MsigDB、Disease Ontology、Cancer Gene Network、DisGeNET、CellMarker和CMAP（药物）；
其次，BioEnricher提供了各种高级的可视化功能（颜值都很高，发表级的可视化方案），简化了数据呈现的过程，使其更快速、更便捷。

BioEnricher的GitHub在https://github.com/Zaoqu-Liu/BioEnricher，记得去官方github给作者点点小星星，支持一波~

目前R包安装暂不支持install_github：

devtools::install_github("Zaoqu-Liu/BioEnricher")

可以自行下载BioEnricher_0.1.0.zip文件，然后使用本地安装的方式进行安装：

https://github.com/Zaoqu-Liu/BioEnricher/tree/master

install.packages("~/BioEnricher/BioEnricher-master/BioEnricher_0.1.0.zip", repos = NULL, type = "win.binary")

示例数据实战

Step1. 获取基因列表

经常做富集分析的小伙伴都知道，富集分析一般有两种形式：

ORA富集分析，也就是基于差异基因（或者其他形式，例如WGCNA，NMF等）的富集分析，需要我们有一个感兴趣的基因列表，列表里的基因无需排序；
GSEA富集分析，也就是基于Rank排序的富集分析，需要我们事先按照某种排序标准对基因列表进行排序，例如进行差异表达分析得到的所有的基因，按照LogFoldChange对基因进行排序。

这里我们加载BioEnricher包提供的示例数据进行差异分析，得到差异基因列表，为后续的富集分析实战做准备：

library(airway)
library(DESeq2)
library(tidyverse)
library(clusterProfiler)
library(org.Hs.eg.db)
library(BioEnricher)
data(airway)
se <- airway
se$dex <- relevel(se$dex, "untrt") 
res <- DESeqDataSet(se, design = ~ cell + dex)%>%
  estimateSizeFactors()%>%DESeq()%>%
  results()%>%as.data.frame()%>%na.omit()
ann <- bitr(rownames(res),'ENSEMBL','SYMBOL',org.Hs.eg.db)
res <- merge(ann,res,by.x=1,by.y=0)%>%distinct(SYMBOL,.keep_all = T) # Very crude, just as an example

res对象是一个标准的DESeq2差异分析的输出结果，包含所有的基因：

head(res)
#           ENSEMBL   SYMBOL   baseMean log2FoldChange      lfcSE       stat       pvalue         padj
# 1 ENSG00000000003   TSPAN6  708.60217    -0.38125389 0.10065443 -3.7877507 1.520173e-04 1.283638e-03
# 2 ENSG00000000419     DPM1  520.29790     0.20681272 0.11221867  1.8429438 6.533721e-02 1.965458e-01
# 3 ENSG00000000457    SCYL3  237.16304     0.03792059 0.14344472  0.2643568 7.915050e-01 9.114580e-01
# 4 ENSG00000000460 C1orf112   57.93263    -0.08816770 0.28714200 -0.3070526 7.588033e-01 8.950344e-01
# 5 ENSG00000000971      CFH 5817.35287     0.42640222 0.08831338  4.8282857 1.377134e-06 1.821495e-05
# 6 ENSG00000001036    FUCA2 1282.10639    -0.24107114 0.08872064 -2.7171933 6.583813e-03 3.277906e-02

1.1 设置cutoff，获取上下调的DEGs，用于ORA富集分析：

# Define an up-regulated gene list
up.genes <- res$SYMBOL[res$log2FoldChange > 2 & res$padj < 0.05]

# Define a down-regulated gene list
down.genes <- res$SYMBOL[res$log2FoldChange < -2 & res$padj < 0.05]

根据这个cutoff进行截断获得数量不等的上调或下调基因：

length(up.genes)
# [1] 128
length(down.genes)
# [1] 107

1.2 如果用户需要就行GSEA富集分析的话，需要对差异基因列表进行排序，进行如下操作即可：

# Obtain an order ranked geneList.
grlist <- res$log2FoldChange
names(grlist) <- res$SYMBOL
grlist <- sort(grlist,decreasing = T)

grlist是一个按照log2FoldChange排序的基因/log2FoldChange字符串：

head(grlist)
# ALOX15B       ZBTB16       GUCY2D       STEAP4 LOC101928858      ANGPTL7 
# 9.505972     7.352628     5.885112     5.207160     5.098107     5.081572

1.3 BioEnricher纳入了11种基因富集数据库，除了常用的GO，KEGG，Reactome以外，还有：

listEnrichMethod()
# "GO", "KEGG", "MKEGG", "WikiPathways", "Reactome", "MsigDB", "DO", "CGN", "DisGeNET", "CellMarker", "CMAP"

Step2. ORA富集分析

2.1 函数`lzq_ORA`

函数lzq_ORA将执行基因集富集分析，用户可以指定enrich.type参数使用其中一种基因富集数据库运行，例如KEGG富集分析：

# Set enrich.type using an enrichment analysis method mentioned above.
kegg <- lzq_ORA(
  genes = up.genes,
  enrich.type = 'KEGG'
)

KEGG富集分析可以使用pathviewR包进行可视化

res2 <- res[res$log2FoldChange > 0 & res$padj < 0.05,c(2,4)]
res2 <- data.frame(row.names = res2$SYMBOL,R=res2$log2FoldChange)

lzq_KEGGview(gene.data = res2,pathway.id = 'hsa04218')

2.2 函数`lzq_ORA.integrated`

用户可以使用lzq_ORA.integrated进行多种数据库的批量分析：

例如对上调的基因进行富集分析：

# Integrative enrichment analysis of the up-regulated gene list
up.enrich <- lzq_ORA.integrated(
  genes = up.genes,
  background.genes = NULL,
  GO.ont = 'ALL',
  perform.WikiPathways = T,
  perform.Reactome = T,
  perform.MsigDB = T,
  MsigDB.category = 'ALL',
  perform.Cancer.Gene.Network = T,
  perform.disease.ontoloty = T,
  perform.DisGeNET = T,
  perform.CellMarker = T,
  perform.CMAP = T,
  min.Geneset.Size = 3
)
### This function will output its calculation process.
# +++ Updating gene symbols...
# Maps last updated on: Thu Oct 24 12:31:05 2019
# +++ Transforming SYMBOL to ENTREZID...
# 'select()' returned 1:1 mapping between keys and columns
# +++ Performing GO-ALL enrichment...
# +++ Symplifying GO results...
# +++ Performing KEGG enrichment...
# +++ Performing Module KEGG enrichment...
# +++ Performing WikiPathways enrichment...
# +++ Performing Reactome pathways enrichment...
# +++ Performing Disease Ontoloty enrichment...
# +++ Performing Cancer Gene Network enrichment...
# +++ Performing DisGeNET enrichment...
# +++ Performing CellMarker enrichment...
# +++ Performing MsigDB-ALL enrichment...         
# +++ Performing CMAP enrichment...
# +++ 1826 significant terms were detected...
# +++ Done!

对下调的基因进行富集分析：

# Integrative enrichment analysis of the down-regulated gene list
down.enrich <- lzq_ORA.integrated(
  genes = down.genes,
  background.genes = NULL,
  GO.ont = 'ALL',
  perform.WikiPathways = T,
  perform.Reactome = T,
  perform.MsigDB = T,
  MsigDB.category = 'ALL',
  perform.Cancer.Gene.Network = T,
  perform.disease.ontoloty = T,
  perform.DisGeNET = T,
  perform.CellMarker = T,
  perform.CMAP = T,
  min.Geneset.Size = 3
)
# +++ Updating gene symbols...
# Maps last updated on: Thu Oct 24 12:31:05 2019
# +++ Transforming SYMBOL to ENTREZID...
# 'select()' returned 1:1 mapping between keys and columns
# +++ Performing GO-ALL enrichment...
# +++ Symplifying GO results...
# +++ Performing KEGG enrichment...
# +++ Performing Module KEGG enrichment...
# +++ Performing WikiPathways enrichment...
# +++ Performing Reactome pathways enrichment...
# +++ Performing Disease Ontoloty enrichment...
# +++ Performing Cancer Gene Network enrichment...
# +++ Performing DisGeNET enrichment...
# +++ Performing CellMarker enrichment...
# +++ Performing MsigDB-ALL enrichment...                       
# +++ Performing CMAP enrichment...
# +++ 1418 significant terms were detected...
# +++ Done!

生成的富集分析结果是一个list列表：

names(up.enrich)
# [1] "GO"                    "simplyGO"              "KEGG"                  "Module.KEGG"           "WikiPathways"         
# [6] "Reactome"              "DiseaseOntology"       "CancerGeneNetwork"     "DisGeNET"              "CellMarker.Enrichment"
# [11] "MsigDB.ALL"            "CMAP"

2.3 可视化

BioEnricher内置了很多高颜值的可视化方案：

barplot

lzq_ORA.barplot1(enrich.obj = up.enrich$simplyGO)

dotplot

lzq_ORA.dotplot1(enrich.obj = up.enrich$simplyGO)

结合上调和下调的通路进行可视化：

lzq_ORA.barplot2(
  enrich.obj1 = up.enrich$simplyGO,
  enrich.obj2 = down.enrich$simplyGO,
  obj.types = c('Up','Down')
)

指定use.Chinese = T支持中文模式：

Note: use.Chinese exists all the plot functions.

lzq_ORA.barplot2(
  enrich.obj1 = up.enrich$simplyGO,
  enrich.obj2 = down.enrich$simplyGO,
  obj.types = c('Up','Down'),
  use.Chinese = T
)

Step3. GSEA富集分析

3.1 函数`lzq_GSEA`

lzq_GSEA函数将执行基因集富集分析，包括GO, KEGG, WikiPathways, Reactome, MsigDB, Disease ontolty, Cancer Gene Network, DisGeNET, CellMarker和CMAP。同样的，指定enrich.type参数可指定目标数据库：

# Set enrich.type using an enrichment analysis method mentioned above.
fit <- lzq_GSEA(grlist, enrich.type = 'KEGG')
# +++ Updating gene symbols...
# Maps last updated on: Thu Oct 24 12:31:05 2019
# +++ Transforming SYMBOL to ENTREZID...
# 'select()' returned 1:many mapping between keys and columns
# +++ Performing KEGG enrichment...
# +++ 5 significant terms were detected...
# +++ Done!

3.2 函数`lzq_GSEA.integrated`

lzq_GSEA.integrated函数可一键式运行所有的基因集数据库：

# Integrative enrichment analysis of the up-regulated gene list
fit2 <- lzq_GSEA.integrated(
  genes = grlist,
  gene.type = 'SYMBOL',
  GO.ont = 'ALL',
  perform.WikiPathways = T,
  perform.Reactome = T,
  perform.MsigDB = T,
  MsigDB.category = 'ALL',
  perform.Cancer.Gene.Network = T,
  perform.disease.ontoloty = T,
  perform.DisGeNET = T,
  perform.CellMarker = T,
  perform.CMAP = T,
  min.Geneset.Size = 3
)
# +++ Updating gene symbols...
# Maps last updated on: Thu Oct 24 12:31:05 2019
# +++ Transforming SYMBOL to ENTREZID...
# 'select()' returned 1:many mapping between keys and columns
# +++ Performing GO-ALL enrichment...
# +++ Symplifying GO results...
# +++ Performing KEGG enrichment...
# +++ Performing Module KEGG enrichment...
# +++ Performing WikiPathways enrichment...
# +++ Performing Reactome pathways enrichment...
# +++ Performing Disease Ontoloty enrichment...
# +++ Performing Cancer Gene Network enrichment...
# no term enriched under specific pvalueCutoff...
# +++ Performing DisGeNET enrichment...
# +++ Performing CellMarker enrichment...
# +++ Performing MsigDB-ALL enrichment...                                               
# +++ Performing CMAP enrichment...
# no term enriched under specific pvalueCutoff...
# +++ 311 significant terms were detected...
# +++ Done!

fit2对象也是一个list结构，每一个元素对应了一个基因集数据库的结果。

3.3 可视化

Visualize analyzing result of GSEA

lzq_gseaplot(
  fit2$simplyGO,
  Pathway.ID = 'GO:0030016',
  rank = F,
  statistic.position = c(0.71,0.85),
  rel.heights = c(1, 0.4)
)

Enrichment barplot for positive or negative GSEA results

lzq_GSEA.barplot1(enrich.obj = fit2$simplyGO,type = 'pos')

Enrichment dotplot for positive or negative GSEA results

lzq_GSEA.dotplot1(enrich.obj = fit2$simplyGO,type = 'pos')

同样的，lzq_GSEA.barplot2支持同时可视化激活和抑制的GSEA通路：

lzq_GSEA.barplot2(enrich.obj = fit2$simplyGO)

支持中文模式：

lzq_GSEA.barplot2(enrich.obj = fit2$simplyGO,use.Chinese = T)

总结

总体来说，BioEnricherR包内置了很多基因集数据库，一键式批量分析非常方便；另外，可视化的颜值也很高，都是发表级的，可见R包作者也是非常细心。更多内容见BioEnricher_0.1.0.pdf官方手册，记得去官方github（https://github.com/Zaoqu-Liu/BioEnricher）给作者点点小星星，支持一波~

另外，关于富集分析，特别是单细胞富集分析，还有一些小细节和小技巧，我有时间整理一下，尽情期待！

- END -

ixxmu / mp_duty