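How Do Transcription Factor Lists Differ Across Databases? by 生信菜鸟团

The apprentice assignment this time: which transcription factor list is the most complete?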

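High-throughput sequencing has developed rapidly in recent years. In a routine RNA-seq analysis, we first identify relevant genes of interest and then validate their downstream target genes. Studying the transcription factors that regulate those genes from upstream, however, can add real depth to later mechanistic work. A systematic approach is to annotate the transcription factors and cluster them by expression, use WGCNA to establish the correlation between candidate transcription factors and the trait of interest, and then build a regulatory network with transcription factors as hub genes.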

Here we have gathered some practical websites for predicting transcription factors and their downstream target genes:

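footprintDB (https://footprintdb.eead.csic.es/index.php) stores three kinds of information: transcription factors, DNA motifs, and DNA binding sites. It can predict which transcription factors bind a given DNA site or motif, and which DNA motifs or sites are likely to be recognized by a given DNA-binding protein.
Cistrome DB (http://cistrome.org/db/#/) is currently one of the most comprehensive databases for ChIP-seq and DNase-seq data, with 30451 human and 26013 mouse samples covering transcription factors, histone modifications, and chromatin accessibility. Besides the genes regulated by a transcription factor, it provides detailed data annotation, analysis results, and per-dataset details (QC status, motif analysis results, predicted target genes). You can also view the signal distribution in a genome browser and download the result files.
TRRUST (https://www.grnpedia.org/trrust/downloadnetwork.php) contains not only the target genes of each transcription factor but also the regulatory relationships among transcription factors.
Other commonly used resources are JASPAR (http://jaspar.genereg.net/), AnimalTFDB (http://bioinfo.life.hust.edu.cn), and hTFtarget (http://bioinfo.life.hust.edu.cn/hTFtarget#!/).
For predicting transcription factors and target genes in non-model animals, we recommend Harmonizome 3.0 (maayanlab.cloud/Harmonizome/).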

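In addition, KnockTF (https://bio.liclab.net/KnockTFv2/download.php) offers a first look at which target genes change expression after a transcription factor of interest is knocked down, which can guide the downstream validation experiments.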

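Different databases collect different transcription factor information. Below, we take three sources as examples (AnimalTFDB 3.0, The Human Transcription Factors, and the motifAnnotations_hgnc_v9 data bundled with the RcisTarget package) and show how the transcription factors they contain differ:

****Reading the TF lists downloaded from each database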

# 1. From AnimalTFDB3; download link: http://bioinfo.life.hust.edu.cn/AnimalTFDB/#!/
# Use forward slashes in Windows paths: "\U" inside an R string is parsed as an escape sequence
AnimalTFDB3_TFs <- read.csv("C:/Users/Lenovo/Desktop/转录因子Venn图/AnimalTFDB3_Homo_sapiens_TF.csv")
View(AnimalTFDB3_TFs)
# 2. From CCBR; download link: http://humantfs.ccbr.utoronto.ca/
CCBR_TFs <- read.csv("C:/Users/Lenovo/Desktop/转录因子Venn图/CCBR_Homo_sapiens.csv")
View(CCBR_TFs)
# 3. Extract the TF list from the motifAnnotations_hgnc data in the RcisTarget package
# First explore where the TF gene lists mentioned above come from, then draw a Venn diagram
# Install the RcisTarget package
options(repos = "http://cran.rstudio.com/")
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("RcisTarget")
# To support parallel execution:
BiocManager::install(c("doMC", "doRNG"))
# For the examples in the follow-up section of the tutorial:
BiocManager::install(c("DT", "visNetwork"))
library(RcisTarget)
# Once RcisTarget is loaded, you can browse its documentation
# Explore tutorials in the web browser:
browseVignettes(package = "RcisTarget")
# Command line-based:
vignette(package = "RcisTarget") # list
vignette("RcisTarget") # open
# Select motif database to use (i.e. organism and distance around TSS)
a <- data(list = "motifAnnotations_hgnc_v9", package = "RcisTarget")
a
dim(motifAnnotations_hgnc_v9) # the built-in annotation table has about 160,000 rows
# Each TF gene has multiple motifs, but there are fewer than 2000 TFs
motifAnnotations_hgnc_v9[1:4, 1:4]
length(unique(motifAnnotations_hgnc_v9$TF)) # 1839 TFs
unique(motifAnnotations_hgnc_v9$TF) # list the distinct TF symbols

****Venn diagram

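# Collect the TF symbols from the three sources as de-duplicated sets
VENN.LIST <- list(
  RcisTarget_TF = unique(motifAnnotations_hgnc_v9$TF),
  AnimalTFDB3_TFs = unique(AnimalTFDB3_TFs$Symbol),
  CCBR_TFs = unique(CCBR_TFs$TF)
)
library(VennDiagram)
# filename = NULL returns a grid object instead of writing an image file
venn.plot <- venn.diagram(VENN.LIST, filename = NULL,
  fill = c("red", "blue", "green"),
  alpha = c(0.5, 0.5, 0.5), cex = 2, cat.fontface = 4,
  category.names = names(VENN.LIST),
  main = "venn.diagram")
grid.draw(venn.plot)

# The same comparison as an UpSet plot
library(UpSetR)
p <- upset(fromList(VENN.LIST), order.by = "freq")
p
p$New_data # membership matrix behind the plot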

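In short, most transcription factors are shared across these sources, but some differences remain. When selecting transcription factors for a study, we recommend integrating at least three sources and analysing their overlap.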
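The overlap analysis recommended here can also be done with plain set operations, without any plotting. The following is a minimal sketch in base R; the three vectors are made-up toy symbols standing in for the real lists, and the "at least two databases" threshold is an illustrative assumption, not a rule from the original analysis.

```r
# Toy gene symbols standing in for the three real TF lists (hypothetical data).
tf_lists <- list(
  RcisTarget  = c("HOXA9", "ZNF8", "NR1H2", "TP53", "TP53"),
  AnimalTFDB3 = c("HOXA9", "TP53", "MYC", "ZNF8"),
  CCBR        = c("HOXA9", "TP53", "MYC", "GATA1")
)

# Drop duplicates and missing values so each list behaves as a set.
tf_sets <- lapply(tf_lists, function(x) unique(x[!is.na(x)]))

# TFs present in every database, and TFs present in at least one.
tf_core  <- Reduce(intersect, tf_sets)
tf_union <- Reduce(union, tf_sets)

# For each TF, count how many databases list it.
tf_support <- table(unlist(tf_sets))

# TFs supported by at least two databases (assumed threshold).
tf_consensus <- names(tf_support)[tf_support >= 2]

print(tf_core)
print(sort(tf_consensus))
```

With real data, `tf_lists` would hold the symbol columns read from each database. Gene symbol aliases may need to be harmonised first, otherwise the same gene can appear under two names and be counted as a difference.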

Which transcription factor list is the most complete

Two web tools

I came across an introduction to transcription factor lists on the 生信菜鸟团 WeChat account, in a post on common gene sets for TCGA data mining. First, the Cancer Manag Res. 2020 paper "Prognostic and Predictive Value of a 15 Transcription Factors (TFs) Panel for Hepatocellular Carcinoma" states:

The TF list was downloaded from the Human Transcription Factors website (http://humantfs.ccbr.utoronto.ca/).

Then the 2021 paper "A Transcription Factor-Based Risk Model for Predicting the Prognosis of Prostate Cancer and Potential Therapeutic Drugs" gives another source:

A total of 1665 transcription factors were obtained from the Animal TFDB database. (AnimalTFDB 3.0; link: http://bioinfo.life.hust.edu.cn/AnimalTFDB/#!/)

Both databases list close to 2000 transcription factor genes.

Two papers

First, the 2018 Cell paper: "The Human Transcription Factors"

Then the 2020 NBT paper: "A comprehensive library of human transcription factors for cell fate engineering"

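Xiaole Liu's Cistrome database

See: http://cistrome.org/db/#/stat

I downloaded the human_factor_full_QC.txt file from there and counted its entries. For human, 1359 transcription factors have ChIP-seq data, somewhat fewer than the 1600~2000 found in the two web databases above.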
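A count like the one above can be sketched as follows. The original does not show how the file was tallied, so this is only one plausible way: the column name `Factor`, the `DCid` column, and the tab-delimited layout are assumptions that should be checked against the header of the real file, and the toy table is made up.

```r
# Count distinct factor names in a sample-level metadata table.
# `factor_col` is a hypothetical column name; check the header of the real file.
count_unique_factors <- function(meta, factor_col = "Factor") {
  stopifnot(factor_col %in% colnames(meta))
  factors <- trimws(as.character(meta[[factor_col]]))
  factors <- factors[!is.na(factors) & nzchar(factors)]
  length(unique(factors))
}

# Toy table standing in for human_factor_full_QC.txt (made-up rows).
toy_meta <- data.frame(
  DCid   = 1:5,
  Factor = c("CTCF", "CTCF", "MYC", "GATA1", " MYC"),
  stringsAsFactors = FALSE
)
n_factors <- count_unique_factors(toy_meta)
print(n_factors)

# With the real file (assuming it is tab-delimited with a header row):
# meta <- read.delim("human_factor_full_QC.txt", stringsAsFactors = FALSE)
# count_unique_factors(meta, factor_col = "Factor")
```

Counting unique names rather than rows matters here because one factor usually has many ChIP-seq samples.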

The motifAnnotations_hgnc data in the RcisTarget package

The code is as follows:

library(RcisTarget)
# Select motif database to use (i.e. organism and distance around TSS)
data(motifAnnotations_hgnc)
# The motifAnnotations_hgnc table bundled with RcisTarget has about 160,000 rows
# Each TF gene has multiple motifs, but there are fewer than 2000 TFs
motifAnnotations_hgnc[1:4, 1:4]
length(unique(motifAnnotations_hgnc$TF))

The output shows 1839 transcription factors:

> dim(motifAnnotations_hgnc)
[1] 163192      7
> motifAnnotations_hgnc[1:4,1:4]
              motif     TF directAnnotation inferred_Orthology
1:   bergman__Abd-B  HOXA9            FALSE               TRUE
2:    bergman__Aef1   ZNF8            FALSE               TRUE
3:     bergman__Cf2 ZNF853            FALSE               TRUE
4: bergman__EcR_usp  NR1H2            FALSE               TRUE
> length(unique(motifAnnotations_hgnc$TF))
[1] 1839
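Since roughly 160,000 rows map to only 1839 transcription factors, each factor must be annotated with many motifs. A hedged sketch of how to look at that distribution is shown below; it uses a made-up toy table with the same `motif` and `TF` columns as the output above, not the real annotation data.

```r
# Toy annotation table with the same `motif` and `TF` columns as the printed output
# (rows are made up for illustration).
toy_annot <- data.frame(
  motif = c("m1", "m2", "m3", "m4", "m4", "m5"),
  TF    = c("HOXA9", "HOXA9", "ZNF8", "NR1H2", "NR1H2", "HOXA9"),
  stringsAsFactors = FALSE
)

# Keep one row per motif-TF pair, then count motifs for each TF.
motif_tf_pairs <- unique(toy_annot[, c("motif", "TF")])
motifs_per_tf  <- sort(table(motif_tf_pairs$TF), decreasing = TRUE)

print(motifs_per_tf)
summary(as.integer(motifs_per_tf))
```

Applied to the real table, the same two steps would show which transcription factors carry the most motif annotations.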

The gene list of these transcription factors is easy to export.
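One simple way to do the export is sketched below. The helper name `export_tf_list` and the output file name are hypothetical, and the input is a toy vector; with RcisTarget loaded, `motifAnnotations_hgnc$TF` would be passed instead.

```r
# Write a sorted, de-duplicated gene list to a plain-text file, one symbol per line.
export_tf_list <- function(tf, path) {
  tf <- sort(unique(tf[!is.na(tf)]))
  writeLines(tf, con = path)
  invisible(tf)
}

# Toy symbols; a temporary directory keeps the example from touching real files.
out_file <- file.path(tempdir(), "tf_list.txt")
exported <- export_tf_list(c("ZNF8", "HOXA9", "HOXA9", NA, "NR1H2"), out_file)
readLines(out_file)
```

A one-symbol-per-line text file is easy to paste into the web tools mentioned earlier or to read back with `readLines()`.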
