Transcription factors (TFs) are proteins that control the rate at which genetic information is transcribed, thereby regulating cellular gene expression. A better understanding of TFs in a bacterial community opens avenues for exploring gene regulation in ecosystems where bacteria play a key role. Here we describe PredicTF, the first platform supporting the prediction and classification of putative bacterial TFs not only in single species but also in complex microbial communities. In summary, we collected publicly available data on TFs, starting with CollecTF, a bacterial TF database containing experimentally validated TFs. These data were merged with TF sequences from UniProt. The merged and hand-curated TF database (BacTFDB) was used to train a deep learning model (PredicTF) to predict TFs and their families in genomes and metagenomes. Here, we describe the use of PredicTF to predict TFs for single bacterial species (genomes and transcriptomes) and complex communities (metagenomes and metatranscriptomes) (Figure 1). Using PredicTF, the user can determine the distribution of TFs in complex communities, opening the potential to evaluate regulatory networks in different ecosystems. Predicting transcription factors with PredicTF is user-friendly, as it only requires users to run a single command.
Rationale of the pipeline. The pipeline uses the Bacterial Transcription Factor Database (BacTFDB) and the DeepARG approach [1] to train (1. Training) a deep learning model named PredicTF. PredicTF can take genomes (a.1) or metagenomes (b.1) as input and reports the predicted transcription factors and their respective families in a text file (2. Prediction & Annotation). Finally, the listed and annotated TFs can be mapped to transcriptomes (a.2) or metatranscriptomes (b.2), providing a list of TFs that are active under specific conditions (3. Mapping transcripts TFs).
The computational resources required vary greatly with the amount of data in your database. The training step is computationally intensive because it involves deep learning, so we recommend training with the GPU routines from Theano, a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently (http://deeplearning.net/software/theano/) [5]. However, this heavy computation is required only once, to obtain the deep learning model (PredicTF); the prediction routines do not require such resources. PredicTF is an open-source tool that is updated twice a year and can be downloaded from its GitHub page.
To create the Bacterial Transcription Factor Database (BacTFDB), we collected data from two publicly available databases. We started with CollecTF [11], a well described and characterized database. Since CollecTF does not provide an API for bulk download, we developed Python code (version 2.7) using the Beautiful Soup 4.4.0 library to retrieve the data from CollecTF. With this strategy, we downloaded sequences of 390 experimentally validated TFs distributed over 44 TF families. Additionally, we retrieved TF sequences from UniProt using UniProt's API with the filter “Reviewed (Swiss-Prot) - Manually annotated”, the bacteria taxonomy and a set of specific keywords (Transcription factor, transcriptional factor, regulator, transcriptional repressor, transcriptional activator, transcriptional regulator). The UniProt API was accessed on 8-Sep-2019. Next, we merged the data collected from CollecTF and UniProt, resulting in a total of 21,971 TFs. We then removed redundant TF entries, as well as TF sequences lacking a TF family, since PredicTF was designed to also assign the TF family. Finally, a manual inspection was performed to remove inconsistencies in letter case and stray characters associated with the database header. The final database (BacTFDB) contains a total of 11,691 unique TF sequences (Figure 2).
Scheme used for the construction of BacTFDB. The Bacterial Transcription Factor Database (BacTFDB) was created from two publicly available databases. We collected 390 TF amino acid sequences from CollecTF and 21,581 from UniProt (accessed 8-Sep-2019). Merging the data from the two databases resulted in a total of 21,971 TF amino acid sequences. We removed redundant TF entries and, since PredicTF was designed to also assign the TF family, TF sequences lacking a TF family. Finally, a manual inspection was performed to remove misspellings, inconsistencies in letter case and stray characters associated with the database header. The final database (BacTFDB) contains a total of 11,691 unique TF sequences.
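The merging and filtering steps described above can be illustrated with a minimal Python sketch. It is not the code used to build BacTFDB. The function name, the tuple layout and the toy records are hypothetical, and treating two entries as redundant when their sequences are identical is an assumption, since the criterion for redundancy is not specified here.

```python
from collections import OrderedDict


def merge_tf_records(collectf, uniprot):
    """Merge two lists of (identifier, family, sequence) tuples.

    Records without a family are dropped, and only the first record
    seen for each distinct sequence is kept (assumed redundancy rule).
    """
    unique = OrderedDict()
    for identifier, family, sequence in list(collectf) + list(uniprot):
        family = (family or "").strip()
        sequence = sequence.strip().upper()
        if not family or not sequence:
            continue  # a family label is needed to train the family classifier
        if sequence not in unique:
            unique[sequence] = (identifier, family, sequence)
    return list(unique.values())


# Hypothetical toy records, not taken from BacTFDB.
collectf = [("id1", "LysR", "MKLT"), ("id2", "", "MAAA")]
uniprot = [("id3", "LysR", "mklt"), ("id4", "TetR", "MSSS")]
merged = merge_tf_records(collectf, uniprot)
print(len(merged))
```

In this toy input, id2 is dropped because it has no family and id3 is dropped because its sequence duplicates id1 once letter case is normalized.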
To use PredicTF the following is required:
Operating system: Linux64
Programming languages: Python 2.7
Module: Anaconda2/5.3.0
4.1) DEPENDENCIES
PredicTF requires the installation of:
DeepARG repository (https://bitbucket.org/gusphdproj/deeparg-largerepo/src/master/) [1].
DIAMOND (https://github.com/bbuchfink/diamond) [2].
Nolearn Lasagne deep learning library (https://lasagne.readthedocs.io/en/latest/) [3].
Sklearn machine learning routines (https://scikit-learn.org/stable/) [4].
Theano (http://deeplearning.net/software/theano/) [5].
The following are only required when integrating (meta)transcriptomic data:
Trim Galore - v0.0.4 dev (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) [6].
MetaSPADES - v3.12.0 (https://github.com/ablab/spades#meta) [7].
Emboss transeq (http://www.bioinformatics.nl/cgi-bin/emboss/transeq) [8].
Bowtie2 - v2.3.0 (https://sourceforge.net/projects/bowtie-bio/) [9].
SAMTools - v1.9 (http://github.com/samtools/) [10].
4.2) INSTALLATION
1) Open a terminal and clone the source code:
git clone https://github.com/mdsufz/PredicTF.git
2) If you don't have conda installed use the following:
wget https://repo.anaconda.com/archive/Anaconda2-2019.10-Linux-x86_64.sh
sh Anaconda2-2019.10-Linux-x86_64.sh
Create a conda environment in a location of choice
conda create -n deeparg_env python=2.7.18
3) Activate the conda environment
conda activate deeparg_env
4) Install DIAMOND from the bioconda channel
conda install -c bioconda diamond==0.9.24
5) Install additional dependencies
pip install -r https://raw.githubusercontent.com/dnouri/nolearn/0.6.0/requirements.txt
pip install nolearn
pip install tqdm
6) Clone the DeepARG repository inside your PredicTF folder
git clone https://bitbucket.org/gusphdproj/deeparg-largerepo.git
The original DeepARG code does not allow multiple instances of model training and testing. We have modified its source code to allow this. Copy the following files from the folder install_files to their respective directories inside the deeparg-largerepo folder.
cp /path/to/PredicTF/install_files/main/deepARG.py /path/to/PredicTF/deeparg-largerepo/
cp /path/to/PredicTF/install_files/predict/deepARG.py /path/to/PredicTF/deeparg-largerepo/predict/bin/
BacTFDB model files: Due to the size of the model files, you will need to use the following link to get the necessary files and store them in /PredicTF/BacTFDB/model/v2/:
Files - https://nc.ufz.de/s/e9geJ4FKJk8cWLs Password - 6oHaiWQQY9
Go to the directory where the program was saved and open the file options.py.
Replace path = '/deeparg-ss/'; with the directory where DeepARG was cloned (the deepARG path).
For instance, if deepARG was cloned at /home/user/deeparg-largerepo/, the options.py file should look like:
path = '/home/user/deeparg-largerepo/';
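If you prefer to script this edit, the following Python sketch shows one way to rewrite the assignment. It assumes that options.py contains a single path = '...' line formatted as shown above. The function name set_deeparg_path is hypothetical and not part of DeepARG or PredicTF.

```python
import re


def set_deeparg_path(options_text, new_path):
    """Return options_text with its `path = '...'` assignment set to new_path."""
    if not new_path.endswith("/"):
        new_path += "/"
    # Match the assignment up to the closing quote; anything after it
    # (such as the trailing ';') is left untouched.
    pattern = re.compile(r"^(\s*path\s*=\s*)(['\"]).*?\2", re.MULTILINE)
    updated, count = pattern.subn(
        lambda match: "{}'{}'".format(match.group(1), new_path),
        options_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no `path = '...'` assignment found")
    return updated


example = "path = '/deeparg-ss/';\n"
print(set_deeparg_path(example, "/home/user/deeparg-largerepo"))
```

To apply it to a real file, read the contents of options.py into a string, pass it through the function and write the result back. Keep a backup copy of the original file first.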
Finally, allow diamond to be executed:
Go to the path you set in options.py and run:
chmod +x diamond (only for Linux)
For example:
cd /home/user/deeparg-largerepo
chmod +x diamond
Note: All FASTA files (.fa) must be composed of amino acid sequences!
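A quick sanity check for this requirement is sketched below in Python. It is not part of PredicTF. The heuristic, which flags a sequence as nucleotide when at least 90% of its letters are A, C, G, T, U or N, and the function names are assumptions chosen for illustration.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def looks_like_protein(sequence, threshold=0.9):
    """Return True unless the sequence is almost entirely nucleotide letters."""
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    nucleotide_count = sum(1 for c in residues if c in NUCLEOTIDE_LETTERS)
    return nucleotide_count / float(len(residues)) < threshold


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records: one protein and one DNA sequence.
example = [">prot1", "MKLTRRQLEAFVAVAE", ">dna1", "ATGAAACTGACCCGT"]
for name, seq in read_fasta(example):
    print(name, looks_like_protein(seq))
```

Because read_fasta accepts any iterable of lines, an open file handle can be passed to it directly. A record flagged as nucleotide most likely still needs to be translated, for example with EMBOSS transeq from the dependency list.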
Predict Transcription Factors
If you only want to predict transcription factors in your target genome(s) using the trained model, run the following command:
sh predictf_in_genome.sh /path/to/PredicTF/folder /path/to/target/genome.fa /path/to/output/folder
As an example:
sh predictf_in_genome.sh /home/user/PredicTF /home/user/fasta_files/target_genome.fa /home/project/results
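This script takes the following as input, in this order: the path to the PredicTF folder, the path to the target genome FASTA file and the path to the output folder.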
Training a new model
The user can also generate new models for other genes of interest.
The database (FASTA file) used for training requires headers that match the following structure:
>uniq_id|FEATURES|source|class|name
Example: >A0A024HKB0|FEATURES|CollecTF|LysR|ClcR
In this example, A0A024HKB0 is the unique identifier of a specific TF, FEATURES is a mandatory literal field, CollecTF is the database the sequence came from, and LysR is the family (class) of transcription factors to which ClcR (name) belongs.
Note: an example file can be downloaded from here.
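Before training, it can be useful to check that every header follows this structure. The Python sketch below is one possible validator, not part of PredicTF. The names TFHeader and parse_training_header are hypothetical, and rejecting empty fields is an assumption.

```python
from collections import namedtuple

TFHeader = namedtuple("TFHeader", ["uniq_id", "source", "tf_class", "name"])


def parse_training_header(line):
    """Parse '>uniq_id|FEATURES|source|class|name' into a TFHeader."""
    if not line.startswith(">"):
        raise ValueError("header must start with '>'")
    fields = line[1:].strip().split("|")
    if len(fields) != 5:
        raise ValueError("expected 5 '|'-separated fields, got %d" % len(fields))
    uniq_id, marker, source, tf_class, name = fields
    if marker != "FEATURES":
        raise ValueError("second field must be the literal 'FEATURES'")
    if not all([uniq_id, source, tf_class, name]):
        raise ValueError("empty field in header")  # assumed to be invalid
    return TFHeader(uniq_id, source, tf_class, name)


header = parse_training_header(">A0A024HKB0|FEATURES|CollecTF|LysR|ClcR")
print(header.tf_class)  # LysR
```

Running the parser over every header line of the training file and collecting the ValueError messages gives a list of the records that need to be fixed.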
Next, the user only needs to run the following command:
sh predictf_train_genes.sh /path/to/project/folder /path/to/fasta_file/for/training.fa /path/to/folder/where/predictf_env/was/created /path/to/predicTF/installation/folder
(Note: this step requires a large amount of computational resources and may need to be performed in a cluster.)
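This script takes the following as input, in this order: the project folder, the FASTA file used for training, the folder where predictf_env was created and the PredicTF installation folder.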
All resulting files and folders generated during this process will be stored in the user-defined project folder.
To perform predictions on your genomes of interest using your own models, run the following command:
sh predictf_in_genome_user.sh /path/to/PredicTF/folder /path/to/target/genome.fa /path/to/output/folder
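This task takes the following as input, in this order: the path to the PredicTF folder, the path to the target genome FASTA file and the path to the output folder.
The user can also integrate transcriptomic data with genomic data. To do so, run the following command: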
sh /path/to/PredicTF/scripts/transcript2genome.sh path/to/PredicTF path/to/folder/with/file.out.mapping.ARG path/to/target/genome.fa name/of/folder/to/store/results
Note.1: name/of/folder/to/store/results will be generated by the script - only provide the name and location (e.g. /home/user/TF2GEN)
Note.2: All modules and packages should be loaded prior to running this command (e.g. Bowtie2, SAMtools)
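To confirm that the required tools are available before starting the mapping step, a small check such as the following Python sketch can be used. It is a standalone helper written for Python 3.3 or later, since shutil.which is not available in Python 2.7, and should therefore be run outside the PredicTF environment. The executable names bowtie2 and samtools are assumptions about a typical installation, and missing_tools is a hypothetical name.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


# Executable names are assumptions; adjust them to your installation.
required = ["bowtie2", "samtools"]
absent = missing_tools(required)
if absent:
    print("Not found on PATH: " + ", ".join(absent))
else:
    print("All required tools found.")
```

The which parameter exists only so that the lookup can be replaced in tests. In normal use the default is sufficient.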
PredicTF is available here.