TFforge (Transcription factor forward genomics) is a method to assiciate transciption factor binding site divergence in putative regulatory elements with pehnotypic changes between species [1].
git clone https://github.com/hillerlab/TFforge
Download Stubb (https://github.com/UIUCSinhaLab/Stubb/) and newmat11 (http://www.robertnz.net/download.html)
cp /path/to/stubb_2.1.tar.gz /path/to/newmat11.tar.gz .
tar -xvf stubb_2.1.tar.gz
tar -xvf newmat11.tar.gz -C stubb_2.1/lib/newmat/
# TFforge uses a slightly modified version of Stubb
cd stubb_2.1/
patch -p1 < ../TFforge/stubb.patch
cd lib/newmat/
gmake -f nm_gnu.mak
cd ../../
make
export PATH=$PATH:`pwd`/bin
cd ../TFforge/
export PATH=$PATH:`pwd`
# this directory contains a minimal example of simulated CREs
cd example
# Create a joblist file 'alljobs_simulation' containing the branch_scoring jobs
TFforge.py data/tree_simulation.nwk motifs.ls data/species_lost_simulation.txt elements_simul.ls \
--windowsize 200 --scorefile scores_simulation -bg=background/
# Run job list as batch or in parallel
bash alljobs_scores_simulation > scores_simulation
# Run the association test
TFforge_statistics.py data/tree_simulation.nwk motifs.ls data/species_lost_simulation.txt \
elements_simul.ls --scorefile scores_simulation
# This generates a file 'significant_elements_simulation' that alphabetically lists the elements, their P-value and the number of branches
>motif_name motif_length [optional fields]
pos1_freqA pos1_freqC pos1_freqG pos1_freqT
pos2_freqA pos2_freqC pos2_freqG pos2_freqT
..
posN_freqA posN_freqC posN_freqG posN_freqT
\<
Generate the TFforge branch_scoring commands for all CREs and TFs.
TFforge.py <tree> <motif_list> <lost_species_list> <element_list>
This creates for every CRE and every TF a TFforge_branchscoring.py job. Each line in alljobs\<scorefile> consists a single job. Each job is completely independent of any other job, thus each job can be run in parallel to others.
Execute that alljobs file. Either sequentially via
bash alljobs_simulation > scores_simulation
or run it in parallel by using a compute cluster. Every job returns a line in the following format:
motif_file CRE_file (branch_start>branch_end:branch_score )* <
which should be concatenated into a file called "\<scorefile>".
TFforge_statistics.py <tree> <motif_list> <lost_species_list> <element_list>
TFforge_statistics.py classifies branches into trait-loss and trait-preserving and assesses for every TF the significance the association of this classification with the branch scores of all CREs.
The output (significantfactors
element Pearson NoPos NoNeg
These list for every transcription factor (element) the number of trait-loss branches (NoPos) and trait-preserving branches (NoNeg) and the significance of a positive Pearson correlation (Are trait-loss branches correlating with lower scores?), given by the p-value in the Pearson column.
--scorefile <name>
Name file <name> instead of "scores"
--windowsize/-w <n>
Scoring window used in sequence scoring
--background/-bg <folder>
Background used for sequence scoring. Either a file or a folder structure with backgrounds for different GC contents
--scrCrrIter <n>
Stubb score is corrected with the average score of <number> of shuffled sequences. Default is 10. 0 turns the score correction off
--scorefile <name>
Use <name> as scorefile instead of "scores"
--filterspecies <comma separated list>
Exclude species from analyses
--elements <file>
Analyse only the elements specified <file>
--verbose/-v
--debug/-d
--no_ancestral_filter
Turn of ancestral score filtering
--no_branch_filter
Turn of branch score filtering
--no_fixed_TP
Do not fix transition probabilites while computing branch scores
--filter_branch_threshold <x>
Threshold for branch filter; By default branches are filtered if start and end node score are below 0
--filter_GC_change <x>
Filter branches with a GC content change above <x>
--filter_length_change <x>
Filter branches with a relative length change above <x>
[1] Langer BE, Hiller M. TFforge utilizes large-scale binding site divergence to identify transcriptional regulators involved in phenotypic differences. Nucleic Acids Research, 47(4), 2019
[2] Sinha S, van Nimwegen E, Siggia ED. A Probabilistic Method to Detect Regulatory Modules. Bioinformatics, 19(S1), 2003
[3] Hubisz MJ, Pollard KS, Siepel A. PHAST and RPHAST: Phylogenetic Analysis with Space/Time Models. Briefings in Bioinfomatics 12(1):41-51, 2011.