uci-cbcl / EXTREME

An online EM implementation of the MEME model for fast motif discovery in large ChIP-Seq and DNase-Seq Footprinting data
GNU General Public License v2.0

README for EXTREME 2.0

NOTE: As of 2018, this repository is deprecated. Unless you have a specific reason to use EXTREME, I suggest using YAMDA instead.

EXTREME is an efficient motif discovery algorithm. It applies the online EM algorithm to discover motifs. It uses the same model as MEME, representing sequence binding preference as position frequency matrices (PFMs). EXTREME is written in Python and incorporates source code from MEME and DREME (by T. Bailey), which are part of the MEME Suite. EXTREME is published. You can download an Advance Access version of the paper here. If you have any more questions, you can e-mail me at dxquang@uci.edu.

Citing EXTREME

Quang, D., & Xie, X. (2014). EXTREME: an online EM algorithm for motif discovery. Bioinformatics, btu093.

INSTALL

Required

Optional

Install from source

Download the latest release (zip: https://github.com/uci-cbcl/EXTREME/archive/v2.0.0.zip, tar.gz: https://github.com/uci-cbcl/EXTREME/archive/v2.0.0.tar.gz) and decompress it.

Optional: If you want to calculate E-values for your motifs, you need to build Cython bindings to the MEME source files. Keep in mind that Cython and gcc can be difficult to set up; I have had the best success on Linux. cd into the src folder and run the following command:

$ python setup.py build_ext --inplace

USAGE

Arguments

The following are arguments for GappedKmerSearch.py, the word searching algorithm for the seeding:

The following are arguments for run_consensus_clusering_using_wm.pl, the hierarchical clustering algorithm for the seeding:

The following are arguments for EXTREME.py, the EXTREME algorithm:

Running EXTREME

Here is an example of running EXTREME on the included ENCODE GM12878 NRSF ChIP-Seq dataset. cd into the ExampleFiles directory. First, we need to generate some seeds:

$ python ../src/fasta-dinucleotide-shuffle.py -f GM12878_NRSF_ChIP.fasta > GM12878_NRSF_ChIP_shuffled.fasta
$ python ../src/GappedKmerSearch.py -l 8 -ming 0 -maxg 10 -minsites 5 GM12878_NRSF_ChIP.fasta GM12878_NRSF_ChIP_shuffled.fasta GM12878_NRSF_ChIP.words
$ perl ../src/run_consensus_clusering_using_wm.pl GM12878_NRSF_ChIP.words 0.3
$ python ../src/Consensus2PWM.py GM12878_NRSF_ChIP.words.cluster.aln GM12878_NRSF_ChIP.wm

The first line generates a dinucleotide-shuffled version of the positive sequence set to serve as a negative sequence set. The second line finds gapped words with two half-sites of length 8, between 0 and 10 universal wildcard gap letters, and at least 5 occurrences in the positive sequence set. The third line clusters the words and outputs the results to GM12878_NRSF_ChIP.words.cluster.aln (run_consensus_clusering_using_wm.pl always writes its results to the input filename with ‘.cluster.aln’ appended). The last line converts the clusters into PFMs, which can be used as seeds for the online EM algorithm. These PFMs are saved in GM12878_NRSF_ChIP.wm. For your own data, you may need to play around with the parameters to get a good set of seeds.
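The four seeding commands above can also be collected into a small Python driver. The following is a minimal sketch and not part of EXTREME: the function name `seeding_commands` and its parameter names are hypothetical, and it assumes the same layout as the example (run from ExampleFiles, scripts in ../src). It only builds the argument lists, mirroring the commands shown above.

```python
def seeding_commands(prefix, half_len=8, min_gap=0, max_gap=10,
                     min_sites=5, cluster_arg=0.3, src="../src"):
    """Build the four seeding commands for a FASTA file named <prefix>.fasta.

    Hypothetical helper mirroring the example commands in this README.
    Returns (argv, stdout_path) pairs; stdout_path is None unless the
    command's standard output has to be redirected to a file.
    """
    fasta = prefix + ".fasta"
    shuffled = prefix + "_shuffled.fasta"
    words = prefix + ".words"
    return [
        # 1. Dinucleotide shuffle: prints to stdout, so it needs a redirect.
        (["python", src + "/fasta-dinucleotide-shuffle.py", "-f", fasta],
         shuffled),
        # 2. Gapped word search on the positive vs. shuffled sequences.
        (["python", src + "/GappedKmerSearch.py",
          "-l", str(half_len), "-ming", str(min_gap), "-maxg", str(max_gap),
          "-minsites", str(min_sites), fasta, shuffled, words],
         None),
        # 3. Clustering; cluster_arg is the second argument (0.3 in the example).
        (["perl", src + "/run_consensus_clusering_using_wm.pl",
          words, str(cluster_arg)],
         None),
        # 4. Convert the clusters into PFM seeds.
        (["python", src + "/Consensus2PWM.py",
          words + ".cluster.aln", prefix + ".wm"],
         None),
    ]
```

Each pair can then be executed in order with the standard library, for example `subprocess.run(argv, check=True)`, passing an open file handle as `stdout=` whenever `stdout_path` is not None.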

Now let’s run the online EM algorithm.

$ python ../src/EXTREME.py GM12878_NRSF_ChIP.fasta GM12878_NRSF_ChIP_shuffled.fasta GM12878_NRSF_ChIP.wm 1

EXTREME.py uses PFM seeds from GM12878_NRSF_ChIP.wm to initialize the online EM algorithm. The last argument tells EXTREME which of these seeds to use. GM12878_NRSF_ChIP.wm should have 23 PFM seeds, so the last argument can be any value between 1 and 23 in this case.
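Since each run of EXTREME.py handles a single seed, trying every seed means one invocation per seed index. The sketch below shows one way to generate those invocations; the helper name `extreme_commands` is hypothetical, and the number of seeds has to be supplied by you (23 in the NRSF example), because this sketch does not read the .wm file.

```python
def extreme_commands(fasta, shuffled, seeds_file, n_seeds, src="../src"):
    """Return one EXTREME.py argument list per seed index (1..n_seeds).

    Hypothetical helper: it mirrors the example command in this README,
    where the last argument selects which seed to use.
    """
    if n_seeds < 1:
        raise ValueError("n_seeds must be at least 1")
    return [
        ["python", src + "/EXTREME.py", fasta, shuffled, seeds_file, str(i)]
        for i in range(1, n_seeds + 1)  # seed indices are 1-based
    ]
```

For the NRSF example, `extreme_commands("GM12878_NRSF_ChIP.fasta", "GM12878_NRSF_ChIP_shuffled.fasta", "GM12878_NRSF_ChIP.wm", 23)` yields 23 argument lists that can be run one after another, for instance with `subprocess.run(argv, check=True)`.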

We have also included an ENCODE K562 DNase-Seq dataset. Try running EXTREME on your own with this dataset. In our publication, we used the parameters l=4, ming=0, maxg=10, minsites=10, zthresh=5 for the word search portion of the seeding. We also used an initial step size of q=0.02. You can think of the initial step size as a sort of "shaking" parameter: a larger initial step corresponds to more vigorous shaking, while a smaller value corresponds to gentler shaking. You can try experimenting with other sets of parameters too. Please keep me updated on what you find.
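To give some intuition for the "shaking" analogy, here is a toy sketch of a generic online update of the form estimate = (1 - step) * estimate + step * observation. This is not EXTREME's update rule or its step-size schedule, neither of which is spelled out in this README; the function name `online_average` and the constant step are assumptions made purely for illustration.

```python
def online_average(observations, step, start=0.0):
    """Track a running estimate with a constant step size.

    Toy illustration only (not EXTREME's actual update): each new
    observation pulls the estimate towards itself by a fraction `step`.
    Returns the estimate after every observation.
    """
    estimate = start
    trace = []
    for x in observations:
        estimate = (1.0 - step) * estimate + step * x
        trace.append(estimate)
    return trace


# Noisy input that alternates between 1 and 0.
noisy = [1.0, 0.0] * 10
vigorous = online_average(noisy, step=0.5, start=0.5)   # large step
gentle = online_average(noisy, step=0.02, start=0.5)    # small step
```

With the large step the estimate swings widely from one observation to the next, while with the small step it stays close to where it started. That difference is the sense in which a larger step size "shakes" the parameters more.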

Output files

EXTREME.py outputs files to a folder with the same name as the seed that the online EM algorithm is initialized from. For example, the first seed in our NRSF example has the name “cluster1”, so all files will be output to the “cluster1” folder.

*/Motif_x.png PNG output of the x-th motif. Includes all motifs, not just the most significant ones (that is, the final result after convergence of any seed).

*/Motif_x.eps Same as above, except in EPS format.

*/MEMEoutput.meme Minimal MEME format output of the discovered motifs. This does not cover all seeds, only the motifs EXTREME selected at the end of a seed search.
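If you want to load the discovered motifs back into Python, a small reader along the following lines may be enough. This is a minimal sketch that assumes the usual minimal MEME layout (a "MOTIF <name>" line, then a "letter-probability matrix: ..." header, then one row of probabilities per motif position); the function name `read_meme_motifs` is hypothetical, and the sketch has not been checked against files actually written by EXTREME.

```python
def read_meme_motifs(lines):
    """Parse motifs from an iterable of lines in minimal MEME format.

    Hypothetical helper. Returns a dict mapping each motif name to a list
    of rows, one row of floats per motif position.
    """
    motifs = {}
    name = None
    in_matrix = False
    for raw in lines:
        fields = raw.split()
        if not fields:
            in_matrix = False  # a blank line ends the current matrix
            continue
        if fields[0] == "MOTIF" and len(fields) > 1:
            name = fields[1]
            motifs[name] = []
            in_matrix = False
        elif fields[0] == "letter-probability" and name is not None:
            in_matrix = True  # matrix rows start on the next line
        elif in_matrix:
            try:
                motifs[name].append([float(v) for v in fields])
            except ValueError:
                in_matrix = False  # a non-numeric line ends the matrix
    return motifs
```

For the NRSF example, something like `with open("cluster1/MEMEoutput.meme") as handle: motifs = read_meme_motifs(handle)` would give you the matrices keyed by motif name.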