uci-cbcl / EXTREME

An online EM implementation of the MEME model for fast motif discovery in large ChIP-Seq and DNase-Seq Footprinting data
GNU General Public License v2.0

run_consensus_clusering_using_wm.pl fails to generate clusters in some cases #2

Closed · khughitt closed this issue 10 years ago

khughitt commented 10 years ago

For some sets of input sequences, the clustering step (handled by run_consensus_clusering_using_wm.pl) is failing without any warning -- only an empty output file is generated.

Here is an example input file:

https://gist.github.com/khughitt/393ad7824d407f849162

It was built from a 2M FASTA file containing 32,456 500 bp sequences, and was tested using the recommended clustering threshold of 0.3.

For other, similar sets of input sequences I am able to run through the entire pipeline and generate a list of consensus motifs, so the issue has something to do with the particular set of input sequences. For this problem, empty output files result for 4 of the 10 sets of sequences.

I tried rerunning the Perl script with similar results, and no warnings or errors were generated at any point.

System info:

EXTREME revision ee505249e630508c99dd87b3985d834bd0b4b6cc

java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.1) (Arch Linux build 7.u65_2.5.1-8-x86_64)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

perl 5, version 20, subversion 0 (v5.20.0) built for x86_64-linux-thread-multi

Python 2.7.8

If you aren't able to reproduce the issue using the above .words file, let me know and I can post a file containing the input sequences.

daquang commented 10 years ago

Thank you for using EXTREME. I am happy to help you troubleshoot this problem.

It appears your .words file has well over 10,000 words. The greedy algorithm in the run_consensus_clusering_using_wm.pl script scales as O(n^3), which makes it very inefficient at this size; if you look at the original Bioinformatics paper, you will see that only about 1,000 words are generated there. The error you are getting is likely due to memory issues.

You may want to try different arguments earlier in the pipeline to reduce the number of words. For example, set minsites to 10 (the default) instead of 5. You may have to play around with the arguments; the right settings will also depend on whether you are looking at ChIP-Seq or DNase-Seq data, and, if the former, on what kind of TF you are looking at. 32,456 500 bp sequences is a very big file; do you mind telling me what kind of dataset you are looking at?
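To put rough numbers on the memory side (an illustrative back-of-the-envelope only; I am assuming here that the distance file holds one similarity entry per word pair, which may not match the exact .dist format):

```perl
#!/usr/bin/env perl
# Rough scaling check: the number of pairwise entries grows
# quadratically with the word count, so ~32k words need roughly
# 1,000x the memory of the ~1,000 words used in the paper.
use strict;
use warnings;

for my $n (1_000, 10_000, 32_456) {
    my $pairs = $n * ($n - 1) / 2;    # unique word pairs
    printf "%6d words -> %.1e pairwise entries\n", $n, $pairs;
}
```

So even before the O(n^3) clustering itself starts, just reading the pairwise matrix into memory can exhaust the Java heap.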

I also noticed a lot of AC repeats in your words file. This is characteristic of enhancer elements. While this may be biologically relevant, these repeats cause a lot of problems for finding the binding preference of TFs. Did you remember to use the masked reference genome (http://www.repeatmasker.org/PreMaskedGenomes.html)?

khughitt commented 10 years ago

@daquang Thanks for the quick response and suggestions!

As you suggested, it appears that the underlying problem is related to memory limitations. After running the commands in run_consensus_clusering_using_wm.pl separately, I came across the following error:

java -Xmx2500m -cp EXTREME/src/motif.jar motif.HierarchicalClustering build/input.words.wm.dist 0.3 1 > build/input.words.cluster

Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on -Dswing.aatext=true -Dswing.defaultlaf=com.sun.java.swing.plaf.gtk.GTKLookAndFeel
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:2694)
        at java.lang.String.<init>(String.java:203)
        at java.io.BufferedReader.readLine(BufferedReader.java:349)
        at java.io.BufferedReader.readLine(BufferedReader.java:382)
        at motif.GeneUtil.readFileToStringArray(GeneUtil.java:269)
        at motif.HierarchicalClustering.readSimMatrix(HierarchicalClustering.java:332)
        at motif.HierarchicalClustering.main(HierarchicalClustering.java:33)
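(For reference, the -Xmx2500m flag caps the JVM heap at 2500 MB, and the trace shows the crash inside readSimMatrix while the entire distance file is being read into memory, which is consistent with your word-count explanation. Rerunning just this step with a larger heap, e.g. -Xmx8g, might get past it, although reducing the number of words is probably the real fix.)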

I am not too familiar with system processes in Perl, but perhaps you could capture the output and return code of each system call, and stop the script with an error message when something like this occurs? That would make it easier to track down similar issues in the future.
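Something along these lines, perhaps (a rough sketch only; run_or_die is a name I made up, not anything in the actual script):

```perl
#!/usr/bin/env perl
# Sketch of the suggestion above: wrap each pipeline system() call so
# a failing step (e.g. the Java OutOfMemoryError) stops the script
# with a visible error instead of silently leaving an empty output file.
use strict;
use warnings;

sub run_or_die {
    my ($cmd) = @_;
    my $status = system($cmd);
    die "Failed to execute: $cmd ($!)\n" if $status == -1;
    my $exit_code = $status >> 8;    # child's exit code lives in the high byte
    die "Step failed with exit code $exit_code: $cmd\n" if $exit_code != 0;
    return;
}

# e.g. instead of a bare system(...) call:
run_or_die("java -Xmx2500m -cp EXTREME/src/motif.jar motif.HierarchicalClustering "
         . "build/input.words.wm.dist 0.3 1 > build/input.words.cluster");
```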

Now that I know this is the issue, I will try experimenting with the parameters to reduce the size of the problem, or break it up into parts. The minsites parameter seems like a good place to start in this regard, so I will give that a shot.

The input sequence file is not quite as large as I originally suggested: I am looking at ~4,000 sequences (I accidentally reported the number of lines, 32,456), and I think I can limit even that further.

The basic problem I'm working on is actually neither ChIP-Seq nor DNase-Seq related; instead, it concerns the identification of motifs bound by RNA-binding proteins. This is perhaps not what EXTREME was designed for, but it seemed like it should also be useful for this type of problem. In my case, I am looking in the UTRs of clusters of co-expressed genes in a group of organisms (Trypanosomatids) that are thought to rely primarily on post-transcriptional regulation of gene expression.

Initially, I plan to cast as wide a net as possible to avoid ruling out anything that may be relevant. Repeat masking is probably a good way to drill down further, though, once I can rule out any role for the repeat regions in this case. It doesn't look like the organisms I work with are on repeatmasker.org, but I imagine there are a number of tools that can help generate masked versions of a genome.

Thanks again for the feedback and suggestions!

daquang commented 10 years ago

Glad to be of help! That is indeed strange; whenever I have memory issues with the Perl scripts, I do get the Java out-of-memory output. I will look into this.

Yes, you are correct: EXTREME has never really been tested on UTRs in a search for RNA-binding protein motifs. I am curious what the results might look like. Please keep me updated.

khughitt commented 10 years ago

Will do! Let me know if there is anything I can do to help as far as testing goes. I would be glad to keep you posted on our progress as well. It turns out there was an issue in the code I wrote to generate the input sequences, so the problem size should be much more reasonable (~50-150 sequences instead of ~4,000).