Closed khughitt closed 10 years ago
Thank you for using EXTREME. I am happy to help you troubleshoot this problem.
It appears your .words file has well over 10,000 words. The greedy algorithm in the run_consensus_clusering_using_wm.pl script scales as O(n^3), which makes it very inefficient at that size. If you look at the original Bioinformatics paper, you will see that only about 1,000 words are generated. The error you are getting is likely due to memory issues. You may want to try different arguments earlier in the pipeline to reduce the number of words. For example, set minsites to 10 (the default) instead of 5. You may have to play around with the arguments. It will also depend on whether you are looking at ChIP-Seq or DNase-Seq data and, if the former, what kind of TF you are looking at. A file of 32,456 500bp sequences is very big. Do you mind telling me what kind of dataset you are looking at?
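To put the scaling in rough numbers, here is a minimal sketch. It assumes the clustering cost grows as n^3 and that the pairwise distance file holds one entry per ordered pair of words. The function names and the 1,000-word baseline are illustrative only and are not part of EXTREME.

```python
def pairwise_entries(n_words: int) -> int:
    """Number of entries in a full n x n distance matrix (assumed layout)."""
    return n_words * n_words


def relative_cubic_cost(n_words: int, baseline: int = 1000) -> float:
    """Cost relative to a baseline run, assuming O(n^3) scaling."""
    return (n_words / baseline) ** 3


if __name__ == "__main__":
    # Compare a ~1,000-word run with a ~10,000-word run.
    for n in (1_000, 10_000):
        print(n, pairwise_entries(n), relative_cubic_cost(n))
```

Under these assumptions, going from 1,000 to 10,000 words multiplies the number of matrix entries by 100 and the clustering work by roughly 1,000, which is why reducing the word count early in the pipeline matters so much.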
I also noticed a lot of AC repeats in your words file. This is characteristic of enhancer elements. While this may be biologically relevant, these repeats cause a lot of problems for finding the binding preference of TFs. Did you remember to use the masked reference genome (http://www.repeatmasker.org/PreMaskedGenomes.html)?
@daquang Thanks for the quick response and suggestions!
As you suggested, it appears that the underlying problem is related to memory limitations. After running the commands in run_consensus_clusering_using_wm.pl separately, I came across the following error:
java -Xmx2500m -cp EXTREME/src/motif.jar motif.HierarchicalClustering build/input.words.wm.dist 0.3 1 > build/input.words.cluster
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on -Dswing.aatext=true -Dswing.defaultlaf=com.sun.java.swing.plaf.gtk.GTKLookAndFeel
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.io.BufferedReader.readLine(BufferedReader.java:349)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at motif.GeneUtil.readFileToStringArray(GeneUtil.java:269)
at motif.HierarchicalClustering.readSimMatrix(HierarchicalClustering.java:332)
at motif.HierarchicalClustering.main(HierarchicalClustering.java:33)
I am not too familiar with system processes in Perl, but perhaps you could capture the output / return code for each of the system calls and stop execution of the script and print the error message if something like this occurs? This would make it easier to track down similar issues in the future.
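As a rough illustration of that idea (written in Python rather than Perl, and not taken from the EXTREME scripts), the wrapper below runs one pipeline step, captures its output, and stops with the step's error text if the exit status is nonzero. The name `run_step` is hypothetical. It also assumes the failing step exits with a nonzero status, which is what a JVM normally does after an uncaught exception such as the OutOfMemoryError above.

```python
import subprocess
from typing import List


def run_step(cmd: List[str]) -> str:
    """Run one pipeline step and return its stdout.

    Raises RuntimeError, including the captured stderr, if the
    command exits with a nonzero status.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"step failed (exit {result.returncode}): {' '.join(cmd)}\n"
            f"{result.stderr}"
        )
    return result.stdout
```

In Perl, the analogous check would be to test the return value of each `system` call and `die` when it is nonzero, instead of continuing on to write an empty output file.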
Now that I know this is the issue, I will try experimenting more with the parameters to reduce the size of the problem, or break it up into parts. The minsites parameter seems like a pretty good place to start in this regard, so I will give that a shot.
The input sequence file is not quite as large as I originally suggested -- I am looking at ~4000 sequences (I accidentally reported the number of lines, 32,456), and I think I should be able to limit even that further.
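A quick way to avoid mixing up the two numbers is to count header lines separately from total lines. This is a minimal sketch that assumes a standard FASTA layout in which each record starts with a `>` header line. The function name `count_fasta` is illustrative.

```python
import io
from typing import TextIO, Tuple


def count_fasta(handle: TextIO) -> Tuple[int, int]:
    """Return (number of records, number of lines) for FASTA text."""
    records = 0
    lines = 0
    for line in handle:
        lines += 1
        # Each FASTA record begins with a '>' header line.
        if line.startswith(">"):
            records += 1
    return records, lines


if __name__ == "__main__":
    example = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nTTGA\n")
    print(count_fasta(example))  # (2, 5)
```

For a real file, pass an open file handle instead of the in-memory example.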
The basic problem I'm working on is actually neither ChIP-Seq nor DNase-Seq related, but instead related to the identification of motifs bound by RNA binding proteins. This is perhaps not what EXTREME was designed for, but it seemed like it should also be useful for this type of problem. In my case, I am attempting to look in the UTRs of clusters of co-expressed genes in a group of organisms (Trypanosomatids) which are thought to rely primarily on post-transcriptional regulation of gene expression. Initially, I plan to cast as wide a net as possible to avoid ruling out anything that may be relevant. Repeat masking is probably a way to drill down further, though, once I can rule out any role of the repeat regions in this case. It doesn't look like the organisms I work with are on repeatmasker.org, but I imagine there are a bunch of tools which can help to generate masked versions of a genome.
Thanks again for the feedback and suggestions!
Glad to be of help! That is indeed strange. Whenever I have memory issues with the Perl scripts, I do get the Java memory outputs. I will look into this.
Yes, you are correct, EXTREME has never really been tested before on UTRs to look for RNA binding protein motifs. I am curious as to what the results might look like. Please keep me updated.
Will do! Let me know if there is anything I can do to help as far as testing goes. I would be glad to keep you posted on our progress as well. It turns out there was an issue in the code I wrote to generate the input sequences, so the problem should be much more reasonable (~50-150 sequences instead of 4000).
For some sets of input sequences, the clustering step (handled by run_consensus_clusering_using_wm.pl) is failing without any warning -- only an empty output file is generated. Here is an example input file:
Which was built from a 2M FASTA file containing 32,456 500bp sequences. This was tested using the recommended clustering threshold of 0.3.
For other similar sets of input sequences, I am able to run through the entire pipeline and generate a list of consensus motifs, so the issue has something to do with the set of input sequences. For this particular problem, empty files result for 4 of the 10 sets of sequences.
I tried rerunning the Perl script with similar results, and no warnings or errors were generated at any point.
System info:
If you aren't able to reproduce the issue using the above .words file, let me know and I can post a file containing the input sequences.