rdpstaff / Framebot

Dynamic programming based frame shift detection and correction tool with nearest neighbor classification.
GNU General Public License v3.0

Failure being caused by certain input sequences. #2

Open passdan opened 10 years ago

passdan commented 10 years ago

I am getting the following error with certain sequences. When I remove the offending sequence, processing continues normally, but I can't see what triggers the error, since the input has exactly the same size and format as the other sequences. No sequences are written to the 'failed' output file.

My input is ~7000 rbcL gene nucleotide sequences being corrected against ~170 NCBI protein references, as follows:

java -jar ~/programs/RDPTools/FrameBot.jar framebot  -N -o test bac_protein.fas rbcl_nucl.fas

Sequence input example (failure causing):

>DTM44_641
CGTTTTTTAAATTGTATGGAAGGTATTAACCGTGCTGCAGCTGCAACAGGTGAAGTTAAAGGTTCTTACTTAAACGTTACTGCAGCGACTATGGAAGAAGTACTTAAACGCTGTGAATATGCAAAAGAAGTCGGTTCTATTATTGTTATGATCGATTTAGTTATGGGTTATACAGCAATTCAAAGTGCTGCAATCTGGGCTCGTGACAACGATATGCTTTTACATTTACACCGTGCCGGTAACTCTACTTATGCACGTCAAAAAAGTCATGGTATTAATTTCCGTGTAATCTGTAAATGGATGCGTATGTCTGGTGTTGATCATATTCACGCTGGTACAGTTGTAGGTAAATTAGAAGGTGATCCTTTAATGATTAAAGGTTTCTATGATACTTTACGTTTAACAAAATTTAGAGGTTAATTTACCTTATGGTATTTTTCTTCGAAAGTGACATGGGCAAGTTTACGCCGTTGTATGCCTGTTGCATCTGGTGGTATTCATTGTGGTCAAATGCATCAATTAGTTCACTATTTAGGTGATGATGTAATAT

Error Message:

Exception in thread "main" java.lang.IllegalArgumentException: Cannot score R, ?
        at edu.msu.cme.rdp.alignment.pairwise.ScoringMatrix.score(ScoringMatrix.java:180)
        at edu.msu.cme.rdp.framebot.core.FramebotCore.computeMatrix(FramebotCore.java:81)
        at edu.msu.cme.rdp.framebot.core.FramebotCore.processSequence(FramebotCore.java:67)
        at edu.msu.cme.rdp.framebot.cli.FramebotMain.framebotItUp_prefilter(FramebotMain.java:136)
        at edu.msu.cme.rdp.framebot.cli.FramebotMain.main(FramebotMain.java:381)
        at edu.msu.cme.rdp.framebot.cli.Main.main(Main.java:48)

passdan commented 10 years ago

On further investigation, the failure-causing sequences are the ones most divergent from the references. It seems that more than 20 base differences (out of ~500) cause the above failure.

Shouldn't these sequences go into failure.txt rather than crashing the process?

rdpstaffmsu commented 9 years ago

Hello! Sorry for the late reply; our notifications seem to be acting up. Hopefully we can still help. We cannot fully test the issue you were having without the same protein reference file you used. Would you be willing to share it with us?

rdpstaffmsu commented 7 years ago

Would you mind sending us (rdpstaff@msu.edu) both the reference and query sets to check? Thank you.

Benli

On Sat, Mar 25, 2017 at 7:30 AM, zoubinok notifications@github.com wrote:

Hello. As I have a series of sequences to process, I am trying to use FrameBot locally. Unfortunately, I cannot use a protein file as the reference, although this can be used on the web FrameBot. But the index works. So I used the dataset in the RDP pipeline. It has the same problem. How can I deal with it? Thank you.

Error Message:

java -jar /home/server/RDPTools/FrameBot.jar framebot -o amoaaob_test amoA_protref.fasta /home/server/RDPTools/Xander assembler/gene_resource/amoA_AOB/originaldata/nucl.fa

Exception in thread "main" java.util.zip.ZipException: Not in GZIP format
        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:78)
        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90)
        at edu.msu.cme.rdp.framebot.index.FramebotIndex.readExternalIndex(FramebotIndex.java:186)
        at edu.msu.cme.rdp.framebot.cli.FramebotMain.main(FramebotMain.java:474)
        at edu.msu.cme.rdp.framebot.cli.Main.main(Main.java:50)


-- RDP Staff Ribosomal Database Project Center for Microbial Ecology Michigan State University 567 Wilson Rd. Room 2225 A East Lansing, MI 48824 (517) 353-3842
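The trace quoted above shows `FramebotIndex.readExternalIndex` opening a file through `GZIPInputStream`, so the first argument was apparently being read as a gzip-compressed index rather than as plain FASTA. The command at the top of the thread passes `-N` and this one does not; whether that difference is the cause is an assumption, not something confirmed in this thread. A minimal sketch for checking whether a given file is gzip data at all (the helper name `looks_gzipped` is hypothetical):

```python
# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(path):
    """Return True if the file at `path` begins with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC
```

Calling `looks_gzipped("amoA_protref.fasta")` on a plain-text FASTA file returns `False`, which would be consistent with the "Not in GZIP format" exception.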

wichne commented 7 years ago

Hi. I ran into a similar problem. One of my sequences would fail with:

Exception in thread "main" java.lang.IllegalArgumentException: Cannot score V, O
        at edu.msu.cme.rdp.alignment.pairwise.ScoringMatrix.score(ScoringMatrix.java:180)
        at edu.msu.cme.rdp.framebot.core.FramebotCore.computeMatrix(FramebotCore.java:81)
        at edu.msu.cme.rdp.framebot.core.FramebotCore.processSequence(FramebotCore.java:67)
        at edu.msu.cme.rdp.framebot.cli.FramebotMain.framebotItUp_prefilter(FramebotMain.java:165)
        at edu.msu.cme.rdp.framebot.cli.FramebotMain.main(FramebotMain.java:496)
        at edu.msu.cme.rdp.framebot.cli.Main.main(Main.java:50)

It turns out my framebot.fa seed file had a sequence with an illegal character in it ("O"). The solution was to fix the sequence, although I suppose removing the sequence, or just the illegal character, would also work. It is interesting that this caused a failure with only one of the many sequences tested.

Cheers, Bill
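Both "Cannot score" errors in this thread name a character ("?" and "O") that is not a standard amino acid code, so scanning the protein reference file before running FrameBot may locate the offending entry. The following is a minimal sketch, not part of FrameBot. The function name `find_illegal_residues` and the `ALLOWED` alphabet are assumptions; the set of characters FrameBot's scoring matrix actually accepts may differ.

```python
# Assumption: the 20 standard amino acids plus the ambiguity codes B, Z, X
# and the stop symbol '*'. Adjust to match the scoring matrix in use.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" + "BZX*")


def find_illegal_residues(handle, allowed=ALLOWED):
    """Yield (sequence_id, position, character) for each residue not in `allowed`.

    `handle` is any iterable of FASTA lines; positions are 1-based and
    counted per sequence across wrapped lines.
    """
    seq_id, pos = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence ID is the first whitespace-delimited header token.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            pos = 0
            continue
        for ch in line:
            pos += 1
            if ch.upper() not in allowed:
                yield seq_id, pos, ch
```

For example, `with open("framebot.fa") as fh: print(list(find_illegal_residues(fh)))` lists each suspect residue with its sequence ID and position, so it can be corrected or removed.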