itmat / rum

RNA-Seq Unified Mapper
http://cbil.upenn.edu/RUM
MIT License

Erroneous RUM_NU chunk #155

Closed safisher closed 11 years ago

safisher commented 11 years ago

Some of my RUM_NU.XX files are not properly formatted. I'm using 2.0.3_04. So far I'm only seeing this with some chunks in one sample.

Here is the error message from the chunk's log file:

Mon Dec 3 13:45:03 2012 31586 FATAL RUM::Death - ERROR: in script rum2sam.pl: the first line of the file '/local16/fisher/e.53/Sample_H0D11/rum.trim/chunks/RUM_NU.21' is misformatted, it does not look like a RUM output file.

Mon Dec 3 13:45:03 2012 24315 FATAL RUM::Death - Error running "perl /home/fisher/local/src/RUM-Pipeline-v2.0.3_04/bin/../bin/rum2sam.pl --genome-in /lab/repo/resources/rum2/drosophila/drosophila_genome_one-line-seqs.fa --unique-in /local16/fisher/e.53/Sample_H0D11/rum.trim/chunks/RUM_Unique.21 --non-unique-in /local16/fisher/e.53/Sample_H0D11/rum.trim/chunks/RUM_NU.21 --reads-in /local16/fisher/e.53/Sample_H0D11/rum.trim/chunks/reads.fa.21 --quals-in /local16/fisher/e.53/Sample_H0D11/rum.trim/chunks/quals.fa.21 --sam-out /local16/fisher/e.53/Sample_H0D11/rum.trim/chunks/RUM.sam.21.tmp.IFIpabw3 "

The stderr from that command is

ERROR: in script rum2sam.pl: the first line of the file '/local16/fisher/e.53/Sample_H0D11/rum.trim/chunks/RUM_NU.21' is misformatted, it does not look like a RUM output file.

The error log file Sample_H0D11/rum.trim/log/rum_errors_21.log may have more details.

Here is the head of the RUM_NU.21 file:

TTGCAAA seq.43478182b chrM 12798-12893 + ATATCACAATTTTTTAAAGATAGAAACCAACCTGGCTTACACCTGTTTGAACTCAGATCATGTAAGAATTTAAAAGTCGAACAGACTTAAAATTTG GGCGCTGTG seq.43478182b chrU 5301326-5301421 + ATATCACAATTTTTTAAAGATAGAAACCAACCTGGCTTACACCTGTTTGAACTCAGATCATGTAAGAATTTAAAAGTCGAACAGACTTAAAATTTG seq.43478195b chrUextra 12749462-12749557 + AATCATTAACGTTATACGGGCCTGGCACCCTCTATGGGTAAATGGCCTCATTTAAGAAGGACTTAAATCGTTAATTTCTCATACTAGAATATTGAC seq.43478195b chrUextra 10617190-10617285 - GTCAATATTCTAGTATGAGAAATTAACGATTTAAGTCCTTCTTAAATGAGGCCATTTACCCATAGAGGGTGCCAGGCCCGTATAACGTTAATGATT seq.43478195b chrUextra 12262579-12262674 + AATCATTAACGTTATACGGGCCTGGCACCCTCTATGGGTAAATGGCCTCATTTAAGAAGGACTTAAATCGTTAATTTCTCATACTAGAATATTGAC seq.43478195b chrUextra 11390252-11390347 + AATCATTAACGTTATACGGGCCTGGCACCCTCTATGGGTAAATGGCCTCATTTAAGAAGGACTTAAATCGTTAATTTCTCATACTAGAATATTGAC seq.43478195b chrUextra 8857124-8857219 + AATCATTAACGTTATACGGGCCTGGCACCCTCTATGGGTAAATGGCCTCATTTAAGAAGGACTTAAATCGTTAATTTCTCATACTAGAATATTGAC seq.43478195b chrUextra 8717733-8717828 + AATCATTAACGTTATACGGGCCTGGCACCCTCTATGGGTAAATGGCCTCATTTAAGAAGGACTTAAATCGTTAATTTCTCATACTAGAATATTGAC

mdelaurentis commented 11 years ago

It looks like there are stray sequence fragments on their own lines; is that correct? Is the first line of the file just "TTGCAAA"? And is the fourth line "GGCGCTGTG" by itself, or did that line get broken somewhere in the process of submitting the issue?

It might help me debug it if you can get me the input files for the sample that is having these errors. If it's very large, even a fragment of the input that contains some of the erroneous reads would be helpful.

If the first record for some of the chunks is corrupted, that seems like it might point to an error with the filesystem or something like that, which we can perhaps handle more gracefully. However, if there are corrupted records scattered throughout the files, that would probably indicate a proper bug.
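To tell those two cases apart, something like the following could be run over a suspect chunk. This is only a rough sketch in Python, not part of RUM: the record layout (read ID, chromosome, spans, strand, sequence, separated by tabs) is an assumption inferred from the sample lines in this issue, and the names is_rum_record, find_malformed and summarize are made up for illustration.

```python
import re

# Assumed record layout, inferred from the sample in this issue
# (not taken from the RUM source): five or more tab-separated fields:
# read ID, chromosome, spans, strand, sequence.
READ_ID = re.compile(r"seq\.\d+[ab]?$")
SPANS = re.compile(r"\d+-\d+(, \d+-\d+)*$")


def is_rum_record(line):
    """Return True if the line looks like a RUM alignment record."""
    fields = line.rstrip("\n").split("\t")
    return (
        len(fields) >= 5
        and READ_ID.match(fields[0]) is not None
        and SPANS.match(fields[2]) is not None
        and fields[3] in ("+", "-")
    )


def find_malformed(lines):
    """Return (line_number, text) for every line that fails the check."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if not is_rum_record(line)
    ]


def summarize(bad):
    """Classify the result of find_malformed for a single chunk."""
    if not bad:
        return "ok"
    if [number for number, _ in bad] == [1]:
        return "first line only"
    return "bad records beyond the first line"


# Hypothetical usage on one chunk file:
#   with open("RUM_NU.21") as handle:
#       print(summarize(find_malformed(handle)))
```

If every failing chunk reports "first line only", that would fit the filesystem theory; bad records further into the file would point more toward a bug in whatever writes the RUM_NU files.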


mdelaurentis commented 11 years ago

I believe we were unable to pin down a cause, and we saw this behavior occur randomly in jobs run on the same infrastructure. I think we can assume it was a filesystem error, and I don't think we're going to go to the effort of trying to recover in the face of those kinds of errors.