snayfach / MicrobeCensus

MicrobeCensus estimates the average genome size of microbial communities from metagenomic data
http://genomebiology.com/2015/16/1/51
GNU General Public License v3.0

MicrobeCensus crashes, "ValueError: Sequence and quality captions differ." #4

Closed taltman closed 9 years ago

taltman commented 9 years ago

I let MC decide the file type and the FASTQ quality score encoding, so not sure how this happened.

Any help in figuring this out would be great. Thanks!

taltman1@corn02:/dev/shm/taltman1_tmp/MicrobeCensus$ time run_microbe_census.py -n 40711 -l 500 -t 16 my.fastq test.out
Traceback (most recent call last):
  File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/scripts/run_microbe_census.py", line 48, in <module>
    est_ags, args = microbe_census.run_pipeline(args)
  File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/microbe_census/microbe_census.py", line 480, in run_pipeline
    process_seqfile(args, paths)
  File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/microbe_census/microbe_census.py", line 273, in process_seqfile
    for rec in parse(open_file(args['seqfile']), args['fastq_format'] if args['file_type'] == 'fastq' else 'fasta'):
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 582, in parse
    for r in i:
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 1033, in FastqPhredIterator
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 922, in FastqGeneralIterator
    raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.

real 0m9.063s
user 0m8.930s
sys 0m0.169s

snayfach commented 9 years ago

Could you send me a test input file that generates this error?

Thanks!

snayfach commented 9 years ago

Here is the offending sequence in your dataset:

@SRR172902.422002
NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR172902.422002 ltrim=1
.4455543555555555555654554455545554445555334346344554445555555556555555555

The quality and sequence headers must be the same; otherwise, BioPython throws an error.
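To make the rule concrete, here is a minimal sketch of the comparison that trips on this record. It is an approximation written for illustration, not BioPython's own code: `captions_match` is a hypothetical helper, and it assumes the '+' line is either bare or repeats the title exactly.

```python
def captions_match(title_line, plus_line):
    """Return True if the '+' line is bare or repeats the '@' title.

    Hypothetical helper for illustration only; it is not part of
    BioPython or MicrobeCensus.
    """
    if not title_line.startswith("@") or not plus_line.startswith("+"):
        raise ValueError("expected an '@' title line and a '+' line")
    title = title_line[1:].rstrip()
    second_title = plus_line[1:].rstrip()
    # A bare '+' is accepted; a repeated caption must match exactly.
    return second_title == "" or second_title == title


# The record above: the '+' line carries an extra " ltrim=1".
print(captions_match("@SRR172902.422002", "+SRR172902.422002 ltrim=1"))  # False
# The same record with a bare '+' line would pass.
print(captions_match("@SRR172902.422002", "+"))  # True
```

Under this reading, either dropping the caption from the '+' line or making it identical to the title would let a strict parser accept the record.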

taltman commented 9 years ago

Well, the same, modulo the first char, right? :-)

Thanks for catching this. I will pass along the error to the SPAdes team, as I used their read corrector for trimming. Though, based on the looks of that read, I'm having my doubts...

taltman commented 9 years ago

The odd thing is that I ran this file through DIAMOND as well, without any complaints. I guess the BioPython parser is strict.

snayfach commented 9 years ago

I've modified the code so that this should no longer be an issue. Could you try pulling the latest code?

taltman commented 9 years ago

Not that Wikipedia is authoritative, but:

https://en.wikipedia.org/wiki/FASTQ_format

"Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again."

It indeed looks like the BioPython parser is needlessly strict.

taltman commented 9 years ago

I installed from the tarball rather than cloning. I'll try cloning now.

taltman commented 9 years ago

Now I get this error, with exit status 1, and no output:

Warning: sequence record could not be parsed from input file. Skipping...
Error! No reads remaining after filtering!

snayfach commented 9 years ago

What command did you use to run it? Did you use defaults?

taltman commented 9 years ago

Exact same as in original post. No changes.

taltman commented 9 years ago

Does it run to completion for you?

snayfach commented 9 years ago

You've specified 500 bp reads (-l 500), but the input file contains only short reads. If you remove -l 500, and let MicrobeCensus pick the read length to use, it should work.

Also, you specified 40,711 reads, but in general you will need more reads than this to get an accurate estimate of AGS. I'd suggest at least 500,000. But I can understand using fewer reads just for testing.

If you try running the program again using default parameters (at least removing -l 500) it should run to completion.

taltman commented 9 years ago

I specified -l 500, because my reads have already been trimmed, and I'd rather not have MicrobeCensus re-trim my trimmed reads. I read the option documentation as meaning: any reads longer than 500 will be trimmed to 500. Is there a different way to achieve this?

As for the low # of reads, that was a mistake. Sorry to bother you.

taltman commented 9 years ago

I can confirm that the program now works. Excellent!

I did get this line in the terminal, though:

Warning: sequence record could not be parsed from input file. Skipping...

I'm not exactly sure what that means. It might be helpful to specify the line number of the offending input, along with the FAST{A|Q} identifier.
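Something along these lines is what I have in mind. This is only a rough sketch and not MicrobeCensus code: `read_fastq_lenient` is a made-up name, it assumes plain four-line FASTQ records (no wrapped sequences), and it ignores whatever follows the '+' on line 3.

```python
import io
import sys


def read_fastq_lenient(handle):
    """Yield (title, sequence, quality) tuples, skipping malformed records.

    Hypothetical helper: assumes four-line records and reports the
    starting line number and title of every record it skips.
    """
    line_number = 0
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return  # end of file
        start = line_number + 1
        line_number += 4
        title, seq, plus, qual = (line.rstrip("\r\n") for line in lines)
        problem = None
        if not title.startswith("@"):
            problem = "title line does not start with '@'"
        elif not plus.startswith("+"):
            problem = "third line does not start with '+'"
        elif len(seq) != len(qual):
            problem = "sequence and quality lengths differ"
        if problem:
            print("Warning: skipping record at line %d (%s): %s"
                  % (start, title[:40], problem), file=sys.stderr)
            continue
        yield title[1:], seq, qual


# Two records: r1 has a mismatched '+' caption (tolerated here),
# r2 has a quality string that is too short (skipped with a warning).
data = "@r1\nACGT\n+r1 ltrim=1\nIIII\n@r2\nACGT\n+\nII\n"
records = list(read_fastq_lenient(io.StringIO(data)))
print(records)  # [('r1', 'ACGT', 'IIII')]
```

One caveat: because it reads in fixed blocks of four lines, a record with a missing line would throw off every record after it, so a real implementation would need to resynchronise on the next '@' title.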

snayfach commented 9 years ago

MicrobeCensus trims reads to a uniform length, because it uses read-length specific parameters when estimating AGS. The documentation should read: all reads are trimmed to this length, and reads shorter than this length are discarded.
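As a rough illustration of that rule (a sketch of the documented behaviour, not the actual MicrobeCensus implementation; `trim_reads` is a hypothetical name):

```python
def trim_reads(reads, read_length):
    """Yield each read cut to read_length; drop reads that are shorter."""
    for read in reads:
        if len(read) < read_length:
            continue  # shorter than the target length: discarded
        yield read[:read_length]


reads = ["ACGTACGTAC", "ACGT", "ACGTAC"]
print(list(trim_reads(reads, 6)))    # ['ACGTAC', 'ACGTAC']
print(list(trim_reads(reads, 500)))  # []
```

The second call shows why asking for 500 bp from a file of short reads leaves nothing after filtering.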

You can use the verbose flag (-v) to get a better sense of what the software is actually doing at each step. It might help things make more sense.

snayfach commented 9 years ago

Thanks for the advice! I'll add that.

snayfach commented 7 years ago

I've finally fixed this issue in MicrobeCensus (v1.1.0). The program should no longer crash when sequence and quality captions differ.