Closed by taltman 9 years ago
Could you send me a test input file that generates this error?
Thanks!
On Sun, Apr 26, 2015 at 8:24 PM, Tomer Altman notifications@github.com wrote:
I let MC decide the file type and the FASTQ quality score encoding, so not sure how this happened.
Any help in figuring this out would be great. Thanks!
taltman1@corn02:/dev/shm/taltman1_tmp/MicrobeCensus$ time run_microbe_census.py -n 40711 -l 500 -t 16 my.fastq test.out
Traceback (most recent call last):
  File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/scripts/run_microbe_census.py", line 48, in <module>
    est_ags, args = microbe_census.run_pipeline(args)
  File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/microbe_census/microbe_census.py", line 480, in run_pipeline
    process_seqfile(args, paths)
  File "/afs/ir/users/t/a/taltman1/farmshare/third-party/bin/MicrobeCensus/MicrobeCensus-1.0.3/microbe_census/microbe_census.py", line 273, in process_seqfile
    for rec in parse(open_file(args['seqfile']), args['fastq_format'] if args['file_type'] == 'fastq' else 'fasta'):
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 582, in parse
    for r in i:
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 1033, in FastqPhredIterator
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 922, in FastqGeneralIterator
    raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.

real 0m9.063s
user 0m8.930s
sys 0m0.169s
— Reply to this email directly or view it on GitHub https://github.com/snayfach/MicrobeCensus/issues/4.
Here is the offending sequence in your dataset:
@SRR172902.422002 NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC +SRR172902.422002 ltrim=1
The quality and sequence headers must be the same, otherwise BioPython throws an error.
Well, the same, modulo the first char, right? :-)
Thanks for catching this. I will pass along the error to the SPAdes team, as I used their read corrector for trimming. Though, based on the looks of that read, I'm having my doubts...
The odd thing is that I ran this file through DIAMOND as well, without any complaints. I guess the BioPython parser is strict.
I've modified the code so that this should no longer be an issue. Could you try pulling the latest code?
Not that Wikipedia is authoritative, but:
https://en.wikipedia.org/wiki/FASTQ_format
"Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again."
It indeed looks like the BioPython parser is needlessly strict.
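To make the strict versus lenient distinction concrete, here is a minimal standard-library sketch of a four-line FASTQ reader. It is not Biopython's code and not what MicrobeCensus does. The function name `read_fastq`, the `strict` flag, and the sample record (including its quality string) are all hypothetical. With `strict=True` it mimics the check that raised the `ValueError` above. With `strict=False` it ignores whatever follows the `+`, which is how the Wikipedia description reads.

```python
import io


def read_fastq(handle, strict=True):
    """Yield (title, sequence, quality) from a FASTQ handle.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    while True:
        title_line = handle.readline()
        if not title_line:
            return  # end of file
        if not title_line.strip():
            continue  # tolerate blank lines between records
        title = title_line.rstrip("\n")
        seq = handle.readline().rstrip("\n")
        plus_line = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus_line.startswith("+"):
            raise ValueError("Malformed FASTQ record near %r" % title)
        caption = plus_line[1:]
        # Strict mode: a non-empty '+' caption must repeat the title,
        # i.e. match it apart from the leading '@' / '+' character.
        if strict and caption and caption != title[1:]:
            raise ValueError("Sequence and quality captions differ.")
        yield title[1:], seq, qual


# Hypothetical record shaped like the offending one: the '+' line
# carries an extra "ltrim=1" that the '@' line lacks.
record = "@read1\nNCCC\n+read1 ltrim=1\n!###\n"
lenient = list(read_fastq(io.StringIO(record), strict=False))
```

In lenient mode the record parses normally. In strict mode the same input raises, which would explain why a tool with a more permissive parser accepted the file without complaint.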
I installed from the tarball rather than cloning. I'll try cloning now.
Now I get this error, with exit status 1, and no output:
Warning: sequence record could not be parsed from input file. Skipping...
Error! No reads remaining after filtering!
What command did you use to run it? Did you use defaults?
Exact same as in original post. No changes.
Does it run to completion for you?
You've specified 500 bp reads (-l 500), but the input file contains only short reads. If you remove -l 500, and let MicrobeCensus pick the read length to use, it should work.
Also, you specified 40,711 reads, but in general you will need more reads than this to get an accurate estimate of AGS. I'd suggest at least 500,000. But I can understand using fewer reads just for testing.
If you try running the program again using default parameters (at least removing -l 500) it should run to completion.
I specified -l 500, because my reads have already been trimmed, and I'd rather not have MicrobeCensus re-trim my trimmed reads. I read the option documentation as meaning: any reads longer than 500 will be trimmed to 500. Is there a different way to achieve this?
As for the low # of reads, that was a mistake. Sorry to bother you.
I can confirm that the program now works. Excellent!
I did get this line in the terminal, though: "Warning: sequence record could not be parsed from input file. Skipping..." Not exactly sure what that means. It might be helpful to specify the line number for the offending input, along with the FAST{A|Q} identifier.
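As an illustration of that suggestion, the sketch below skips unparseable records and reports the line where each one starts, together with its identifier. It is an assumption-laden example, not MicrobeCensus code. The function name `parse_with_warnings` and the sample data are hypothetical, and it assumes every record occupies exactly four lines, so a record with a missing line would throw off the records that follow.

```python
import io
import sys


def parse_with_warnings(handle, warn=sys.stderr):
    """Yield (title, seq, qual); skip bad records and say where they start."""
    line_no = 0
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return  # end of file
        start = line_no + 1  # 1-based line number of the '@' line
        line_no += 4
        title, seq, plus, qual = (line.rstrip("\n") for line in lines)
        bad = (
            not title.startswith("@")
            or not plus.startswith("+")
            or len(seq) != len(qual)
        )
        if bad:
            warn.write(
                "Warning: record at line %d (%r) could not be parsed. "
                "Skipping...\n" % (start, title)
            )
            continue
        yield title[1:], seq, qual


# Hypothetical input: the second record has a quality string that is too short.
data = "@r1\nACGT\n+\n!!!!\n@r2\nACGT\n+\n!!\n@r3\nGG\n+\n##\n"
log = io.StringIO()
records = list(parse_with_warnings(io.StringIO(data), warn=log))
```

Here the warning for `@r2` points at line 5 of the input, which is enough to find the record with a text editor or `sed -n`.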
MicrobeCensus trims reads to a uniform length, because it uses read-length specific parameters when estimating AGS. The documentation should read: all reads are trimmed to this length, and reads shorter than this length are discarded.
You can use the verbose flag (-v) to get a better sense of what the software is actually doing at each step. It might help things make more sense.
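The trimming rule described above (trim every read to one length, discard anything shorter) can be sketched in a few lines. This is only an illustration of the stated behavior, not the actual MicrobeCensus implementation. The function name `trim_to_uniform_length` and the sample reads are hypothetical.

```python
def trim_to_uniform_length(reads, length):
    """Trim each (title, seq, qual) read to `length`; drop shorter reads."""
    for title, seq, qual in reads:
        if len(seq) < length:
            continue  # too short: discarded rather than kept as-is
        yield title, seq[:length], qual[:length]


# Hypothetical reads: one of 8 bp and one of 3 bp.
reads = [("a", "ACGTACGT", "!!!!!!!!"), ("b", "ACG", "!!!")]
kept_at_5 = list(trim_to_uniform_length(reads, 5))
kept_at_500 = list(trim_to_uniform_length(reads, 500))
```

With a target length of 5, only read `a` survives, trimmed to `ACGTA`. With a target of 500, nothing survives, which matches the "No reads remaining after filtering" outcome when -l 500 is applied to short reads.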
Thanks for the advice! I'll add that.
I've finally fixed this issue in MicrobeCensus (v1.1.0). The program should no longer crash when sequence and quality captions differ.