Closed kennyyeo13 closed 2 years ago
Hi
The FASTQ input file format is invalid. The third line in each entry, which starts with a plus sign (+), should either be empty after the plus sign, or be identical to the first line (except starting with + instead of @).
In very old FASTQ files the third line often contained a copy of the first line, but this convention was quickly dropped and the third line was left blank after the plus sign.
Here the first line of the first entry contains @SRR2163490_001.1 while the third line contains +SRR2163490.1 1 length=252, which is illegal.
The simplest solution is to delete everything after the + sign on the third line of each entry (but you'll then lack the length and other info there).
Please note that the fourth line in each entry can sometimes start with a plus sign as well, since it is a valid quality score character.
Please see https://en.wikipedia.org/wiki/FASTQ_format for details.
The error is probably in the software that produced the input file (all_samples_concatenated.fastq).
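To see how many entries break this rule before changing anything, a small check along these lines might help. It is only a sketch: the function name count_bad_plus_lines is made up for illustration, and it assumes every entry spans exactly four lines (no wrapped sequence or quality lines).

```shell
# Hypothetical helper: count entries whose '+' line is neither empty
# nor a copy of the header line. Assumes exactly four lines per entry.
count_bad_plus_lines() {
    awk '
        NR % 4 == 1 { header = substr($0, 2) }   # first line without the leading @
        NR % 4 == 3 {
            rest = substr($0, 2)                 # third line without the leading +
            if (rest != "" && rest != header) bad++
        }
        END { print bad + 0 }                    # prints 0 when nothing is wrong
    ' "$@"
}
```

It can be called with a file name (count_bad_plus_lines all_samples_concatenated.fastq) or fed from a pipe. Because the third line is picked by position, a quality line that happens to start with + is never inspected.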
The simplest solution is to delete everything after the + sign on the third line of each entry (but you'll then lack the length and other info there).
sed --in-place 's/^+SRR.*/+/' all_samples_concatenated.fastq
Thanks, but this could remove some of the quality lines (fourth line in entries) as well since + is unfortunately a valid quality character.
Yes, you are right. I wrote it so the pattern also requires the presence of SRR, which is less likely in quality strings. The fact that + and @ can occur in quality lines is an annoying design flaw of the fastq format, and there is no easy workaround.
The way vsearch handles fastq files has already been thoroughly tested, but turning issue tickets into automatic tests is a good practice. So here are three (probably redundant) minimal tests reproducing this particular issue: https://github.com/frederic-mahe/vsearch-tests/commit/0f60a18ce9f2f7e34a35aa0ec61cf09773d62ac5
Yes, you are right! SRR should not appear in ordinary quality strings (the highest quality character is I or J), so it should work in this case.
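Since + can legitimately start a quality line, another option is to select the third line of each entry by position rather than by pattern. The following is a sketch, not a tested fix for this particular data set: blank_plus_lines and the output file name all_samples_fixed.fastq are made-up names, and it assumes every entry is exactly four lines long.

```shell
# Hypothetical helper: blank the third line of every entry by position,
# not by pattern. Assumes exactly four lines per entry.
blank_plus_lines() {
    awk '
        NR % 4 == 3 { print "+"; next }   # third line of each entry becomes a bare +
        { print }                         # header, sequence and quality lines pass through
    ' "$@"
}

# Write to a new file rather than editing in place
# (all_samples_fixed.fastq is an example name).
# blank_plus_lines all_samples_concatenated.fastq > all_samples_fixed.fastq
```

Writing to a new file also sidesteps the in-place flag: the --in-place long option is GNU sed syntax and is not accepted by the BSD sed shipped with macOS.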
@kennyyeo13 I will close this issue. Please feel free to re-open if you have further comments.
Hi everyone,
I have this error when I run: vsearch -fastq_eestats2 /Volumes/Seagate/16S_analysis/done/PRJNA292800/output/all_samples_concatenated.fastq -output /Volumes/Seagate/16S_analysis/done/PRJNA292800/output/all_samples_eestats2.txt
vsearch v2.19.0_macos_x86_64, 16.0GB RAM, 8 cores https://github.com/torognes/vsearch
Reading FASTQ file 0%
Fatal error: Invalid line 3 in FASTQ file: '+' line must be empty or identical to header
Not really sure how I should fix this.
This is the output of: head -n 20 all_samples_concatenated.fastq
@SRR2163490_001.1
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
+SRR2163490.1 1 length=252
BBBBBFFBFFFFGCFGFEGGGGHGGGGGHHHHGHHHGGGGGHHGGGGEGGGGGGGGGGHFHGFDDGHFHFFDFFHFHHGGGFHHGAGFHHHGHFFHHHGDGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGEEGHHHFHGGGE1HHHHHHHHHHHHHGGB?/E0FHGGGGFHFEGFEHHGGCFFHHGHHHHHHHGGEEEGHHFFGGGFGGGGGGCGGGGFFFFBFD>>AAA
@SRR2163490_001.2
TACGTATGTCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGATTGGTCAGTCTGTCTTAAAAGTTCGGGGCTTAACCCCGTGATGGGATGGAAACTGCCAATCTAGAGTATCGGAGAGGAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAAGAACACCAGTGGCGAAGGCGACTTTCTGGACGAAAACTGACGCTGAGGCGCGAAAGCCAGGGGAGCGAACGGG
+SRR2163490.10 10 length=252
CCCCCFFFFFCCFGGGEEGGGGHGGFGGHHHHHHHGGGGGHHHGGGGGFEGEEFFGGGHHHFHHHHHHHHHHHGHHHHHHGGGGB@FGHHHHGGGGGHHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGFGHHHHHGGGGGHHHHHHHHHHHHHHHGHGGGFFGGGGHHGGGFFGHHHHHHGGGHGHHHHHFEEBEFHHGFGGGGGGGGGGGGGGGBBADBFFAABAA
@SRR2163490_001.3
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGCTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
+SRR2163490.100 100 length=252
BBBBBFFAFFFFGGGEEFGGGGHGGGGGHHHHHHHHGGGGHHHGGGGGGGGFGGGGG?GHHHHFGHFEHHFHHGHGHHHHEHHGGHHGHHHHHHGGHHG?FIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGEGHHHFHGGGGFHHHHHHHHHGGGHHGGGFFEEHGGGGGHGGFGGHGGGHHFHFFCHHHHHHGEEEEFHHHHGGHFGGFGGGGGGGGFFFD>FFABBBA
@SRR2163490_001.4
TACGTATGTCGCAAGCGTTATCCGGAATTATTGGGCATAAAGGGCATCTAGGCGGCCAGATAAGTCTGGGGTGAAAACTTGCGGCTCAACCGCAAGCCTGCCCTGGAAACTATGTGGCTAGAGTGCTGGAGAGGTGGACGGAACTGCACGAGTAGAGGTGAAATTCGTAGATATGTGCAGGAATGCCGATGATGAAGATAGTTCACTGGACGGTAACTGACGCTGAAGTGCGAAAGCTAGGGGAGCAAACAGG
+SRR2163490.1000 1000 length=253
CCBCCFFFFFCCGGGFFFGGGGHGGGGGHHHHHHHHHHHGHGHGGGGHHHHHHGGGGGHHHHHHHHHHFGGEFGHHHGHHHHGGGGGHHHHGGGGGHGHGHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGFGFHGFFGHHGCFEHHHHHHHHHHHGHHHHFEE1GGG5HHHHHHHHHHHHHHFBEGEGGHHHHGGCGEAHGHHGEEEFGGGGGGGGGGGFFFFFFFAAAAA
@SRR2163490_001.5
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTGATAAGTCTGAAGTTAAAGGCTGTGGCTCAACCCTAGTTCGCTTTGGAAACTGTCAAACTTGAGTGCAGAAAGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCGAACAGG
+SRR2163490.10000 10000 length=252
AAAAAFF?13DFEEE?EEEEFGHGGGGGGHHGFGGHGEAEGGFGGGGGGGG/EEEEGGEFFHGFHHHHEFDGG2FFGFEGGGHFFGFFHGH/0FB1FG/?AIIIIIIIIIIIIIIIIIIIIIIIIIIII1IIIIIIIIIIIIIIIIIIIIICEEGHHHHHGGEEFFHFHHGHHHGGHHHHEECFEEFEECGFGFB/CEAEF0A0FECFBG1HGFHFEEEEEHCHHGCHGFGGFEGCGFEGAAA>13FA>A1A
Hope someone can help! thanks