torognes / vsearch

Versatile open-source tool for microbiome analysis

Fatal error: Invalid line 3 in FASTQ file: '+' line must be empty or identical to header #470

Closed kennyyeo13 closed 2 years ago

kennyyeo13 commented 2 years ago

Hi everyone,

I get this error when I run:

vsearch -fastq_eestats2 /Volumes/Seagate/16S_analysis/done/PRJNA292800/output/all_samples_concatenated.fastq -output /Volumes/Seagate/16S_analysis/done/PRJNA292800/output/all_samples_eestats2.txt

vsearch v2.19.0_macos_x86_64, 16.0GB RAM, 8 cores https://github.com/torognes/vsearch

Reading FASTQ file 0%

Fatal error: Invalid line 3 in FASTQ file: '+' line must be empty or identical to header

I'm not sure how to fix this.

This is the output of head -n 20 all_samples_concatenated.fastq:

@SRR2163490_001.1
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
+SRR2163490.1 1 length=252
BBBBBFFBFFFFGCFGFEGGGGHGGGGGHHHHGHHHGGGGGHHGGGGEGGGGGGGGGGHFHGFDDGHFHFFDFFHFHHGGGFHHGAGFHHHGHFFHHHGDGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGEEGHHHFHGGGE1HHHHHHHHHHHHHGGB?/E0FHGGGGFHFEGFEHHGGCFFHHGHHHHHHHGGEEEGHHFFGGGFGGGGGGCGGGGFFFFBFD>>AAA
@SRR2163490_001.2
TACGTATGTCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGATTGGTCAGTCTGTCTTAAAAGTTCGGGGCTTAACCCCGTGATGGGATGGAAACTGCCAATCTAGAGTATCGGAGAGGAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAAGAACACCAGTGGCGAAGGCGACTTTCTGGACGAAAACTGACGCTGAGGCGCGAAAGCCAGGGGAGCGAACGGG
+SRR2163490.10 10 length=252
CCCCCFFFFFCCFGGGEEGGGGHGGFGGHHHHHHHGGGGGHHHGGGGGFEGEEFFGGGHHHFHHHHHHHHHHHGHHHHHHGGGGB@FGHHHHGGGGGHHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGFGHHHHHGGGGGHHHHHHHHHHHHHHHGHGGGFFGGGGHHGGGFFGHHHHHHGGGHGHHHHHFEEBEFHHGFGGGGGGGGGGGGGGGBBADBFFAABAA
@SRR2163490_001.3
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGCTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
+SRR2163490.100 100 length=252
BBBBBFFAFFFFGGGEEFGGGGHGGGGGHHHHHHHHGGGGHHHGGGGGGGGFGGGGG?GHHHHFGHFEHHFHHGHGHHHHEHHGGHHGHHHHHHGGHHG?FIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGEGHHHFHGGGGFHHHHHHHHHGGGHHGGGFFEEHGGGGGHGGFGGHGGGHHFHFFCHHHHHHGEEEEFHHHHGGHFGGFGGGGGGGGFFFD>FFABBBA
@SRR2163490_001.4
TACGTATGTCGCAAGCGTTATCCGGAATTATTGGGCATAAAGGGCATCTAGGCGGCCAGATAAGTCTGGGGTGAAAACTTGCGGCTCAACCGCAAGCCTGCCCTGGAAACTATGTGGCTAGAGTGCTGGAGAGGTGGACGGAACTGCACGAGTAGAGGTGAAATTCGTAGATATGTGCAGGAATGCCGATGATGAAGATAGTTCACTGGACGGTAACTGACGCTGAAGTGCGAAAGCTAGGGGAGCAAACAGG
+SRR2163490.1000 1000 length=253
CCBCCFFFFFCCGGGFFFGGGGHGGGGGHHHHHHHHHHHGHGHGGGGHHHHHHGGGGGHHHHHHHHHHFGGEFGHHHGHHHHGGGGGHHHHGGGGGHGHGHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGFGFHGFFGHHGCFEHHHHHHHHHHHGHHHHFEE1GGG5HHHHHHHHHHHHHHFBEGEGGHHHHGGCGEAHGHHGEEEFGGGGGGGGGGGFFFFFFFAAAAA
@SRR2163490_001.5
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTGATAAGTCTGAAGTTAAAGGCTGTGGCTCAACCCTAGTTCGCTTTGGAAACTGTCAAACTTGAGTGCAGAAAGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCGAACAGG
+SRR2163490.10000 10000 length=252
AAAAAFF?13DFEEE?EEEEFGHGGGGGGHHGFGGHGEAEGGFGGGGGGGG/EEEEGGEFFHGFHHHHEFDGG2FFGFEGGGHFFGFFHGH/0FB1FG/?AIIIIIIIIIIIIIIIIIIIIIIIIIIII1IIIIIIIIIIIIIIIIIIIIICEEGHHHHHGGEEFFHFHHGHHHGGHHHHEECFEEFEECGFGFB/CEAEF0A0FECFBG1HGFHFEEEEEHCHHGCHGFGGFEGCGFEGAAA>13FA>A1A

Hope someone can help! Thanks.

torognes commented 2 years ago

Hi

The FASTQ input file format is invalid. The third line in each entry, which starts with a plus sign (+), should either be empty after the plus sign, or be identical to the first line (except starting with + instead of @).

In very old FASTQ files the third line often contained a copy of the first line, but this convention was quickly dropped and the third line was left blank after the plus sign.

Here the first line of the first entry contains @SRR2163490_001.1 while the third line contains +SRR2163490.1 1 length=252, which is illegal.
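
To see how many entries break this rule before changing anything, a small check along these lines may help. It is only a sketch, not part of vsearch: the function name check_plus_lines is made up for illustration, and it assumes every entry occupies exactly four lines (header, sequence, plus line, quality), as in the head output above.

```shell
# Hypothetical helper (not a vsearch feature): list FASTQ entries whose
# third line is neither a bare "+" nor "+" followed by a copy of the header.
# Assumes strict four-line entries; multi-line sequences are not handled.
check_plus_lines() {
    awk '
        NR % 4 == 1 { header = substr($0, 2) }   # header text without the leading "@"
        NR % 4 == 3 {
            rest = substr($0, 2)                 # text after the leading "+"
            if (substr($0, 1, 1) != "+" || (rest != "" && rest != header)) {
                printf "record %d: header \"%s\" but plus line \"%s\"\n", (NR + 1) / 4, header, rest
                bad++
            }
        }
        END { exit (bad > 0) }                   # non-zero exit status if any entry is invalid
    ' "$1"
}
```

Running check_plus_lines all_samples_concatenated.fastq | head would print the first few offending entries; an empty output with exit status 0 means every plus line passes this particular rule.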

The simplest solution is to delete everything after the + sign on the third line of each entry (but you'll then lack the length and other info there).

Please note that the fourth line of an entry (the quality line) can sometimes start with a plus sign as well, since + is a valid quality score character.

Please see https://en.wikipedia.org/wiki/FASTQ_format for details.

The error is probably in the software that produced the input file (all_samples_concatenated.fastq).

frederic-mahe commented 2 years ago

> The simplest solution is to delete everything after the + sign on the third line of each entry (but you'll then lack the length and other info there).

sed --in-place 's/^+SRR.*/+/' all_samples_concatenated.fastq

torognes commented 2 years ago

> The simplest solution is to delete everything after the + sign on the third line of each entry (but you'll then lack the length and other info there).
>
> sed --in-place 's/^+SRR.*/+/' all_samples_concatenated.fastq

Thanks, but this could remove some of the quality lines (fourth line in entries) as well since + is unfortunately a valid quality character.

frederic-mahe commented 2 years ago

Yes, you are right. I wrote it so the pattern also requires the presence of SRR, which is less likely in quality strings. The fact that + and @ can occur in quality lines is an annoying design flaw of the fastq format, and there is no easy workaround.
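
For files laid out like the head output above, with exactly four lines per entry, one way to sidestep the ambiguity is to pick the plus line by position instead of by pattern. The sketch below assumes that strict four-line layout (it would corrupt a file with wrapped sequences); the function name blank_plus_lines and the output file name are placeholders.

```shell
# Hypothetical helper: replace the third line of every four-line entry
# with a bare "+", leaving header, sequence and quality lines untouched.
# Assumes strict four-line entries (header, sequence, plus line, quality).
blank_plus_lines() {
    awk 'NR % 4 == 3 { print "+"; next } { print }' "$1"
}

# Example use (output name is a placeholder); the input file is not modified:
# blank_plus_lines all_samples_concatenated.fastq > all_samples_fixed.fastq
```

Because only the line position is tested, a quality line that happens to start with + is never touched. Writing to a new file also keeps the original intact and avoids relying on --in-place, which is a GNU sed long option; the BSD sed shipped with macOS takes -i followed by a backup suffix instead.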

frederic-mahe commented 2 years ago

The way vsearch handles fastq files has already been thoroughly tested, but turning issue tickets into automatic tests is a good practice. So here are three (probably redundant) minimal tests reproducing this particular issue: https://github.com/frederic-mahe/vsearch-tests/commit/0f60a18ce9f2f7e34a35aa0ec61cf09773d62ac5

torognes commented 2 years ago

> Yes, you are right. I wrote it so the pattern also requires the presence of SRR, which is less likely in quality strings. The fact that + and @ can occur in quality lines is an annoying design flaw of the fastq format, and there is no easy workaround.

Yes, you are right! SRR should not appear in ordinary quality strings (the highest quality character is normally I or J), so it should work in this case.
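
With the usual Phred+33 encoding, I and J correspond to Q40 and Q41, while R and S would mean Q49 and Q50, which is why a quality line starting with +SRR is unlikely. If in doubt, the assumption can be checked before running the sed command. This is a sketch only, again assuming four-line entries; the function name count_risky_quality_lines is made up.

```shell
# Hypothetical helper: count quality lines (every fourth line) that begin
# with "+SRR" and would therefore be rewritten by 's/^+SRR.*/+/'.
# Assumes strict four-line entries.
count_risky_quality_lines() {
    awk 'NR % 4 == 0 && index($0, "+SRR") == 1 { n++ } END { print n + 0 }' "$1"
}
```

A result of 0 suggests the pattern-based sed fix would leave the quality lines of that file alone.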

frederic-mahe commented 2 years ago

@kennyyeo13 I will close this issue. Please feel free to re-open if you have further comments.

frederic-mahe commented 2 years ago

covered by tests https://github.com/frederic-mahe/vsearch-tests/commit/0f60a18ce9f2f7e34a35aa0ec61cf09773d62ac5