torognes / vsearch

Versatile open-source tool for microbiome analysis

Invalid line 28205492 in FASTQ file: Sequence and quality lines must be equally long #492

Closed: elenu closed this issue 2 years ago

elenu commented 2 years ago

Hello everybody,

Hope you are doing well. I was wondering if you could help me with a "fatal error" message that I get after running the vsearch command on amplicon data: vsearch --fastq_chars merged.fq. The message that it returns is:

vsearch v2.14.1_linux_x86_64, 31.2GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading FASTQ file 100% 

Fatal error: Invalid line 28205492 in FASTQ file: Sequence and quality lines must be equally long

The merged.fq file was obtained by running the following usearch command: usearch -fastq_mergepairs *R1_001.fastq -reverse *R2_001.fastq -fastq_eeout -fastq_maxdiffs 10 -fastq_maxmergelen 300 -fastqout merged.fq -relabel @ -report merged.txt

I suspect the issue is with the vsearch step, because I checked the file and it appeared to be empty from line 10290675 onwards. When I saved a copy of the file containing only the content up to that line, I got a different error message mentioning an unexpected end of file.

We had a bioinformatician who used this exact code a while ago, and it worked for him. I tried the option to ignore the message, but at the end of the process I only obtained data for 15 samples. So I decided to run the code line by line instead of running the sh file, and that is how I found the error messages.

I also noticed that the terminal output the bioinformatician shared (a txt file of the messages) contains the line "Lengths min 43, lo_quartile 251, median 251, hi_quartile 251, max 251", and my own output shows the same. I could understand a problem with the minimum length (that would fit the error message), but the same code worked for him.

Any help is welcomed.

Thank you in advance!

torognes commented 2 years ago

Hi

The error message relates to the merged.fq FASTQ file. In the FASTQ files, each entry consists of 4 lines. The second line contains the actual sequence, while the fourth line contains the quality score symbols. These two lines have to be exactly the same length, since each quality score symbol corresponds to each nucleotide symbol. Here, it seems like the number of symbols on line 28205492 (quality score symbols) is different from the number of symbols on line 28205490 (nucleotide symbols). It may be due to a truncated file or some other error.
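The rule described above can be checked with standard tools. The following is a minimal sketch, not part of vsearch: check_fastq is a hypothetical helper name, and it assumes a plain FASTQ file with exactly four lines per record (no wrapped sequences). It reports every record whose quality line differs in length from its sequence line, and also a total line count that is not a multiple of four.

```shell
# Hypothetical helper: verify that sequence and quality lines match in length.
# Assumes exactly four lines per FASTQ record.
check_fastq() {
  awk '
    BEGIN { bad = 0 }
    # Line 2 of each record holds the sequence: remember its length.
    NR % 4 == 2 { seq_len = length($0) }
    # Line 4 of each record holds the quality symbols: compare lengths.
    NR % 4 == 0 && length($0) != seq_len {
      printf "line %d: quality length %d, sequence length %d\n", NR, length($0), seq_len
      bad = 1
    }
    END {
      # A line count that is not a multiple of four means a partial record.
      if (NR % 4 != 0) {
        printf "incomplete last record: %d lines in total\n", NR
        bad = 1
      }
      exit bad
    }
  ' "$1"
}
```

Calling check_fastq merged.fq returns a non-zero exit status if any problem is found, so it can be used as a quick check before handing the file to vsearch.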

I am not sure I understand your description, but if the merged.fq file is only 10290675 lines long there is something more seriously wrong. Could you run wc -l merged.fq to confirm this?

The message Lengths min 43, lo_quartile 251, median 251, hi_quartile 251, max 251 relates to the length distribution in general of the sequences, indicating that almost all of them are 251 nucleotides long, but that the shortest one is just 43 nucleotides long. This is not related to the error message you got.
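For reference, a similar summary of sequence lengths can be approximated with standard tools. This is only a rough sketch and not how the reported statistics were produced: seq_length_summary is a hypothetical name, the file is assumed to have four lines per record, and for an even number of sequences the lower of the two middle values is reported as the median.

```shell
# Hypothetical helper: print count, min, median and max of sequence lengths.
# Assumes exactly four lines per FASTQ record.
seq_length_summary() {
  awk 'NR % 4 == 2 { print length($0) }' "$1" | sort -n | awk '
    { len[NR] = $1 }
    END {
      if (NR == 0) { print "no sequences"; exit 1 }
      # Lengths arrive sorted, so the first is the min and the last is the max.
      printf "n %d, min %d, median %d, max %d\n", NR, len[1], len[int((NR + 1) / 2)], len[NR]
    }'
}
```

Note that this loads all lengths into memory, which should be acceptable for a few million reads but is an assumption worth checking on very large files.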

elenu commented 2 years ago

Hi

Thank you very much for answering all the points.

I have run the wc -l merged.fq command, and the result is:


28205491 merged.fq

I understand this means there are only 28205491 lines, and that the error occurs because vsearch has to stop when it finds no proper line ending at the end of merged.fq. Could this be caused by the previous usearch step? Do you think it is possible to add a proper line ending to merged.fq manually? Thank you.

torognes commented 2 years ago

Sounds like that's the reason, yes.

This command should fix it:

echo >> merged.fq

Be careful to include the double > characters, otherwise the file will become empty.
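One detail that may explain the line numbers: wc -l counts newline characters, so a final line with no terminating newline is not counted. That would be consistent with wc -l reporting 28205491 while vsearch complains about line 28205492. Before appending anything, the last byte of the file can be inspected. A minimal sketch, with ends_with_newline as a hypothetical helper name:

```shell
# Hypothetical helper: succeed only if the file is non-empty and its
# last byte is a newline.
ends_with_newline() {
  # Command substitution strips a trailing newline, so the result is
  # empty exactly when the last byte is a newline.
  [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}
```

For example, ends_with_newline merged.fq || echo "last line is not terminated" prints the message only when the final newline is missing. A missing final newline on its own does not prove truncation, but together with a short quality line it points in that direction.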

elenu commented 2 years ago

Now it returns another error message that says the file is too long.

I'm now trying sed -i '' merged.fq.

torognes commented 2 years ago

You could run tail merged.fq to see what the end of the file looks like.

elenu commented 2 years ago

Good point. It has returned this outcome:

+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@113.7051372;ee=0.029;
GTGTCAGCCGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCGAGTTAAGTCAGCGGTAAAAGCCCGGGGCTCAACCCCGGCCCGCCGTTGAAACTGGCTGGCTTGAGTTGGGGAAAGGCAGGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACCCCGATTGCGAAGGCAGCCTGCCGGCCCCACACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACAGGATTAGAAACCCTAGTAGTCC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@113.7051373;ee=0.032;
GTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGTGAAGTAAGTCTGGAGTGAAAGGCGGGGGCCCAACCCCCGGACTGCTCTGGAAACTATTTGACTGGAGTGCAGGAGAGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACTGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCGTGTAGTCC
+
FFFFFFFFFFFFFFFFFFFFFFFFF

It seems the quality score symbols suddenly stop in the last record.

torognes commented 2 years ago

Yes, seems like the file is truncated.

Perhaps you need to rerun the previous step.

elenu commented 2 years ago

I totally agree.

I notice that the @113 label corresponds to the last sample that was processed (15 samples out of 160 in total).
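To see which samples actually made it into the merged file, the read labels can be tallied. This is a sketch under an assumption taken from the headers shown above: each header looks like @113.7051373;ee=0.029; and the text between the @ and the first dot is the sample label. list_samples is a hypothetical helper name, and the file is assumed to have four lines per record.

```shell
# Hypothetical helper: count reads per sample label.
# Assumes headers of the form @<sample>.<read number>;... on every 4th line.
list_samples() {
  awk 'NR % 4 == 1 {
    label = substr($0, 2)       # drop the leading @ character
    sub(/\..*/, "", label)      # keep only the text before the first dot
    count[label]++
  }
  END { for (s in count) print s "\t" count[s] }' "$1" | sort
}
```

Running list_samples merged.fq | wc -l then gives the number of distinct sample labels present, which can be compared with the number of samples expected.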

I'll therefore focus on the previous step.

Thank you for all the help!

frederic-mahe commented 2 years ago

Regression tests based on these comments were added to the vsearch test suite: https://github.com/frederic-mahe/vsearch-tests/commit/e1a4de16cd5d185db0beebb3f750ffb99de6bd45