torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
656 stars 122 forks source link

fastq_mergepairs Fatal error: More reverse reads than forward reads #512

Closed andrewd789 closed 1 year ago

andrewd789 commented 1 year ago

Hello,

I have encountered a puzzling error while using -fastq_mergepairs in vsearch v2.21.1_linux_x86_64. It fails with the error message "Fatal error: More reverse reads than forward reads", in some cases where there are indeed more reverse reads than forward reads. But not in other cases where there are also more reverse than forward reads. For example, this produces the fatal error:

grep -c @ Demultiplexed_LCO1490-FC/1001_pos2_r1.fq
18
grep -c @ Demultiplexed_LCO1490-FC/1001_pos2_r2.fq
29
vsearch -fastq_mergepairs 1001_pos2_r1.fq -reverse 1001_pos2_r2.fq -fastaout Testing.fasta # Fatal error

So does this:

grep -c @ 1004_pos_r1.fq
3822
grep -c @ 1004_pos_r2.fq
5245
vsearch -fastq_mergepairs 1004_pos_r1.fq -reverse 1004_pos_r2.fq -fastaout Testing.fasta # Fatal error

But this merges successfully, despite also having more reverse reads than forward reads:

grep -c @ 1003_pos_r1.fq
1778
grep -c @ 1003_pos_r2.fq
1908
vsearch -fastq_mergepairs 1003_pos_r1.fq -reverse 1003_pos_r2.fq -fastaout Testing.fasta # Works

Why does this error occur only in some cases with more reverse reads than forward reads?

torognes commented 1 year ago

Are you sure about the number of entries in these files? In general, using grep -c @ filename.fastq is not a reliable indicator of the number of entries in a FASTQ file because the @ character may also appear in the fourth line as a base quality indicator. I would recommend using wc -l filename.fastq and divide by four, which is usually correct, but not technically in all cases.

If you still find that vsearch does not work correctly, I could investigate, but then it would be an advantage if I could have a look at the file, or a small example file that triggers the incorrect behaviour.

frederic-mahe commented 1 year ago

@andrewd789 could you please try with grep -c "^@" filename.fastq to anchor the pattern and limit the number of false-positives?

@ is Q = 31 in the Phred+33 range of quality values.

frederic-mahe commented 1 year ago

also, vsearch --fastq_stats filename.fastq can reliably count the number of fastq entries.

andrewd789 commented 1 year ago

Humble apologies, I forgot that @ could also be a fastq quality value. Using grep -c @M I can now see that file pairs that do merge have equal numbers of r1 and r2 reads, and those that do not merge have unequal numbers of r1 and r2 reads.

This does make me wonder, however, why differing numbers of sequence reads are necessarily a problem? I imagine one could match the sequence headers in an r1 file with those in an r2 file and thus identify the reads occurring in both that can be merged?

torognes commented 1 year ago

It could be done, but would make the merging slower and more complicated. Usearch has a command called fastx_syncpairs than can be used to prepare read files for merging, in case there are different number of reads in the two files. It will sort reads and discard unpaired reads.

andrewd789 commented 1 year ago

Thank you for the explanation, and for pointing out fastx_syncpairs. Perhaps it would be good for this function to be added to VSEARCH someday?