torognes / vsearch

Versatile open-source tool for microbiome analysis

Perfect overlaps not getting merged together. #491

Closed by FlintMitchell 1 year ago

FlintMitchell commented 2 years ago

I have some paired-end reads that overlap, but I am unable to merge any of the reads. For example, taking one read from these files:

test_R1.fastq:

@xxx:1010:xxx
CAGGTCCATCGATTGTTTCTGCGGACGGTGTTGTCCTCATAGTTTGGGCATGTTTCGCTTCCAGCCCAGCCAAACTTGTCAACCAGTATCCCGGTGCAGGAGCTGCACATACTAGCCCCTGTCTAGGACCCGCTGTCCTATAACGAAATCT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

test_R2.fastq:

@xxx:1010:xxx
TCTGCTGCTCCCCGGGTGTGGCTCCTTCATCTGACAACGTGCAACCCCTATCGCGATGGCAAAGGAAAGGAAGCCCTGCTTCCTCCAGATTTCGTTATAGGACAGCGGGATCTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF::,F,F:FFF::F:F:F::FFFF::::,:FFF:F,FFFF

These two reads share a perfectly matching 23 bp stretch once one of them is reverse-complemented: AGATTTCGTTATAGGACAGCGGG in R2 and CCCGCTGTCCTATAACGAAATCT at the 3' end of R1.

R2:

TCTGCTGCTCCCCGGGTGTGGCTCCTTCATCTGACAACGTGCAACCCCTATCGCGATGGCAAAGGAAAGGAAGCCCTGCTTCCTCCAGATTTCGTTATAGGACAGCGGGATCTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

R1:

CAGGTCCATCGATTGTTTCTGCGGACGGTGTTGTCCTCATAGTTTGGGCATGTTTCGCTTCCAGCCCAGCCAAACTTGTCAACCAGTATCCCGGTGCAGGAGCTGCACATACTAGCCCCTGTCTAGGACCCGCTGTCCTATAACGAAATCT

The overlaps match like so (R2 on top, reversed R1 below):

    ...AAGCCCTGCTTCCTCCAGATTTCGTTATAGGACAGCGGGATCTTTTCT...
                       TCTAAAGCAATATCCTGTCGCCCAGGATCTGTCCCCGATC...

When using:

    vsearch --fastq_mergepairs test_R1.fastq --reverse test_R2.fastq --fastqout testmerge.fastq

I get

Merging reads 100%  
         1  Pairs
         0  Merged (0.0%)
         1  Not merged (100.0%)

Pairs that failed merging due to various reasons:
         1  alignment score too low, or score drop to high

Statistics of all reads:
    151.00  Mean read length

Statistics of merged reads:
       nan  Mean fragment length
       nan  Standard deviation of fragment length
       nan  Mean expected error in forward sequences
       nan  Mean expected error in reverse sequences
       nan  Mean expected error in merged sequences
       nan  Mean observed errors in merged region of forward sequences
       nan  Mean observed errors in merged region of reverse sequences
       nan  Mean observed errors in merged region

Combining some flags that were used in other examples to make the tool less strict:

    vsearch --fastq_mergepairs test_R1.fastq --reverse test_R2.fastq --fastqout testmerge.fastq --fastq_allowmergestagger --fastq_maxdiffs 30 --fastq_minovlen 5 --fastq_qmin 0

gives the same result: the two reads are not merged.

When I run the above command on the FASTQ files that contain all of the reads, none of the 50k+ pairs merge, even though I used grep to confirm that the overlapping sequence is present in more than 40k of them!
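For anyone wanting to repeat that kind of presence check without grep, here is a minimal Python sketch. It is only an assumption about what such a check could look like, not the command actually used in this thread: the function names are hypothetical, `k = 23` is taken from the single example pair above, and the parser assumes strict four-line FASTQ records with R1 and R2 in the same order. Note that it only tests whether the reverse complement of R1's last `k` bases occurs somewhere in R2; it says nothing about whether the read ends line up.

```python
from typing import Iterable, Iterator, TextIO, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def count_pairs_sharing_tail(pairs: Iterable[Tuple[str, str]], k: int = 23) -> Tuple[int, int]:
    """Return (pairs seen, pairs where rc(last k bases of R1) occurs in R2)."""
    seen = hits = 0
    for r1, r2 in pairs:
        seen += 1
        if len(r1) >= k and reverse_complement(r1[-k:]) in r2:
            hits += 1
    return seen, hits


# Usage with the file names from the example above:
# with open("test_R1.fastq") as f1, open("test_R2.fastq") as f2:
#     print(count_pairs_sharing_tail(zip(fastq_sequences(f1), fastq_sequences(f2))))
```

A high hit count from a check like this shows that the shared sequence exists in the pairs, but not where it sits inside R2.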

Any help would be greatly appreciated.

torognes commented 2 years ago

I think it refuses to merge them due to the non-matching tail of T's (ATCTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT) that would need to be clipped. The ends of the sequences must match for the sequences to be merged.
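The point about the tail can be illustrated with a small Python sketch. This is not vsearch's merging algorithm; it is an assumed, simplified check that uses `difflib` from the standard library to find the longest exact block shared by R1 and the reverse complement of R2, and then reports how much of R2 lies beyond that block. The variable names are hypothetical; the two sequences are copied from the example pair.

```python
from difflib import SequenceMatcher

R1 = "CAGGTCCATCGATTGTTTCTGCGGACGGTGTTGTCCTCATAGTTTGGGCATGTTTCGCTTCCAGCCCAGCCAAACTTGTCAACCAGTATCCCGGTGCAGGAGCTGCACATACTAGCCCCTGTCTAGGACCCGCTGTCCTATAACGAAATCT"
R2 = "TCTGCTGCTCCCCGGGTGTGGCTCCTTCATCTGACAACGTGCAACCCCTATCGCGATGGCAAAGGAAAGGAAGCCCTGCTTCCTCCAGATTTCGTTATAGGACAGCGGGATCTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


r2_rc = reverse_complement(R2)

# Longest exact block shared by R1 and the reverse-complemented R2.
match = SequenceMatcher(None, R1, r2_rc, autojunk=False).find_longest_match(
    0, len(R1), 0, len(r2_rc)
)
shared = R1[match.a:match.a + match.size]

# Bases at the 3' end of R2 that lie beyond the shared block.
r2_tail = R2[len(R2) - match.b:]

# R2 with that tail removed, as a trimming step might leave it.
r2_trimmed = R2[:len(R2) - match.b]

print("shared block length:", match.size)
print("shared block reaches the 3' end of R1:", match.a + match.size == len(R1))
print("unmatched R2 tail length:", len(r2_tail))
print("unmatched R2 tail:", r2_tail)
print("trimmed R2 begins at R1's 3' end:", reverse_complement(r2_trimmed).startswith(shared))
```

The shared block ends exactly at the 3' end of R1, but R2 continues past it into the T-rich tail, so the ends of the two reads do not coincide. With the tail removed, the reverse complement of R2 begins with the same block that ends R1. Whether vsearch then merges the pair is not confirmed in this thread.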