Chimera detection --uchime_ref unexpected behaviour

nvucic commented 1 year ago

Hello,

I've been using --uchime_ref to identify chimeras providing a list of reference sequences in FASTA format. I noticed that vsearch does not report match with highest score when there are large number of sequences (~5k) between second best and best match in cases where second match comes first in the FASTA file. I could provide test data if necessary.

Can you suggest any kind of workarund or fix for this?

Best, Nemanja

frederic-mahe commented 1 year ago

Thank you for reporting this. Could you please provide some test data and the exact command line used? I'll try to reproduce the issue, and then make a minimal reproducible example if possible.

nvucic commented 1 year ago

Sure,

First command reports chimera with lower score vsearch --uchime_ref test_ampl.fasta --db test_design.fasta --uchimeout test.chimera_table.not_expected.tsv

When I move desired parent A to the top of the design FASTA the results are as expected vsearch --uchime_ref /opt/test_data/test_ampl.fasta --db /opt/test_data/test_design.rep.fasta --uchimeout test.chimera_table.expected.tsv

frederic-mahe commented 1 year ago

Thanks, I confirm there is something strange here, maybe a bug. The wrong parent A is selected when there are more than 17,195 entries between the two candidates for parent A.

filler.zip

QUERY="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAAGGTCGTTTCACGATCTCGGCGGATACGTCGAAAAACACGGCCTACCTGCAGATGAACTCGCTGCGTGCCGAGGATACGGCCGTGTATTATTGTTCGCGTTGGGGCGGCCTGGGTTTCATGGCGATGGAC"
PARENT_A_HIGH="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAAGGTCGTTTCACGATCTCGGCGGATACGTCGAAAAACACCGCGTACCTGCAGATGAACAGCCTGCGTGCGGAAGATACGGCGGTTTACTATTGCTCGCGCCACGGCGGTGACGGCCACTACGCCATGGAC" # SAMP23224_sim_16939
PARENT_A_LOW="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAGGGTCGTTTCACGATCTCGGCCGATACGTCGAAGAACACCGCGTACTTACAGATGAACAGCCTGCGCGCGGAAGACACGGCGGTGTATTACTGTAGCCGCTGGGGCGGCGCCCTGTTCTACGCGATGGAC" # SAMP23224_sim_13735
PARENT_B="TATCTACCCGACCAATGGCTATACGCGCTACGCCGACTCGGTTAAAGGTCGCTTCACGATCTCGGCGGATACGAGCAAGAACACGGCCTACCTGCAGATGAACTCGCTGCGTGCCGAGGATACGGCCGTGTATTATTGTTCGCGTTGGGGCGGCCTGGGTTTCATGGCGATGGAC" # SAMP23224_sim_16714
FILLER_FILE="filler.fasta"  # 37,198 lines

## PARENT_B PARENT_A_LOW ... no simulated entries ... PARENT_A_HIGH
vsearch \
    --uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
    --db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
           # cat ${FILLER_FILE}
           printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
          ) \
              --quiet \
              --dbmask none \
              --uchimeout -  # score is 2.4872

## PARENT_B PARENT_A_LOW ... some simulated entries ... PARENT_A_HIGH
N_FILLER=34390
vsearch \
    --uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
    --db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
           head -n ${N_FILLER} ${FILLER_FILE}
           printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
          ) \
              --quiet \
              --dbmask none \
              --uchimeout -  # score is 2.4872

## PARENT_B PARENT_A_LOW ... some simulated entries ... PARENT_A_HIGH
N_FILLER=34392
vsearch \
    --uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
    --db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
           head -n ${N_FILLER} ${FILLER_FILE}
           printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
          ) \
              --quiet \
              --dbmask none \
              --uchimeout -  # score is 0.9055

I don't observe that pattern when using monotonous sequences (only 'A') as filler sequences.

torognes commented 1 year ago

I too can confirm that there seems to be a bug here. I'm looking into it.

torognes commented 1 year ago

I've looked at the examples provided by @frederic-mahe, and for those examples I do not think a bug is exposed. It is rather a consequence of the heuristics of the algorithm. Actually, sequence number 17196 in the filler.fasta file with the label SAMP23224_sim_17199 has very strong similarity to the query sequence in one region, causing it to be selected before the best parent sA_high, even though the former is worse overall.

The uchime algorithm divides the query into 4 equally long parts and initially finds a few (actually 4 in vsearch) candidate sequences that are very similar to each part. It then goes on to find the best pair of parents among those (up to 16 sequences).

What happens in @frederic-mahe's example is that a the SAMP23224_sim_17199 sequence is selected among those four for the second part (nucleotides 45-88) because it is 100% identical there. The overall best sequence (sA_high) is slightly less similar in that part (actually 95.45%), and is not among the 4 best hits in that region. In the first region (1-44 bp), they are both 100% identical, and the algorithm then selects the entry that comes first in the file.

I am not sure whether the same happens in the original case presented by @nvucic, but it may be related. Perhaps there are several sequences that are all 100% identical in one of the parts. Then the sequences selected may depend on the order in the file.

So it may be due to the heuristics.

To improve the situation, we could increase the number of sequences considered in each region (which would take more time), or we could sort them differently.

I am currently working on an updated chimera algorithm for long reads, and will have these issues in mind.

nvucic commented 1 year ago

To improve the situation, we could increase the number of sequences considered in each region (which would take more time), or we could sort them differently.

@torognes just enabling this as a parameter would be very useful.

Thank you for looking into this!

torognes / vsearch

Chimera detection --uchime_ref unexpected behaviour #504