torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

always report the rightmost match if multiple equivalent occurrences are present in target sequence? No. #530

Closed frederic-mahe closed 6 months ago

frederic-mahe commented 10 months ago

It seems that, if multiple equivalent occurrences are present in a target sequence, the match happens on the rightmost occurrence (see test below). If this is always correct, it could be interesting to add that information to vsearch's documentation.

SEQUENCE="TCAAGATATTTGCTCGGTAA"

# one occurrence
vsearch \
    --usearch_global <(printf ">s1\n%s\n" "${SEQUENCE}") \
    --db <(printf ">q1\n%s\n" "${SEQUENCE}") \
    --minseqlength 1 \
    --id 1.0 \
    --quiet \
    --userfields target+tilo+tihi \
    --userout -

# two occurrences
vsearch \
    --usearch_global <(printf ">s1\n%s\n" "${SEQUENCE}") \
    --db <(printf ">q1\n%s%s\n" "${SEQUENCE}" "${SEQUENCE}") \
    --minseqlength 1 \
    --id 1.0 \
    --quiet \
    --userfields target+tilo+tihi \
    --userout -

# three occurrences
vsearch \
    --usearch_global <(printf ">s1\n%s\n" "${SEQUENCE}") \
    --db <(printf ">q1\n%s%s%s\n" "${SEQUENCE}" "${SEQUENCE}" "${SEQUENCE}") \
    --minseqlength 1 \
    --id 1.0 \
    --quiet \
    --userfields target+tilo+tihi \
    --userout -

# four occurrences
vsearch \
    --usearch_global <(printf ">s1\n%s\n" "${SEQUENCE}") \
    --db <(printf ">q1\n%s%s%s%s\n" "${SEQUENCE}" "${SEQUENCE}" "${SEQUENCE}" "${SEQUENCE}") \
    --minseqlength 1 \
    --id 1.0 \
    --quiet \
    --userfields target+tilo+tihi \
    --userout -

unset SEQUENCE

The reported match is always the rightmost occurrence on the target sequence:

q1  1   20
q1  21  40
q1  41  60
q1  61  80
frederic-mahe commented 10 months ago

tests added (https://github.com/frederic-mahe/vsearch-tests/commit/b7beb67928ec551994a5222b05bf09ce93beb9d2)

frederic-mahe commented 6 months ago

@torognes it seems to me that the rule is to return the first perfect match found during backtracking? (so the rightmost one on the target sequence). If it is a permanent feature, I would like to mention it in the documentation. Is that ok for you?

Side question: I've been testing the same idea, but with matches on the minus strand of the target sequence. I've observed the same pattern (see https://github.com/frederic-mahe/vsearch-tests/commit/70a6808caf52091c4001eb7249fba47f441b4c6b), which suggests that when searching on the opposite strand, the query sequence is reverse-complemented and the target sequence is left unchanged. Is that correct?

torognes commented 6 months ago

Unfortunately, it is more complex, and the results depends on the rest of the sequences.

A counter-example: If the target/database sequence in the example had some extra non-matching sequence at the 3' end, the first match would be chosen. This is because these are global alignments and when a gap has to be opened at the 3' end anyway, it would rather extend that gap than opening an additional gap in 5' end, because the score would be better.

torognes commented 6 months ago

When aligning and backtracking, the code will always prefer to match the sequences than opening a gap, given that the scores are equal. It starts backtracking at the 3' ends of the sequences, and in the examples it will always start by matching the sequences at the 3' end.

frederic-mahe commented 6 months ago

Thanks @torognes I've updated our test-suite (https://github.com/frederic-mahe/vsearch-tests/commit/bae529a03dca64b9a07d03352c48e14d49dc3aa7) with counter-examples. This is a complex behavior, and I don't see an easy way to describe it in the manpage, at least for now. That issue is complete and can be closed.