torognes / vsearch

Versatile open-source tool for microbiome analysis

--uchime_denovo takes abundance information into account #537

frederic-mahe closed 1 year ago

frederic-mahe commented 1 year ago

On the vsearch forum, user Emily Van Syoc asked how to interpret the summary message produced by the --uchime_denovo command.

Here is a toy example with three unique sequences (or cluster representatives) representing 100 reads in total: parentA with 50 reads, parentB with 49 reads, and chimeraAB with 1 read. chimeraAB is a chimera of parentA and parentB (the first half of parentA followed by the second half of parentB):

#        1...5...10...15...20...25...30...35
A_START="TCCAGCTCCAATAGCGTATACTAAAGTTGTTGC"
B_START="AGTTCATGGGCAGGGGCTCCCCGTCATTTACTG"
A_END=$(rev <<< ${A_START})
B_END=$(rev <<< ${B_START})

(
    printf ">parentA;size=50\n%s\n" "${A_START}${A_END}"
    printf ">parentB;size=49\n%s\n" "${B_START}${B_END}"
    printf ">chimeraAB;size=1\n%s\n" "${A_START}${B_END}"
) | \
    vsearch \
        --uchime_denovo - \
        --uchimeout /dev/null
Found 1 (33.3%) chimeras, 2 (66.7%) non-chimeras,
and 0 (0.0%) borderline sequences in 3 unique sequences.
Taking abundance information into account, this corresponds to
1 (1.0%) chimeras, 99 (99.0%) non-chimeras,
and 0 (0.0%) borderline sequences in 100 total sequences.

As expected, one of the three sequences is marked as a chimera (33.3%). When taking into account the number of reads each sequence represents, the percentage of the dataset marked as chimeric is only 1% (1 read out of 100). Discarding chimeras preserves 99.0% of the initial dataset (99 reads, represented by parentA and parentB).
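
The abundance-weighted figures can be checked by hand from the ;size= annotations. The snippet below is only a sketch of that arithmetic: sum_sizes is a hypothetical helper written for this example (it is not part of vsearch), and the headers are copied from the toy example above rather than read from vsearch output.

```shell
# Hypothetical helper (not part of vsearch): sum the ;size= annotations
# of the FASTA headers read from stdin.
sum_sizes() {
    awk -F';size=' '/^>/ { total += $2 } END { print total + 0 }'
}

# all three unique sequences from the toy example
TOTAL=$(printf ">parentA;size=50\n>parentB;size=49\n>chimeraAB;size=1\n" | sum_sizes)

# only the sequence flagged as chimeric
CHIMERIC=$(printf ">chimeraAB;size=1\n" | sum_sizes)

# abundance-weighted share of chimeric reads
awk -v c="${CHIMERIC}" -v t="${TOTAL}" \
    'BEGIN { printf "%d (%.1f%%) chimeras in %d total sequences\n", c, 100 * c / t, t }'
```

With the toy data, the sums are 100 reads in total and 1 chimeric read, which matches the 1.0% reported in the summary message.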

frederic-mahe commented 1 year ago

Test added to our test suite: https://github.com/frederic-mahe/vsearch-tests/commit/e918c168915525277de74d91d16144a45c55a46a