torognes / vsearch

Versatile open-source tool for microbiome analysis

sintax classifier and multiple identical best hits #325

Open diegomic opened 6 years ago

diegomic commented 6 years ago

Dear @torognes,

Using the sintax classifier, I noticed that when there are multiple identical best hits, the algorithm outputs only the first hit and ignores the ones after it. This may result in a wrong classification if several species have the same sequence in the reference db. In these cases it would probably be better to report the lowest common ancestor of the ambiguous hits. A similar issue was already reported in issue #210 by @andzandz11.

Thank you very much, cheers,
Diego
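To make the suggestion concrete, here is a minimal sketch of the lowest-common-ancestor idea, not vsearch's actual behaviour. It assumes the tied hits carry SINTAX-style annotations (`d:...,p:...,g:...,s:...`); the function names and the taxon names in the example are hypothetical.

```python
def parse_lineage(annotation):
    """Split a SINTAX-style annotation such as 'd:Bacteria,p:X' into ranks."""
    return [rank for rank in annotation.split(",") if rank]


def lowest_common_ancestor(annotations):
    """Return the ranks shared by all tied hits, from the root downwards."""
    lineages = [parse_lineage(a) for a in annotations]
    if not lineages:
        return []
    common = []
    # zip stops at the shortest lineage, so missing ranks end the prefix.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            common.append(ranks[0])
        else:
            break
    return common


# Hypothetical example: two references with identical sequences
# that differ only at species level.
tied_hits = [
    "d:Bacteria,p:Phylum_X,g:Genus_Y,s:Species_A",
    "d:Bacteria,p:Phylum_X,g:Genus_Y,s:Species_B",
]
print(",".join(lowest_common_ancestor(tied_hits)))
# prints: d:Bacteria,p:Phylum_X,g:Genus_Y
```

With this approach the classification would stop at genus level for the example above instead of arbitrarily reporting the first species.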

colinbrislawn commented 6 years ago

This is fascinating. The sintax algorithm was designed to mitigate over-classification, so I had to go back to the preprint to take a look at why this could be happening.

SINTAX algorithm For a query sequence Q and reference database R...

Turns out that the subsampling is applied to each query sequence, but the reference database is not subsampled or shuffled. So sintax is unable to choose between two identical sequences in the reference database.

This makes sense to me; if your database includes identical references (in the area sequenced), no tax assigner will be able to tell them apart, because they are identical!
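One way to check whether a database is affected is to look for identical reference sequences that carry different taxonomy. The following is a rough sketch under stated assumptions: headers follow the SINTAX convention (`>id;tax=...;`), the helper names are made up for illustration, and the records are invented example data. It compares whole sequences only, so the references would first need to be trimmed to the sequenced region to catch sequences that are identical only there.

```python
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def taxonomy_of(header):
    """Extract the value of the 'tax=' field from a SINTAX-style header."""
    for field in header.split(";"):
        if field.startswith("tax="):
            return field[len("tax="):]
    return ""


def ambiguous_references(handle):
    """Map each sequence that has more than one distinct taxonomy to them."""
    taxa_by_seq = defaultdict(set)
    for header, seq in read_fasta(handle):
        taxa_by_seq[seq.upper()].add(taxonomy_of(header))
    return {s: sorted(t) for s, t in taxa_by_seq.items() if len(t) > 1}


# Invented example database: ref1 and ref2 share the same sequence.
example_db = io.StringIO(
    ">ref1;tax=d:Bacteria,g:Genus_Y,s:Species_A;\nACGTACGT\n"
    ">ref2;tax=d:Bacteria,g:Genus_Y,s:Species_B;\nACGTACGT\n"
    ">ref3;tax=d:Bacteria,g:Genus_Z,s:Species_C;\nTTTTACGT\n"
)
conflicts = ambiguous_references(example_db)
for seq, taxa in conflicts.items():
    print(seq, taxa)
```

Any sequence reported here is one where a classifier has no information to prefer one annotation over the other.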

I guess the goal would be to detect and report these multiple best hits (as with BLAST-style output, see #210), or to report a lower confidence for this prediction.

Colin

torognes commented 6 years ago

I will consider trying to improve the sintax algorithm at a later time.

cjfields commented 4 years ago

Just a note that I am also seeing something that is likely due to this issue. I recently did a (rough) comparison of Illumina V4 and PacBio full-length 16S using three classifiers: SINTAX gave almost equivalent results for both, while dada2 and QIIME2 showed significant differences based on the length of the target, which I expected. In particular, the species-level assignment rate with SINTAX was very high (>60%) for the ~250 nt V4 region.

torognes commented 4 months ago

I have made several improvements to the sintax command in vsearch 2.28.1, just released. Please see issue #535 or the release notes for details.