SINTAX taxonomic annotation changes when adding domain rank

torognes / vsearch

Versatile open-source tool for microbiome analysis

SINTAX taxonomic annotation changes when adding domain rank #578

Closed hempelc closed 2 weeks ago

hempelc commented 3 weeks ago

Hello,

I'm using vsearch sintax to assign taxonomy to sequences. One of the reference databases I'm using does not include the domain rank in its reference annotations, only phylum and the ranks below it. I was asked to add the domain rank manually, so I did, but I noticed that the confidence values in the sintax results changed slightly for some sequences.

I have attached example files for reproducibility here. When I assign taxonomy to example_sequence.fasta using reference_no_domain.fasta, the result is: p:Ascomycota(1.00),c:Eurotiomycetes(1.00),o:Eurotiales(1.00),f:Trichocomaceae(1.00),g:Talaromyces(0.69),s:Talaromyces_marneffei(0.69)

However, when I use reference_with_domain.fasta, I get: d:Eukaryota(1.00),p:Ascomycota(1.00),c:Eurotiomycetes(1.00),o:Eurotiales(1.00),f:Trichocomaceae(1.00),g:Talaromyces(0.70),s:Talaromyces_marneffei(0.70)

Note that the confidence value differs by 0.01 at the genus and species levels (0.69 without the domain rank, 0.70 with it). I have observed some cases in which the difference is even larger. Note that all sequences in the references are from Eukaryota in this example.
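To see exactly which ranks differ between the two results quoted above, the annotation strings can be parsed and compared rank by rank. The following is a minimal sketch, assuming each item has the form rank:name(confidence) as in the output shown; the function names parse_sintax and confidence_differences are hypothetical helpers, not part of vsearch.

```python
import re

# Matches one "rank:name(confidence)" item, e.g. "g:Talaromyces(0.69)".
# Assumption: ranks are single lowercase letters, as in the output above.
ITEM = re.compile(r"^([a-z]):(.+)\(([0-9.]+)\)$")


def parse_sintax(annotation):
    """Return {rank: (name, confidence)} for one SINTAX prediction string."""
    parsed = {}
    for item in annotation.split(","):
        match = ITEM.match(item.strip())
        if match is None:
            raise ValueError(f"unexpected item: {item!r}")
        rank, name, conf = match.groups()
        parsed[rank] = (name, float(conf))
    return parsed


def confidence_differences(a, b):
    """Return {rank: (conf_a, conf_b)} for shared ranks whose confidences differ."""
    pa, pb = parse_sintax(a), parse_sintax(b)
    return {
        rank: (pa[rank][1], pb[rank][1])
        for rank in pa
        if rank in pb and pa[rank][1] != pb[rank][1]
    }


# The two results reported in this issue.
no_domain = (
    "p:Ascomycota(1.00),c:Eurotiomycetes(1.00),o:Eurotiales(1.00),"
    "f:Trichocomaceae(1.00),g:Talaromyces(0.69),s:Talaromyces_marneffei(0.69)"
)
with_domain = (
    "d:Eukaryota(1.00),p:Ascomycota(1.00),c:Eurotiomycetes(1.00),o:Eurotiales(1.00),"
    "f:Trichocomaceae(1.00),g:Talaromyces(0.70),s:Talaromyces_marneffei(0.70)"
)

print(confidence_differences(no_domain, with_domain))
```

Ranks present in only one of the two strings (here the domain rank) are ignored, so only the genus and species entries are reported.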

I've tried to understand the algorithm via the usearch website and sintax paper but was unable to find anything that hinted at an explanation for this. Would you be able to explain to me how the addition of the domain rank impacts the confidence values? Thank you!

torognes commented 2 weeks ago

Hi!

Thanks for reporting this problem.

It's good that you have included the example files, but could you please also provide the exact command you used?

The sintax algorithm has a random element to it and the confidence values you see may vary slightly from one run to another unless you set a specific random seed with the --randseed option and use only a single thread with the --threads 1 option. Could this be the reason for the variation you see?
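As a rough illustration of why a bootstrap-style confidence value depends on the random seed, here is a toy sketch. It is not vsearch's implementation: the k-mer length, the number of k-mers drawn per iteration, the number of iterations, sampling with replacement, and the random tie-breaking are all assumptions made for this example, and the sequences are randomly generated.

```python
import random
from collections import Counter

K = 8             # k-mer length (assumed for this toy example)
SUBSAMPLE = 32    # k-mers drawn per bootstrap iteration (assumed)
ITERATIONS = 100  # number of bootstrap iterations (assumed)


def kmers(seq, k=K):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bootstrap_confidence(query, references, seed=None):
    """Return (best_label, confidence) for a query against {label: sequence}."""
    rng = random.Random(seed)
    query_kmers = sorted(kmers(query))  # sorted so sampling is reproducible
    ref_kmers = {label: kmers(seq) for label, seq in references.items()}
    votes = Counter()
    for _ in range(ITERATIONS):
        # Draw a random subset of the query k-mers (with replacement).
        sample = rng.choices(query_kmers, k=SUBSAMPLE)
        scores = {
            label: sum(word in words for word in sample)
            for label, words in ref_kmers.items()
        }
        best = max(scores.values())
        winners = [label for label, score in scores.items() if score == best]
        votes[rng.choice(winners)] += 1  # break ties at random
    label, count = votes.most_common(1)[0]
    return label, count / ITERATIONS


# Toy data: two closely related references and a query derived from the first.
gen = random.Random(0)
ref_a = "".join(gen.choice("ACGT") for _ in range(200))
ref_b = "".join(gen.choice("ACGT") if gen.random() < 0.05 else base for base in ref_a)
query = "".join(gen.choice("ACGT") if gen.random() < 0.05 else base for base in ref_a)
references = {"ref_a": ref_a, "ref_b": ref_b}

# Different seeds may give slightly different confidences; a fixed seed is repeatable.
for seed in (1, 2, 3):
    print(seed, bootstrap_confidence(query, references, seed=seed))
```

Because the confidence is the fraction of random iterations that agree on the winning label, it can shift by a few hundredths between runs unless the seed is fixed.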

Torbjørn

hempelc commented 2 weeks ago

Hello Torbjørn,

Thanks for getting back to me so quickly!

It was indeed the --randseed option that made the difference!

Here is the command I originally ran:

    vsearch --sintax example_sequence.fasta --db reference_with_domain.fasta --tabbedout withDomain.tsv

But when I added --randseed like so:

    vsearch --sintax example_sequence.fasta --db reference_with_domain.fasta --tabbedout withDomain.tsv --randseed 1

the results were identical between the references with and without domain (running both with the --randseed 1 option).
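For anyone scripting this comparison, the sketch below builds the same command for both references and also adds --threads 1, as suggested earlier in the thread. It uses only the options mentioned in this issue; the helper name sintax_command and the output file name noDomain.tsv are hypothetical, and the commands are only run if you call run_both() yourself with vsearch installed and the example files present.

```python
import subprocess


def sintax_command(query, db, out, seed=1):
    """Build a vsearch --sintax command with a fixed seed and a single thread."""
    return [
        "vsearch", "--sintax", query,
        "--db", db,
        "--tabbedout", out,
        "--randseed", str(seed),
        "--threads", "1",
    ]


# Reference file -> output file. "noDomain.tsv" is a name chosen for this example.
RUNS = {
    "reference_no_domain.fasta": "noDomain.tsv",
    "reference_with_domain.fasta": "withDomain.tsv",
}


def run_both(query="example_sequence.fasta", seed=1):
    """Run sintax against both references with the same seed."""
    for db, out in RUNS.items():
        subprocess.run(sintax_command(query, db, out, seed), check=True)


# Print the commands without running them.
for db, out in RUNS.items():
    print(" ".join(sintax_command("example_sequence.fasta", db, out)))
```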

I checked the vsearch --help documentation but the --randseed option is only listed under "Subsampling" and "Shuffling and sorting", not under "Taxonomic classification". I was not aware that the sintax algorithm involves this option; it might be helpful for sintax users if you added that option under "Taxonomic classification" as well.

Thanks a lot for this easy fix!

Best, Chris

frederic-mahe commented 2 weeks ago

Hello @hempelc, thank you for your feedback. The help message (https://github.com/torognes/vsearch/commit/25ad4019e572f5b73f3ae08e442e8a3d09270a40) and the manpage have been updated:

https://github.com/torognes/vsearch/blob/9ca1412aa1cf5ba31e6e35a590c7a660ba9d2182/man/vsearch.1#L3626-L3631

This should be included in the next vsearch release.