Hi!
Thanks for reporting this problem.
It's good that you have included the example files, but could you please also provide the exact command you used?
The sintax algorithm has a random element to it, and the confidence values you see may vary slightly from one run to another unless you set a specific random seed with the --randseed option and use only a single thread with the --threads 1 option. Could this be the reason for the variation you see?
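To illustrate why a bootstrap-style confidence value depends on the random seed, here is a toy sketch in Python. It is not vsearch's implementation: the function names, the k-mer length, the sample size, and the number of iterations are all assumptions chosen for illustration. Each iteration draws a random subset of the query's k-mers, picks the reference sharing the most of them, and the confidence is the fraction of iterations each reference wins.

```python
import random


def kmers(seq, k):
    """Return the set of all k-mers (words of length k) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_confidence(query, references, k=8, sample_size=32,
                   iterations=100, seed=None):
    """Toy bootstrap: fraction of iterations in which each reference wins.

    Hypothetical helper for illustration only, not the vsearch code.
    """
    rng = random.Random(seed)  # a fixed seed makes the draws repeatable
    # Sort the words so the sampling order does not depend on set ordering.
    query_words = sorted(kmers(query, k))
    ref_words = {name: kmers(seq, k) for name, seq in references.items()}
    wins = {name: 0 for name in references}
    for _ in range(iterations):
        sample = rng.sample(query_words, min(sample_size, len(query_words)))
        # Ties go to the first reference in dictionary order.
        best = max(ref_words,
                   key=lambda name: sum(w in ref_words[name] for w in sample))
        wins[best] += 1
    return {name: count / iterations for name, count in wins.items()}
```

Calling toy_confidence twice with the same seed returns identical values, while different seeds (or no seed) may shift them by a few hundredths, which is the kind of variation discussed here.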
Hello Torbjørn,
Thanks for getting back to me so quickly!
It was indeed the --randseed option that made the difference!
Here is the command I originally ran:
vsearch --sintax example_sequence.fasta --db reference_with_domain.fasta --tabbedout withDomain.tsv
But when I added --randseed, like so:
vsearch --sintax example_sequence.fasta --db reference_with_domain.fasta --tabbedout withDomain.tsv --randseed 1
the results were identical between the references with and without domain (running both with the --randseed 1 option).
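For anyone who wants to script the comparison, a minimal Python sketch could look like the following. It only uses the options mentioned in this thread (--sintax, --db, --tabbedout, --randseed, --threads); the helper names and the output file name noDomain.tsv are assumptions, and vsearch is assumed to be on the PATH.

```python
import subprocess


def sintax_command(query, db, out, seed=1, threads=1):
    """Build a vsearch sintax command with a fixed seed and a single thread."""
    return [
        "vsearch", "--sintax", query,
        "--db", db,
        "--tabbedout", out,
        "--randseed", str(seed),
        "--threads", str(threads),
    ]


def run_sintax(query, db, out, seed=1, threads=1):
    """Run vsearch (assumed to be installed) and raise if it fails."""
    subprocess.run(sintax_command(query, db, out, seed, threads), check=True)


# Example usage (not executed here); noDomain.tsv is a hypothetical name:
# run_sintax("example_sequence.fasta", "reference_no_domain.fasta", "noDomain.tsv")
# run_sintax("example_sequence.fasta", "reference_with_domain.fasta", "withDomain.tsv")
```

Using the same seed and a single thread for both runs is what should make the two outputs comparable.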
I checked the vsearch --help documentation, but the --randseed option is only listed under "Subsampling" and "Shuffling and sorting", not under "Taxonomic classification". I was not aware that the sintax algorithm uses this option; it might be helpful for sintax users if you added it under "Taxonomic classification" as well.
Thanks a lot for this easy fix!
Best, Chris
Hello @hempelc, thank you for your feedback. The help message (https://github.com/torognes/vsearch/commit/25ad4019e572f5b73f3ae08e442e8a3d09270a40) and the manpage have been updated. This should be included in the next vsearch release.
Hello,
I'm using vsearch sintax to assign sequences taxonomically. One of the reference databases I'm using does not contain the domain rank for references, only the phylum rank and the ranks below it. I've been asked to manually add the domain rank, so I did, but I noticed that the confidence values of the sintax results have changed slightly for some sequences.
I have attached example files for reproducibility here. When I assign taxonomy to example_sequence.fasta using reference_no_domain.fasta, the result is:
p:Ascomycota(1.00),c:Eurotiomycetes(1.00),o:Eurotiales(1.00),f:Trichocomaceae(1.00),g:Talaromyces(0.69),s:Talaromyces_marneffei(0.69)
However, when I use reference_with_domain.fasta, I get:
d:Eukaryota(1.00),p:Ascomycota(1.00),c:Eurotiomycetes(1.00),o:Eurotiales(1.00),f:Trichocomaceae(1.00),g:Talaromyces(0.70),s:Talaromyces_marneffei(0.70)
Note that the confidence value changed by 0.01 at the genus and species levels (0.69 without the domain rank, 0.70 with it). I have observed some cases in which the difference is even bigger. Note that all sequences in the references are from Eukaryota in this example.
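The ranks at which the two predictions disagree can also be located programmatically. Below is a small Python sketch that parses the two prediction strings quoted above and reports the per-rank difference. The function names are hypothetical, and the regular expression assumes taxon names contain no commas or parentheses.

```python
import re

# One entry looks like "g:Talaromyces(0.69)": rank letter, name, confidence.
RANK_PATTERN = re.compile(r"([a-z]):([^,()]+)\(([0-9.]+)\)")


def parse_sintax(prediction):
    """Map each rank letter to a (taxon name, confidence) pair."""
    return {rank: (name, float(conf))
            for rank, name, conf in RANK_PATTERN.findall(prediction)}


def confidence_differences(first, second):
    """Return second-minus-first confidence for shared ranks that differ."""
    a, b = parse_sintax(first), parse_sintax(second)
    return {rank: round(b[rank][1] - a[rank][1], 2)
            for rank in a
            if rank in b and a[rank][1] != b[rank][1]}


no_domain = ("p:Ascomycota(1.00),c:Eurotiomycetes(1.00),o:Eurotiales(1.00),"
             "f:Trichocomaceae(1.00),g:Talaromyces(0.69),"
             "s:Talaromyces_marneffei(0.69)")
with_domain = ("d:Eukaryota(1.00),p:Ascomycota(1.00),c:Eurotiomycetes(1.00),"
               "o:Eurotiales(1.00),f:Trichocomaceae(1.00),g:Talaromyces(0.70),"
               "s:Talaromyces_marneffei(0.70)")

print(confidence_differences(no_domain, with_domain))
# {'g': 0.01, 's': 0.01}
```

Ranks present in only one prediction (here the domain rank) are skipped, so only the genus and species levels are reported.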
I've tried to understand the algorithm via the usearch website and sintax paper but was unable to find anything that hinted at an explanation for this. Would you be able to explain to me how the addition of the domain rank impacts the confidence values? Thank you!