torognes / vsearch

Versatile open-source tool for microbiome analysis

Mismatches in taxonomic ranks with Sintax #573

Closed by ashleyp1 3 weeks ago

ashleyp1 commented 2 months ago

I encountered some confusing results while testing sintax on my data. I'm running vsearch v2.28.1 on near-full-length 16S amplicons against a custom database. For some of my samples (mostly ones without high confidence values), I get mixed taxonomies that seem to jump between lineages, like below.

0faf4970-8f6a-4a6c-9d55-26f7c80d50fc d:Bacteria(1.00),p:Firmicutes(1.00),c:Bacilli(1.00),o:Bacillales(0.83),g:Exiguobacterium(0.48),s:Exiguobacterium_acetylicum(0.24)
5f9d0909-fe7d-409d-9da8-26c2749bb0cc d:Bacteria(1.00),p:Firmicutes(1.00),c:Bacilli(1.00),o:Bacillales(1.00),g:Exiguobacterium(1.00),s:Exiguobacterium_acetylicum(0.74)
37270a98-6e0c-4130-ae8d-8c47399abcdd d:Bacteria(1.00),p:Firmicutes(1.00),c:Bacilli(1.00),o:Bacillales(0.99),f:Listeriaceae(0.60),g:Listeria(0.60),s:Exiguobacterium_acetylicum(0.25)
0aa23c22-ff54-4b20-8663-ef25a6338227 d:Bacteria(1.00),p:Proteobacteria(0.59),c:Gammaproteobacteria(0.58),o:Enterobacterales(0.57),f:Enterobacteriaceae(0.52),g:Exiguobacterium(0.36),s:Salmonella_enterica(0.29)

The first two show the lineage that I would expect for Exiguobacterium, but how did the last two go from Listeria to Exiguobacterium, and from Exiguobacterium to Salmonella?

I thought it was an error in my database at first, but I checked and confirmed that the lineages are all correct and formatted properly. At this point, I assume this is most likely a gap in my understanding of how sintax works. The bootstrap values for those two are low enough that I probably won't use them, but I'd still like to understand how this is happening.

Thanks!

torognes commented 2 months ago

Hi, thank you for reporting this issue!

This does not look right.

Although taxonomic ranks with low confidence, e.g. with values below 0.8, should not be trusted, the classifications should not jump between different clades in the tree as you go down to the species level.
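
For anyone who wants to scan their own output for this kind of jump, here is a minimal Python sketch of a consistency check. It treats a prediction as consistent when a single reference lineage contains every predicted rank:name pair. The helper names (`parse_prediction`, `parse_lineage`, `is_consistent`) are hypothetical, and the two reference lineages are reconstructed from the rows quoted in this thread for illustration only; they are not the reporter's actual database. In practice the reference lineages would be read from the `tax=` annotations in the database headers.

```python
import re

# Matches one "rank:name(confidence)" element of a sintax prediction.
# Assumption: taxon names contain no commas or parentheses.
TAXON_RE = re.compile(r"([a-z]):([^,()]+)\(([0-9.]+)\)")


def parse_prediction(field):
    """Return [(rank, name, confidence), ...] from a sintax prediction string."""
    return [(rank, name, float(conf)) for rank, name, conf in TAXON_RE.findall(field)]


def parse_lineage(text):
    """Return [(rank, name), ...] from a string such as 'd:Bacteria,p:Firmicutes'."""
    return [tuple(item.split(":", 1)) for item in text.split(",")]


def is_consistent(prediction, reference_lineages):
    """True if one reference lineage contains every predicted rank:name pair."""
    predicted = {(rank, name) for rank, name, _ in prediction}
    return any(predicted <= set(lineage) for lineage in reference_lineages)


# Illustrative reference lineages, reconstructed from rows quoted in the thread.
REFERENCE = [
    parse_lineage("d:Bacteria,p:Firmicutes,c:Bacilli,o:Bacillales,"
                  "g:Exiguobacterium,s:Exiguobacterium_acetylicum"),
    parse_lineage("d:Bacteria,p:Firmicutes,c:Bacilli,o:Bacillales,"
                  "f:Listeriaceae,g:Listeria"),
]

# Predictions copied from the report (second and third rows).
ROWS = {
    "5f9d0909": "d:Bacteria(1.00),p:Firmicutes(1.00),c:Bacilli(1.00),"
                "o:Bacillales(1.00),g:Exiguobacterium(1.00),"
                "s:Exiguobacterium_acetylicum(0.74)",
    "37270a98": "d:Bacteria(1.00),p:Firmicutes(1.00),c:Bacilli(1.00),"
                "o:Bacillales(0.99),f:Listeriaceae(0.60),g:Listeria(0.60),"
                "s:Exiguobacterium_acetylicum(0.25)",
}

if __name__ == "__main__":
    for label, field in ROWS.items():
        status = "ok" if is_consistent(parse_prediction(field), REFERENCE) else "MIXED"
        print(label, status)
```

With these two reference lineages, the second row from the report passes and the third (Listeria at genus level, Exiguobacterium_acetylicum at species level) is flagged as mixed. A read from a clade missing from the reference list would also be flagged, so the list needs to cover the whole database to be meaningful.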

I'll look deeper into the issue as soon as possible.

Could you please send me the exact command you ran?

Would it be possible to send me (a subset of) the queries and the database used? Or is it confidential?

ashleyp1 commented 2 months ago

Here is the command I used. I sent you an invite to a dropbox folder with my database and the sample I first found the issue in. Thanks for looking into this!

vsearch --sintax \
    1-filt-trimmed-HL068_FW.fastq.gz \
    --db sintax_db.fasta \
    --tabbedout 1-68_sintax.tsv \
    --sintax_cutoff 0.7 --strand both -notrunclabels

torognes commented 2 months ago

Thank you, I'll look into it. Got the data.

torognes commented 2 months ago

There was a logical bug in the selection of the best lineages. It should be fixed now in commit aa94d1c. I think it should only appear when the confidence is below 0.5, so it shouldn't matter much in most cases, although it was confusing.
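
To find which reads in output from an older release might be worth re-running, one option is to list predictions that contain any rank below 0.5. This is only a rough sketch: the 0.5 threshold comes from the estimate above and is not a guarantee, and the function names are hypothetical. A flagged read is not necessarily wrong, it only falls in the range where the bug could have appeared.

```python
import re

# Matches one "rank:name(confidence)" element of a sintax prediction.
# Assumption: taxon names contain no commas or parentheses.
TAXON_RE = re.compile(r"([a-z]):([^,()]+)\(([0-9.]+)\)")


def min_confidence(prediction_field):
    """Return the lowest bootstrap value in a prediction, or None if it is empty."""
    values = [float(conf) for _, _, conf in TAXON_RE.findall(prediction_field)]
    return min(values) if values else None


def possibly_affected(prediction_field, threshold=0.5):
    """True if any rank in the prediction has a confidence below the threshold."""
    lowest = min_confidence(prediction_field)
    return lowest is not None and lowest < threshold


if __name__ == "__main__":
    # Second and fourth rows from the report.
    high = ("d:Bacteria(1.00),p:Firmicutes(1.00),c:Bacilli(1.00),o:Bacillales(1.00),"
            "g:Exiguobacterium(1.00),s:Exiguobacterium_acetylicum(0.74)")
    low = ("d:Bacteria(1.00),p:Proteobacteria(0.59),c:Gammaproteobacteria(0.58),"
           "o:Enterobacterales(0.57),f:Enterobacteriaceae(0.52),"
           "g:Exiguobacterium(0.36),s:Salmonella_enterica(0.29)")
    print(possibly_affected(high))  # False: lowest value is 0.74
    print(possibly_affected(low))   # True: lowest value is 0.29
```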

I will make a new release soon with this fix.

Sorry for the bug and thank you very much for reporting this issue!

torognes commented 2 months ago

BTW, I'd recommend using the --sintax_random option to avoid length bias in the taxonomic classification.
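
As a sketch of what that might look like in a scripted pipeline, the snippet below rebuilds the command posted earlier in this thread and appends `--sintax_random`. The function name `build_sintax_command` is hypothetical, the file names are the ones from the original command, and the sketch assumes `--sintax_random` is a plain flag that takes no argument.

```python
import shlex


def build_sintax_command(reads, db, out, cutoff=0.7, random_ties=True):
    """Build a vsearch --sintax argument list using the options from this thread."""
    cmd = [
        "vsearch", "--sintax", reads,
        "--db", db,
        "--tabbedout", out,
        "--sintax_cutoff", str(cutoff),
        "--strand", "both",
        "--notrunclabels",  # written with two dashes here; the original used one
    ]
    if random_ties:
        # Assumption: the option is a flag with no argument.
        cmd.append("--sintax_random")
    return cmd


if __name__ == "__main__":
    cmd = build_sintax_command(
        "1-filt-trimmed-HL068_FW.fastq.gz",
        "sintax_db.fasta",
        "1-68_sintax.tsv",
    )
    # Print the command for inspection; pass the list to subprocess.run(cmd, check=True)
    # to execute it once vsearch and the input files are in place.
    print(shlex.join(cmd))
```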

torognes commented 1 month ago

The fixes are available now in release 2.29.0:

https://github.com/torognes/vsearch/releases/tag/v2.29.0