Closed martin-hartmann closed 2 years ago
Hi, thanks for the suggestion.
Yes, this is something we will consider implementing. I'll look into the details. It appears to be relatively simple to implement based on the existing code.
This is great and so much needed. Thanks a lot.
Dear Torbjørn
Hope all is well. Any update on this topic? It would be great to have this option included.
Kind regards, Martin
I am sorry but haven't had time to work on this yet. I'll have more time for development this autumn, and it's one of the features I plan to implement.
Thank you very much!
I've now added an option called --lcaout
that I hope will satisfy your needs. It is is in the latest commit, but not in the release yet.
With this option one can specify a file to which taxonomic information about the hits of a search using --usearch_global
(or search_exact
) is written. It is in a two column format where the query id is found in the first column and the taxonomic lineage is found in the second column. It can be used when searching a database that contains taxonomic information similar to that used with the sintax command, e.g. tax=k:Archaea,p:Euryarchaeota,c:Halobacteria
. The accepted taxonomic ranks are indicated with the letters d k p c o f g s
. In the second column of the output file the taxonomic ranks assigned to each query is shown, e.g. k:Archaea,p:Euryarchaeota
. Only the ranks that are fully identical across all matches for each query are included.
For this to produce interesting results the --maxaccepts
parameter needs to be different from 1. One also could consider using the --top_hits_only
option to only consider the equally best matching hits. The --id
option is also important, and one could also change the definition using the --iddef
option. The right combination of these parameters will depend on the circumstances.
Any feedback is highly appreciated.
Fantastic news. Thanks for implementing it. I tested it and it seesm to work well. At this point, I see only one very useful option. Could we implement a threshold regarding how many hits at a given rank have to be identical to be counted? I have scenarios where almost all hits have the same assignment at a given rank, but only 1 or 2 sequences not. Usually these are poorly annotated sequences, e.g. "d:Bacteria" without any further rank or sometimes maybe wrongly labeled sequences. I think it would be useful to, for example, read out a rank if 95% of all the reads are identical at this rank.
What do you think? Thanks again!!!
I was thinking to add an option like that, and I have included it now in the latest commit. You can specify the fraction of matching hits required with the --lca_cutoff
option. By default the threshold is 1.0. It must be larger than 0.5.
Excellent. It works as it should. Thanks for implementing all of this. Kind regards, Martin
Feature added in version 2.19.0 just released.
Sorry to open this again, Torbjørn. The --top-hits-only
option is not working when producing the --lcaout
output. Still all hits (above the specified identity level) are considered and not only the equally best matching hits. I use the latest version (2.21.1). Would this be an easy fix in the code?
Thanks a lot, Martin
Thanks for reporting this problem. I am looking into it.
The problem with --top_hits_only
should be fixed in commit ef7f0c4959a9115df862fcc11402625f69699b65. I also found a bug that could cause wrong results to be potentially reported when using --lca_cutoff
. Previously vsearch would compare the taxonomy of the first accepted sequence to the others and keep the taxonomy levels that where similar in the requested fraction of the matches. However, if the first accepted sequence was not a good representative and had limited taxonomic information, the result would not be accurate. With the new code, this problem should be gone as it first finds the taxonomic names at the different levels that are in majority and subsequently checks if they are found in the required fraction of the matches.
Fixed in version 2.21.2 just released.
Excellent, thanks for fixing this so rapidly. I just tested it and it works as it should in my example. Greatly appreciated.
Dear Torbjørn
Would it be possible to implement an LCA (last common ancestor) approach to the output of usearch_global that is run against a taxonomic reference databases (e.g. in the same format as used for sintax).
Using the SINTAX classifier sometimes yields very strange results when comparing it to the sequence hits against the DB using usearch_global. Thus, another classiferi based on LCA of the usearch_global output could be a great addition.
Thanks for considering it. Kind regards, Martin