Implementation of usearch_global LCA?

martin-hartmann commented 3 years ago

Dear Torbjørn

Would it be possible to implement an LCA (last common ancestor) approach to the output of usearch_global that is run against a taxonomic reference databases (e.g. in the same format as used for sintax).

Using the SINTAX classifier sometimes yields very strange results when comparing it to the sequence hits against the DB using usearch_global. Thus, another classiferi based on LCA of the usearch_global output could be a great addition.

Thanks for considering it. Kind regards, Martin

torognes commented 3 years ago

Hi, thanks for the suggestion.

Yes, this is something we will consider implementing. I'll look into the details. It appears to be relatively simple to implement based on the existing code.

martin-hartmann commented 3 years ago

This is great and so much needed. Thanks a lot.

martin-hartmann commented 3 years ago

Dear Torbjørn

Hope all is well. Any update on this topic? It would be great to have this option included.

Kind regards, Martin

torognes commented 3 years ago

I am sorry but haven't had time to work on this yet. I'll have more time for development this autumn, and it's one of the features I plan to implement.

martin-hartmann commented 3 years ago

Thank you very much!

torognes commented 3 years ago

I've now added an option called --lcaout that I hope will satisfy your needs. It is is in the latest commit, but not in the release yet.

With this option one can specify a file to which taxonomic information about the hits of a search using --usearch_global (or search_exact) is written. It is in a two column format where the query id is found in the first column and the taxonomic lineage is found in the second column. It can be used when searching a database that contains taxonomic information similar to that used with the sintax command, e.g. tax=k:Archaea,p:Euryarchaeota,c:Halobacteria. The accepted taxonomic ranks are indicated with the letters d k p c o f g s. In the second column of the output file the taxonomic ranks assigned to each query is shown, e.g. k:Archaea,p:Euryarchaeota. Only the ranks that are fully identical across all matches for each query are included.

For this to produce interesting results the --maxaccepts parameter needs to be different from 1. One also could consider using the --top_hits_only option to only consider the equally best matching hits. The --id option is also important, and one could also change the definition using the --iddef option. The right combination of these parameters will depend on the circumstances.

Any feedback is highly appreciated.

martin-hartmann commented 3 years ago

Fantastic news. Thanks for implementing it. I tested it and it seesm to work well. At this point, I see only one very useful option. Could we implement a threshold regarding how many hits at a given rank have to be identical to be counted? I have scenarios where almost all hits have the same assignment at a given rank, but only 1 or 2 sequences not. Usually these are poorly annotated sequences, e.g. "d:Bacteria" without any further rank or sometimes maybe wrongly labeled sequences. I think it would be useful to, for example, read out a rank if 95% of all the reads are identical at this rank.

What do you think? Thanks again!!!

torognes commented 3 years ago

I was thinking to add an option like that, and I have included it now in the latest commit. You can specify the fraction of matching hits required with the --lca_cutoff option. By default the threshold is 1.0. It must be larger than 0.5.

martin-hartmann commented 3 years ago

Excellent. It works as it should. Thanks for implementing all of this. Kind regards, Martin

torognes commented 2 years ago

Feature added in version 2.19.0 just released.

martin-hartmann commented 2 years ago

Sorry to open this again, Torbjørn. The --top-hits-only option is not working when producing the --lcaout output. Still all hits (above the specified identity level) are considered and not only the equally best matching hits. I use the latest version (2.21.1). Would this be an easy fix in the code? Thanks a lot, Martin

torognes commented 2 years ago

Thanks for reporting this problem. I am looking into it.

torognes commented 2 years ago

The problem with --top_hits_only should be fixed in commit ef7f0c4959a9115df862fcc11402625f69699b65. I also found a bug that could cause wrong results to be potentially reported when using --lca_cutoff. Previously vsearch would compare the taxonomy of the first accepted sequence to the others and keep the taxonomy levels that where similar in the requested fraction of the matches. However, if the first accepted sequence was not a good representative and had limited taxonomic information, the result would not be accurate. With the new code, this problem should be gone as it first finds the taxonomic names at the different levels that are in majority and subsequently checks if they are found in the required fraction of the matches.

torognes commented 2 years ago

Fixed in version 2.21.2 just released.

martin-hartmann commented 2 years ago

Excellent, thanks for fixing this so rapidly. I just tested it and it works as it should in my example. Greatly appreciated.

torognes / vsearch

Implementation of usearch_global LCA? #441