soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

2bLCA and top hit (--lca-mode) differ in search sensitivity #465

Open apcamargo opened 3 years ago

apcamargo commented 3 years ago

I'm comparing MMseqs2 taxonomic assignment with approximate 2bLCA and with top hit, and noticed that the latter approach classifies more genes than the former. I extracted the alignments using --extract-lines 1, and the top hit search had more hits to the database. All parameters were the same except for --lca-mode.

Example:

mmseqs taxonomy querydb/querydb gtdb_r202/gtdb_r202 taxonomydb/taxonomydb tmp -s 3.0 --lca-mode 3 --tax-output-mode 2 --threads 64

Is this behavior expected? If so, what is causing the difference?

I'm using release 13-45111.

Thanks!

liubovch commented 1 year ago

Hi! I have the same question: why do we see this difference?

In the manual, the only description I can find for --lca-mode 4 is: "the lowest common ancestor of all equal scoring top hits". Is "top hit" the same as "best hit"? If so, it looks to me like the realignment step is skipped. But then it looks the same as the "single search LCA" (--lca-mode 1). Could you please elaborate a bit on the different modes?
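
To make the quoted description concrete, here is a minimal sketch of one reading of "the lowest common ancestor of all equal scoring top hits". It is a toy illustration only, not MMseqs2's implementation: the taxonomy, the scores, and the function names (`lineage`, `lca`, `top_hit_lca`) are all hypothetical, and it assumes that "top hits" means every hit tied for the best score.

```python
# Hypothetical toy taxonomy: child -> parent; the root points to itself.
PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus": "Firmicutes",
}


def lineage(taxon):
    """Return the path from `taxon` up to the root, starting with `taxon`."""
    path = [taxon]
    while PARENT[taxon] != taxon:
        taxon = PARENT[taxon]
        path.append(taxon)
    return path


def lca(taxa):
    """Lowest common ancestor of a non-empty list of taxa."""
    taxa = list(taxa)
    common = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        common &= set(lineage(taxon))
    # Walking up from the first taxon, the first shared node is the LCA.
    for node in lineage(taxa[0]):
        if node in common:
            return node


def top_hit_lca(hits):
    """`hits` is a list of (taxon, score); take the LCA of all hits tied for the best score."""
    best = max(score for _, score in hits)
    return lca([taxon for taxon, score in hits if score == best])


# Two hits tie for the best score, so the assignment moves up to their shared ancestor.
tied = [("Escherichia", 250), ("Salmonella", 250), ("Bacillus", 180)]
print(top_hit_lca(tied))    # Proteobacteria

# A single best hit is assigned directly.
single = [("Escherichia", 250), ("Bacillus", 180)]
print(top_hit_lca(single))  # Escherichia
```

Under this reading, a query gets the taxon of its single best hit unless several hits tie for the best score, in which case it gets their common ancestor.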

Thanks in advance,