soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

MMseqs search not finding exact and close-exact hits #842

Open mcn3159 opened 5 months ago

mcn3159 commented 5 months ago

Expected Behavior

Searching proteins against a database with similar and exact proteins (from bacterial refseq proteome) should return hits with similar and exact matches.

Current Behavior

Running mmseqs search returns few to no hits. However, easy-search outputs many more hits (the expected number).

Steps to Reproduce (for bugs)

For mmseqs search:

For mmseqs easy-search:

MMseqs Output (for bugs)

MMseqs search output: https://gist.github.com/mcn3159/9a5ed05852e2e83b8656d25f0333a8f3

Context

I am searching a FASTA of known bacterial proteins against the bacterial RefSeq WP proteome. I noticed that only half of my original virulence proteins (out of ~8000) had hits against RefSeq. The RefSeq proteome is large, so I found a minimal example with an exact match (as well as similar matches, according to easy-search) between the target and query databases that mmseqs search doesn't seem to find, but easy-search does.

I can provide the larger fastas if more examples to replicate are necessary.

There are 2 FASTAs in the attached .zip file, each containing 4 proteins. One of those proteins is an exact match (same WP_number), and 2 proteins (WP_000633131.1 and WP_000633136.1) are very similar to the protein with the exact match.

Attachment: fastas_to_search.zip (query fasta = query_subset.faa, target fasta = 406_subset.faa)

milot-mirdita commented 5 months ago

The trap is likely the sequence identity estimation (see https://github.com/soedinglab/MMseqs2/wiki#how-does-mmseqs2-compute-the-sequence-identity).

Adding -a or --alignment-mode 3 fixes the issue. easy-search is better at detecting when exact sequence identity is required, whereas search performs the sequence identity estimation by default and tries to detect when the exact value is needed.
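
As a rough sketch of the suggested fix, the search workflow could be run with -a so that sequence identities come from the actual alignment rather than the estimate. The FASTA names are taken from the attachment above; the database names (queryDB, targetDB, resultDB), the tmp directory, the output name result.m8, and the MMSEQS override variable are assumptions for illustration, not the reporter's original commands.

```shell
#!/bin/sh
# Sketch only: database names, tmp directory and output file are hypothetical.
# Set MMSEQS=echo to print the commands instead of running them.
MMSEQS="${MMSEQS:-mmseqs}"

search_with_backtrace() {
    query_fasta="$1"
    target_fasta="$2"
    out="$3"
    # Build sequence databases from the two FASTA files.
    "$MMSEQS" createdb "$query_fasta" queryDB &&
    "$MMSEQS" createdb "$target_fasta" targetDB &&
    # -a stores the alignment backtrace, as suggested in the comment above.
    "$MMSEQS" search queryDB targetDB resultDB tmp -a &&
    # Convert the result database to a tabular file.
    "$MMSEQS" convertalis queryDB targetDB resultDB "$out"
}

# Run only when mmseqs is installed and both input files are present.
if command -v "$MMSEQS" >/dev/null 2>&1 && [ -f query_subset.faa ] && [ -f 406_subset.faa ]; then
    search_with_backtrace query_subset.faa 406_subset.faa result.m8
fi
```

Replacing -a with --alignment-mode 3 in the search step is the other option mentioned above.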