Open mcn3159 opened 6 months ago
The trap is likely the sequence identity estimation (see https://github.com/soedinglab/MMseqs2/wiki#how-does-mmseqs2-compute-the-sequence-identity). Adding -a or --alignment-mode 3 fixes the issue. easy-search better detects when exact sequence identity is required, whereas search uses the sequence identity estimation by default and tries to detect when the exact value is needed.
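The suggested fix can be sketched as follows. The original commands are not shown in this issue, so this is only an assumed workflow: the FASTA names come from the attachment, while queryDB, targetDB, resultDB, tmp and result.m8 are placeholder names, and the `run` helper is a hypothetical dry-run wrapper that prints each command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: rerun the search with exact sequence identity computation.
# queryDB, targetDB, resultDB, tmp and result.m8 are placeholder names.
QUERY_FASTA="query_subset.faa"
TARGET_FASTA="406_subset.faa"

# Hypothetical helper: print the command by default, execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Build the sequence databases from the two FASTA files.
run mmseqs createdb "$QUERY_FASTA" queryDB
run mmseqs createdb "$TARGET_FASTA" targetDB

# --alignment-mode 3 (or, alternatively, -a) is the fix suggested in the reply.
run mmseqs search queryDB targetDB resultDB tmp --alignment-mode 3

# Convert the result database into a tab-separated table.
run mmseqs convertalis queryDB targetDB resultDB result.m8
```

Running the script as is only prints the four commands; setting RUN=1 executes them, provided mmseqs is on the PATH.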
Expected Behavior
Searching proteins against a database with similar and exact proteins (from bacterial refseq proteome) should return hits with similar and exact matches.
Current Behavior
Running mmseqs search returns few to no hits, whereas easy-search returns many more hits (the expected amount).
Steps to Reproduce (for bugs)
For mmseqs search:
For mmseqs easy-search:
MMseqs Output (for bugs)
MMseqs search output: https://gist.github.com/mcn3159/9a5ed05852e2e83b8656d25f0333a8f3
Context
I am searching a FASTA file of known bacterial proteins against the bacterial RefSeq WP proteome. I noticed that only half of my original virulence proteins (out of ~8000) had hits against RefSeq. The RefSeq proteome is large, so I put together a minimal example in which there is an exact match (as well as similar matches, according to easy-search) between the target and query databases that mmseqs search does not find but easy-search does.
I can provide the larger fastas if more examples to replicate are necessary.
The attached .zip file contains 2 FASTA files, each with 4 proteins. One of those proteins is an exact match (same WP_number), and 2 proteins (WP_000633131.1 and WP_000633136.1) are very similar to the protein with the exact match.
fastas_to_search.zip
query fasta = query_subset.faa
target fasta = 406_subset.faa
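To compare the two runs on this minimal example, a small helper along these lines could count how many queries got any hit and list hits where query and target share the same accession. It is a sketch under assumptions: `summarize_hits` and `result.m8` are hypothetical names, and the input is assumed to be a tab-separated table with the query ID in column 1 and the target ID in column 2.

```shell
# Sketch only: summarize a tab-separated hit table.
# Assumes column 1 is the query ID and column 2 is the target ID.
summarize_hits() {
    m8="${1:-result.m8}"   # result.m8 is a placeholder file name

    # Number of distinct queries with at least one hit.
    n=$(cut -f1 "$m8" | sort -u | wc -l | tr -d ' ')
    printf 'queries_with_hits\t%s\n' "$n"

    # Hits where query and target carry the same ID (same WP_number).
    awk -F'\t' '$1 == $2 { print "same_id\t" $1 }' "$m8"
}

# Usage (after producing a table from each run):
#   summarize_hits result.m8
```

If the exact-match protein is missing from the same_id lines for one run but present for the other, that reproduces the discrepancy described above.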
Your Environment
Include as many relevant details about the environment you experienced the bug in.