Unintuitive behaviour of spaced seeds in MMseqs2 search

tischulz1 commented 4 years ago

Dear MMseqs2 team,

I got some weird results that I could not explain myself. I hope you can help me with them.

Expected Behavior

I was expecting MMseqs2 to be more sensitive when using the default options (spaced-kmer-mode enabled and exact-kmer-matching disabled).

Current Behavior

Using MMseqs2 search with default options (spaced-kmer-mode enabled and exact-kmer-matching disabled), the program found fewer results than with spaced-kmer-mode disabled and exact-kmer-matching enabled.

Context

I thought that MMseqs2 uses spaced seeds and no exact k-mer matching to increase the sensitivity during search. I was curious to see how many alignments are found by MMseqs2 exclusively because of this. Therefore, I performed two searches with MMseqs2 search, one using spaced seeds and no exact k-mer matching and one using the opposite settings. Surprisingly, it looks like using no spaced seeds and exact k-mer matching increases the program's sensitivity, as more results are found.
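To make "found exclusively by one setting" concrete, here is a minimal Python sketch of how the two result sets could be compared pair by pair instead of by raw count. It assumes both runs were exported to tab-separated tables with the query ID in the first column and the target ID in the second (a BLAST-tab-like layout). The function names and the example rows are made up for illustration.

```python
import csv
import io


def hit_pairs(handle):
    """Collect the set of (query, target) pairs from a tab-separated hit table."""
    return {
        (row[0], row[1])
        for row in csv.reader(handle, delimiter="\t")
        if len(row) >= 2
    }


def compare_runs(default_hits, alt_hits):
    """Split the hits into shared pairs and pairs unique to each run."""
    return {
        "shared": default_hits & alt_hits,
        "only_default": default_hits - alt_hits,
        "only_alt": alt_hits - default_hits,
    }


# Made-up example rows standing in for the two exported result files.
default_run = io.StringIO("q1\tt1\nq1\tt2\nq2\tt3\n")
alt_run = io.StringIO("q1\tt1\nq2\tt3\nq2\tt4\nq3\tt5\n")

result = compare_runs(hit_pairs(default_run), hit_pairs(alt_run))
for name, pairs in result.items():
    print(name, len(pairs))
```

With real data, the `io.StringIO` objects would be replaced by opened result files. The sketch only separates the two sets; it does not say which of the exclusive hits are true homologs.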

Do you have an explanation for these results?

milot-mirdita commented 4 years ago

The raw number of results is usually not a very good indicator of sensitivity. Both of these parameters affect the false positive rate:

--exact-kmer-matching 1 will force MMseqs2 to compare every k-mer once. If we instead generate similar k-mer lists, each similar k-mer (including the original k-mer) has to reach the k-mer similarity threshold (as specified by the sensitivity parameter -s). The k-mer threshold is also corrected for composition bias. Highly biased regions will have a harder time generating similar k-mers that we deem acceptable due to possible false positives.
Disabling spaced k-mers can also raise the number of reported false positives, due to k-mer self-correlation. The MMseqs2 prefiltering algorithm needs to find two consecutive k-mers to accept a hit and pass it along to the ungapped and gapped alignment.
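The self-correlation point can be illustrated with a toy Python sketch. It counts how many sequence positions a seed shares with its own copy shifted by one residue: the more positions two neighbouring seeds share, the less independent their matches are. The masks below are hypothetical patterns of equal weight chosen for illustration. They are not the patterns MMseqs2 actually uses, and this is not the prefilter implementation.

```python
def seed_positions(mask):
    """Return the offsets sampled by a seed mask ('1' = residue is used)."""
    return [i for i, c in enumerate(mask) if c == "1"]


def extract_seeds(seq, mask):
    """Slide the mask over seq and collect the sampled residues at each start."""
    offsets = seed_positions(mask)
    span = len(mask)
    return [
        "".join(seq[start + o] for o in offsets)
        for start in range(len(seq) - span + 1)
    ]


def shared_positions(mask, shift=1):
    """Count positions sampled by both a seed and its copy shifted by `shift`."""
    original = set(seed_positions(mask))
    shifted = {o + shift for o in original}
    return len(original & shifted)


contiguous = "11111"   # hypothetical contiguous 5-mer
spaced = "1101011"     # hypothetical spaced pattern, also 5 informative positions

print(extract_seeds("ACDEFGHIK", contiguous))
print(extract_seeds("ACDEFGHIK", spaced))
print(shared_positions(contiguous), shared_positions(spaced))
```

In this toy setup the contiguous seed shares 4 of its 5 positions with its neighbour, while the spaced seed shares only 2, so two neighbouring spaced-seed matches carry more independent evidence.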

We took a lot of care to control for false positives. Controlling for FPs is especially important to us since we also do iterative profile searches and building profiles with false positives included heavily degrades sensitivity.

If you want to find all exact matches, you could try the map workflow, which disables all FP controlling parameters.

tischulz1 commented 4 years ago

Thanks for the clarification. I will have a look at the map workflow.

milot-mirdita commented 4 years ago

No problem, please reopen the issue if any questions remain.

If you want a set of stickers (see https://twitter.com/thesteinegger/status/1201076220957315074), send me your address to milot at mirdita de.
