Closed tischulz1 closed 4 years ago
The raw number of results is usually not a very good indicator of sensitivity. Both of these parameters affect the false positive rate. --exact-kmer-matching 1 will force MMseqs2 to compare every k-mer once. If we instead generate similar k-mer lists, each similar k-mer (including the original k-mer) has to reach the k-mer similarity threshold (as specified by the sensitivity parameter -s). The k-mer threshold is also corrected for composition bias, so highly biased regions will have a harder time generating similar k-mers that we deem acceptable, because of possible false positives.
We took a lot of care to control for false positives. Controlling for FPs is especially important to us since we also do iterative profile searches, and building profiles with false positives included heavily degrades sensitivity.
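To illustrate why the original k-mer itself can be dropped when similar k-mer lists are generated, here is a minimal sketch of threshold-based k-mer list generation. It is not MMseqs2's implementation: the three-letter alphabet, the scores, the threshold values, and the function names are all hypothetical, and the composition bias correction is left out.

```python
from itertools import product

# Toy alphabet and substitution scores (hypothetical, not MMseqs2's matrix).
ALPHABET = "ALW"
SELF_SCORE = {"A": 4, "L": 4, "W": 11}  # score of a residue against itself
MISMATCH = -2                           # score of any two different residues


def kmer_score(query, candidate):
    """Sum the per-position substitution scores of two equal-length k-mers."""
    return sum(
        SELF_SCORE[q] if q == c else MISMATCH
        for q, c in zip(query, candidate)
    )


def similar_kmers(query, threshold):
    """Return every k-mer whose score against `query` reaches `threshold`.

    The query k-mer is treated like any other candidate, so it is only
    included if its self-score reaches the threshold too.
    """
    candidates = ("".join(c) for c in product(ALPHABET, repeat=len(query)))
    return sorted(c for c in candidates if kmer_score(query, c) >= threshold)


print(similar_kmers("WWW", 15))  # exact k-mer plus single-mismatch variants
print(similar_kmers("AAA", 15))  # self-score is 12, so nothing passes
```

In this toy setting the k-mer AAA scores only 12 against itself, so with a threshold of 15 it produces no seeds at all, not even an exact one. Exact k-mer matching would still look it up, which is one way the exact mode can report more hits without being more sensitive in the sense that matters.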
If you want to find all exact matches, you could try the map workflow, which disables all FP controlling parameters.
Thanks for the clarification. I will have a look at the map workflow.
No problem, please reopen the issue if any questions remain.
If you want a set of stickers (see https://twitter.com/thesteinegger/status/1201076220957315074), send me your address to milot at mirdita de.
Dear MMseqs2 team,
I got some weird results which I could not explain by myself. I hope you can help me with them.
Expected Behavior
I was expecting MMseqs2 to be more sensitive when using default options (spaced-kmer-mode enabled and exact-kmer-matching disabled).
Current Behavior
Using MMseqs2 search with default options (spaced-kmer-mode enabled and exact-kmer-matching disabled), the program found fewer results than with spaced-kmer-mode disabled and exact-kmer-matching enabled.
Context
I thought that MMseqs2 uses spaced seeds and no exact k-mer matching to increase the sensitivity during search. I was curious to see how many alignments are found by MMseqs2 exclusively because of this. Therefore, I performed two searches with MMseqs2 search, one using spaced seeds and no exact k-mer matching and one with the opposite settings. Surprisingly, it looks like using no spaced seeds and exact k-mer matching increases the program's sensitivity, as more results are found.
Do you have an explanation for these results?
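For anyone repeating this comparison, a rough sketch of how the two result sets could be compared is shown here. It assumes both searches were exported to a tab-separated, BLAST-like format with the query ID in the first column and the target ID in the second; the function names and the file names in the usage note are hypothetical.

```python
def read_pairs(lines):
    """Collect (query, target) pairs from tab-separated result lines."""
    pairs = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        pairs.add((fields[0], fields[1]))
    return pairs


def compare_hits(default_lines, exact_lines):
    """Split hits into shared ones and those found by only one search."""
    default_hits = read_pairs(default_lines)
    exact_hits = read_pairs(exact_lines)
    return {
        "shared": default_hits & exact_hits,
        "only_default": default_hits - exact_hits,
        "only_exact": exact_hits - default_hits,
    }
```

With two exported files, say default.m8 and exact.m8 (placeholder names), this could be called as compare_hits(open("default.m8"), open("exact.m8")). Looking at the pairs in only_exact, rather than at the totals, gives a better idea of whether the extra hits are plausible homologs or likely false positives.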