josemduarte closed this issue 4 years ago
MMseqs2 only considers hits with two k-mers on the same diagonal. By default, MMseqs2 uses spaced k-mers of length 13 (mask: 11010110011), so it can only find hits that are at least 14 residues long. You can turn off spaced k-mers with `--spaced-kmer-mode 0`, which makes it possible to detect sequences as short as 8 residues. You could also define your own, more compact spaced pattern using `--spaced-kmer-pattern`. Another option is to decrease the k-mer length using `-k`.
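The length arithmetic behind these numbers can be sketched in a few lines of Python. This is only an illustration of the rule described above (two k-mer matches on one diagonal, offset by one position, need a window one residue longer than the pattern), not MMseqs2's actual prefilter code. The function names and the 13-wide pattern are hypothetical and chosen for the example.

```python
def pattern_span(pattern: str) -> int:
    """Number of residues covered by one (possibly spaced) k-mer pattern.

    The pattern is a string of '1' (position is used) and '0' (position is
    skipped); it is expected to start and end with '1'.
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 0s and 1s")
    if pattern[0] != "1" or pattern[-1] != "1":
        raise ValueError("pattern must start and end with 1")
    return len(pattern)


def min_hit_length(pattern: str) -> int:
    """Shortest stretch that can hold two consecutive k-mers on one diagonal.

    Assumption: the two k-mer matches are shifted by a single position,
    so the window is the pattern span plus one.
    """
    return pattern_span(pattern) + 1


contiguous_7mer = "1" * 7                 # spaced k-mers switched off, k = 7
hypothetical_spaced = "1101010010011"     # illustrative 13-wide pattern, 7 used positions

print(min_hit_length(contiguous_7mer))      # 8
print(min_hit_length(hypothetical_spaced))  # 14
```

Under the same assumption, lowering the k-mer length with `-k` shrinks the contiguous window further, for example to 6 residues for k = 5.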
Changing the substitution matrix also helps to detect shorter sequences. You might want to look into the publication "Selecting the Right Similarity-Scoring Matrix" by Pearson et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848038/).
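To show how these options might be combined on one command line, here is a small Python sketch that assembles an `mmseqs easy-search` call. The helper name, the file names, and the concrete values (k = 5, an E-value of 1000, `PAM30.out` as the matrix passed to `--sub-mat`) are assumptions chosen for illustration rather than recommended settings; check `mmseqs easy-search -h` for the values your version accepts.

```python
import shlex


def build_short_query_search(query, target, result, tmp_dir, *,
                             spaced=False, kmer_size=None,
                             evalue=None, sub_mat=None):
    """Assemble an `mmseqs easy-search` command tuned for short queries.

    Only builds the argument list; nothing is executed here.
    """
    cmd = ["mmseqs", "easy-search", query, target, result, tmp_dir]
    if not spaced:
        # Switch off spaced k-mers so shorter matches can seed a hit.
        cmd += ["--spaced-kmer-mode", "0"]
    if kmer_size is not None:
        cmd += ["-k", str(kmer_size)]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    if sub_mat is not None:
        # Assumed to name a matrix file that the installed MMseqs2 can find.
        cmd += ["--sub-mat", sub_mat]
    return cmd


# Hypothetical file names and parameter values.
cmd = build_short_query_search(
    "peptides.fasta", "targets.fasta", "hits.m8", "tmp",
    kmer_size=5, evalue=1000, sub_mat="PAM30.out",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where MMseqs2 is installed.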
I'm finding it difficult to get hits for short sequences (around 10 amino acids in length). I've already raised the E-value cutoff (`-e`), which helped to include more hits. However, I'm still not getting as many hits as I would expect.
For BLAST, the strategy is to use other substitution matrices for short sequences; see:
https://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html
Would that strategy also apply to MMseqs2? Or is there another recommended strategy?