How to deal with short sequences

soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite

GNU General Public License v3.0

MMseqs2 only considers hits with two k-mers on the same diagonal. By default, MMseqs2 uses spaced k-mers of length 13 (mask: 11010110011), so it can only find hits that are at least 14 residues long. Turning off spaced k-mers with --spaced-kmer-mode 0 makes it possible to detect sequences as short as 8 residues. You could also define your own, more compact spaced pattern using --spaced-kmer-pattern. Another option is to decrease the k-mer length using -k.
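
The arithmetic behind the 14 and 8 residue figures can be sketched roughly as follows. This is only an illustration, not MMseqs2 code: min_hit_length is a hypothetical helper, it assumes the two k-mers start at adjacent positions on the diagonal, and the non-spaced k-mer length of 7 is an assumption inferred from the 8-residue figure.

```python
def min_hit_length(kmer_span: int, kmers_on_diagonal: int = 2) -> int:
    """Shortest stretch covered by k-mers on one diagonal.

    Assumption for this sketch: the k-mers start at adjacent positions,
    so each extra k-mer extends the covered stretch by one residue.
    """
    if kmer_span < 1 or kmers_on_diagonal < 1:
        raise ValueError("kmer_span and kmers_on_diagonal must be positive")
    return kmer_span + kmers_on_diagonal - 1


# Spaced k-mer spanning 13 positions, as stated in the answer.
spaced_min = min_hit_length(13)

# Contiguous k-mer; k = 7 is an assumed value, not taken from the answer.
contiguous_min = min_hit_length(7)

print(spaced_min)      # 14
print(contiguous_min)  # 8
```

Under the same assumptions, a more compact spaced pattern or a smaller -k lowers the span and therefore the shortest hit that can be found.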

Changing the substitution matrix also helps to detect shorter sequences. You might want to look into the publication "Selecting the Right Similarity-Scoring Matrix" by Pearson (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848038/).

How to deal with short sequences #125