soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

How to deal with short sequences #125

Closed josemduarte closed 4 years ago

josemduarte commented 6 years ago

I'm finding it difficult to get hits for short sequences (around 10 amino acids in length). I've already raised the E-value cutoff (-e), which helped include more hits. However, I'm still not getting as many hits as I would expect.

For BLAST, the strategy is to use other substitution matrices for short sequences, see:

https://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html

Would that strategy also apply to MMseqs2? Or is there another recommended strategy?

martin-steinegger commented 4 years ago

MMseqs2 considers only hits with two k-mers on the same diagonal. By default, MMseqs2 uses spaced k-mers of length 13 (mask: 11010110011). This means that we can only find hits that are at least 14 residues long. It is possible to turn off spaced k-mers with --spaced-kmer-mode 0, which makes it possible to detect sequences as short as 8 residues. You could also define your own, more compact spaced pattern using --spaced-kmer-pattern. Another option is to decrease the k-mer length using -k.
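
As a rough illustration of the reasoning above, here is a minimal Python sketch. It rests on assumptions that are not stated in this thread: the helper names are hypothetical, the "span + 1" rule assumes the two k-mer matches start at adjacent positions on the diagonal (which is what the 13 → 14 and 8-residue figures imply), the file names are placeholders, and the command uses the `mmseqs easy-search` workflow with example values for -k and -e.

```python
import shlex


def min_hit_length(kmer_span: int) -> int:
    """Shortest hit covered by two k-mer matches on one diagonal.

    Assumption: the two matches start at adjacent positions, so together
    they cover the k-mer span plus one extra residue.
    """
    return kmer_span + 1


def build_search_command(query, target, result, tmp_dir,
                         kmer_size=None, spaced=False, evalue=None):
    """Assemble an argument list for `mmseqs easy-search` (hypothetical helper)."""
    cmd = ["mmseqs", "easy-search", query, target, result, tmp_dir]
    if not spaced:
        # Turn off spaced k-mers, as suggested above.
        cmd += ["--spaced-kmer-mode", "0"]
    if kmer_size is not None:
        cmd += ["-k", str(kmer_size)]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    return cmd


print(min_hit_length(13))  # spaced k-mer spanning 13 positions -> 14
print(min_hit_length(7))   # contiguous k-mer of length 7 (assumed) -> 8

# Placeholder file names; -k 6 and -e 1000 are example values only.
cmd = build_search_command("peptides.fasta", "target.fasta", "hits.m8", "tmp",
                           kmer_size=6, evalue=1000)
print(shlex.join(cmd))
```

The last line only prints the command, so it can be inspected before running it, for example with `subprocess.run(cmd, check=True)`. Which -k and -e values work best for very short peptides has to be found by experiment.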

Changing the substitution matrix also helps to detect shorter sequences. You might want to look into the publication "Selecting the Right Similarity-Scoring Matrix" by Pearson (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848038/).