soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.39k stars 195 forks source link

No match between two nucleotide sequences despite high identity #327

Open padix-key opened 4 years ago

padix-key commented 4 years ago

I have two nucleotide sequences: The query is an ORF, the target is a truncated genome sequence, containing an homologous ORF with a high identity to the query.

Despite the high similarity, mmseqs easy-search does not find any matches, even at high sensitivity (7.5). If I set the k-mer length to 5, a fitting match is found, though the alignment does not span the entire ORF, but misses the first 5 bases in both sequences.

Expected Behavior

An alignment like this (Please ignore the sequence numbering):

sphx_glr_sw_genome_search_001

Current Behavior

  1. no k-mer setting -> no match
  2. -k = 5 -> match with 5 bases missing in the beginning
    query   target  77  0.792   6.633E-19   6   87  6   87  TCACATCGTTCGCTTTATCGGTCTACTACTACTAAACGCATCTTCTTTGCGCGGTAGACGAGTGAGCGGCATCCAGCATTAA  ACATATCGTTCGTTTCACTGGGCTACTACTACTCAACGCATTTATTGTGCGCGGTAGACCGGTGGGCGGCATTCAACATTAA

Steps to Reproduce (for bugs)

mmseqs easy-search query.fasta target.fasta results.out /tmp --format-output query,target,raw,pident,evalue,qstart,qend,tstart,tend,qaln,taln --search-type 3 -s 7.5

and

mmseqs easy-search query.fasta target.fasta results.out /tmp --format-output query,target,raw,pident,evalue,qstart,qend,tstart,tend,qaln,taln --search-type 3 -s 7.5 -k 5

with query.fasta:

>query
ATGACTCACATCGTTCGCTTTATCGGTCTACTACTACTAAACGCATCTTCTTTGCGCGGTAGACGAGTGAGCGGCATCCA
GCATTAA

target.fasta:

>target
ATGTCACATATCGTTCGTTTCACTGGGCTACTACTACTCAACGCATTTATTGTGCGCGGTAGACCGGTGGGCGGCATTCA
ACATTAAGTCAGCTCGAAGTCAAACAAAACCCGCGCCGTTGCGCGGGTTTTTTTATGCCTGACGCAAGGCGCCCCTGGAG
ACAAGGACCACATCATGAGCCAGCAAG

MMseqs Output

  1. https://gist.github.com/padix-key/e34c9ddd4a1d59d9cbc84b9ef4bbd20e
  2. https://gist.github.com/padix-key/c69d4830ab77a29f5ea41fbb0eb3a0db

Your Environment

milot-mirdita commented 4 years ago

MMseqs2 uses by default spaced double consecutive 15-mer matches for prefiltering in the nucleotide search. Since they are spaced (i.e. are actually 22-mers with 15 informative and 7 ignored position) we can still find non-exact matches.

For your sequences this is still much to long of a k-mer and it did not stand a chance to find a match. The 5-mers (actually 11 mer with 5 informative and 6 ignored positions) were able to find this, however this will increase the number of false positive matches that have to rejected by the alignment stage by a lot and therefore slow down the search by quite a bit.

You can play with k-mers and k-mer pattern parameters (-k, --spaced-kmer-mode, --spaced-kmer-pattern).

-s doesn't do anything for nucleotide searches we don't have real substitution matrices so we cannot generate similar k-mers.

You might want to do a translated-translated search (--search-type 2) though, there you will probably be able to find a lot more since synonymous mutations disappear and we can generate similar k-mers (-s sensitivity parameter).

I am not 100% sure why the alignment is truncated though since I am not super familiar with the nucleotide search code, but It might be to the --zdrop parameter which stops nucleotide alignments early if they become to bad. See the documentation of minimap2 for how it works (https://lh3.github.io/minimap2/minimap2.html, we use the alignment algorithm implemented by minimap2).

padix-key commented 4 years ago

Thank you for your response.

-s doesn't do anything for nucleotide searches we don't have real substitution matrices so we cannot generate similar k-mers.

I think the wiki misses this detail. The mmseqs manual even suggests that a substitution matrix is used for nucleotide k-mer generation (the part with nucl:nucleotide.out):

--seed-sub-mat MAT           Substitution matrix file for k-mer generation [nucl:nucleotide.out,aa:VTML80.out]

but It might be to the --zdrop parameter which stops nucleotide alignments early

Unfortunately, increasing --zdrop does not fix the truncation. Actually, extending the alignment to the left should even increase the score because the next position would contain a match (C with C).