No match between two nucleotide sequences despite high identity

padix-key commented 4 years ago

I have two nucleotide sequences: The query is an ORF, the target is a truncated genome sequence, containing an homologous ORF with a high identity to the query.

Despite the high similarity, mmseqs easy-search does not find any matches, even at high sensitivity (7.5). If I set the k-mer length to 5, a fitting match is found, though the alignment does not span the entire ORF, but misses the first 5 bases in both sequences.

Expected Behavior

An alignment like this (Please ignore the sequence numbering):

sphx_glr_sw_genome_search_001

Current Behavior

no k-mer setting -> no match

-k = 5 -> match with 5 bases missing in the beginning

query   target  77  0.792   6.633E-19   6   87  6   87  TCACATCGTTCGCTTTATCGGTCTACTACTACTAAACGCATCTTCTTTGCGCGGTAGACGAGTGAGCGGCATCCAGCATTAA  ACATATCGTTCGTTTCACTGGGCTACTACTACTCAACGCATTTATTGTGCGCGGTAGACCGGTGGGCGGCATTCAACATTAA

Steps to Reproduce (for bugs)

mmseqs easy-search query.fasta target.fasta results.out /tmp --format-output query,target,raw,pident,evalue,qstart,qend,tstart,tend,qaln,taln --search-type 3 -s 7.5

and

mmseqs easy-search query.fasta target.fasta results.out /tmp --format-output query,target,raw,pident,evalue,qstart,qend,tstart,tend,qaln,taln --search-type 3 -s 7.5 -k 5

with query.fasta:

>query
ATGACTCACATCGTTCGCTTTATCGGTCTACTACTACTAAACGCATCTTCTTTGCGCGGTAGACGAGTGAGCGGCATCCA
GCATTAA

target.fasta:

>target
ATGTCACATATCGTTCGTTTCACTGGGCTACTACTACTCAACGCATTTATTGTGCGCGGTAGACCGGTGGGCGGCATTCA
ACATTAAGTCAGCTCGAAGTCAAACAAAACCCGCGCCGTTGCGCGGGTTTTTTTATGCCTGACGCAAGGCGCCCCTGGAG
ACAAGGACCACATCATGAGCCAGCAAG

MMseqs Output

Your Environment

Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 11.e1a1c
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): Conda
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): SSE4.1 and AVX2 is supported, 16 GB RAM
Operating system and version: Ubuntu 18.04

milot-mirdita commented 4 years ago

MMseqs2 uses by default spaced double consecutive 15-mer matches for prefiltering in the nucleotide search. Since they are spaced (i.e. are actually 22-mers with 15 informative and 7 ignored position) we can still find non-exact matches.

For your sequences this is still much to long of a k-mer and it did not stand a chance to find a match. The 5-mers (actually 11 mer with 5 informative and 6 ignored positions) were able to find this, however this will increase the number of false positive matches that have to rejected by the alignment stage by a lot and therefore slow down the search by quite a bit.

You can play with k-mers and k-mer pattern parameters (-k, --spaced-kmer-mode, --spaced-kmer-pattern).

-s doesn't do anything for nucleotide searches we don't have real substitution matrices so we cannot generate similar k-mers.

You might want to do a translated-translated search (--search-type 2) though, there you will probably be able to find a lot more since synonymous mutations disappear and we can generate similar k-mers (-s sensitivity parameter).

I am not 100% sure why the alignment is truncated though since I am not super familiar with the nucleotide search code, but It might be to the --zdrop parameter which stops nucleotide alignments early if they become to bad. See the documentation of minimap2 for how it works (https://lh3.github.io/minimap2/minimap2.html, we use the alignment algorithm implemented by minimap2).

padix-key commented 4 years ago

Thank you for your response.

-s doesn't do anything for nucleotide searches we don't have real substitution matrices so we cannot generate similar k-mers.

I think the wiki misses this detail. The mmseqs manual even suggests that a substitution matrix is used for nucleotide k-mer generation (the part with nucl:nucleotide.out):

--seed-sub-mat MAT           Substitution matrix file for k-mer generation [nucl:nucleotide.out,aa:VTML80.out]

but It might be to the --zdrop parameter which stops nucleotide alignments early

Unfortunately, increasing --zdrop does not fix the truncation. Actually, extending the alignment to the left should even increase the score because the next position would contain a match (C with C).

soedinglab / MMseqs2