Open padix-key opened 4 years ago
MMseqs2 uses by default spaced double consecutive 15-mer matches for prefiltering in the nucleotide search. Since they are spaced (i.e. are actually 22-mers with 15 informative and 7 ignored position) we can still find non-exact matches.
For your sequences this is still much to long of a k-mer and it did not stand a chance to find a match. The 5-mers (actually 11 mer with 5 informative and 6 ignored positions) were able to find this, however this will increase the number of false positive matches that have to rejected by the alignment stage by a lot and therefore slow down the search by quite a bit.
You can play with k-mers and k-mer pattern parameters (-k
, --spaced-kmer-mode
, --spaced-kmer-pattern
).
-s
doesn't do anything for nucleotide searches we don't have real substitution matrices so we cannot generate similar k-mers.
You might want to do a translated-translated search (--search-type 2
) though, there you will probably be able to find a lot more since synonymous mutations disappear and we can generate similar k-mers (-s
sensitivity parameter).
I am not 100% sure why the alignment is truncated though since I am not super familiar with the nucleotide search code, but It might be to the --zdrop
parameter which stops nucleotide alignments early if they become to bad. See the documentation of minimap2 for how it works (https://lh3.github.io/minimap2/minimap2.html, we use the alignment algorithm implemented by minimap2).
Thank you for your response.
-s doesn't do anything for nucleotide searches we don't have real substitution matrices so we cannot generate similar k-mers.
I think the wiki misses this detail. The mmseqs
manual even suggests that a substitution matrix is used for nucleotide k-mer generation (the part with nucl:nucleotide.out
):
--seed-sub-mat MAT Substitution matrix file for k-mer generation [nucl:nucleotide.out,aa:VTML80.out]
but It might be to the --zdrop parameter which stops nucleotide alignments early
Unfortunately, increasing --zdrop
does not fix the truncation. Actually, extending the alignment to the left should even increase the score because the next position would contain a match (C
with C
).
I have two nucleotide sequences: The query is an ORF, the target is a truncated genome sequence, containing an homologous ORF with a high identity to the query.
Despite the high similarity,
mmseqs easy-search
does not find any matches, even at high sensitivity (7.5
). If I set the k-mer length to5
, a fitting match is found, though the alignment does not span the entire ORF, but misses the first 5 bases in both sequences.Expected Behavior
An alignment like this (Please ignore the sequence numbering):
Current Behavior
-k = 5
-> match with 5 bases missing in the beginningSteps to Reproduce (for bugs)
mmseqs easy-search query.fasta target.fasta results.out /tmp --format-output query,target,raw,pident,evalue,qstart,qend,tstart,tend,qaln,taln --search-type 3 -s 7.5
and
mmseqs easy-search query.fasta target.fasta results.out /tmp --format-output query,target,raw,pident,evalue,qstart,qend,tstart,tend,qaln,taln --search-type 3 -s 7.5 -k 5
with
query.fasta
:target.fasta
:MMseqs Output
Your Environment