Open salvoc81 opened 4 years ago
@salvoc81 thank you for sharing these results. Very interesting!
If you index a database with MMseqs2, then all k-mers are stored if no sensitivity `-s` is provided to `createindex`. However, if you search without an index, then only k-mers above a certain blosum62 threshold, defined by `-s`, are indexed. It is possible that these rejected k-mers could still be useful, since compositional bias correction (`--comp-bias-corr`) can produce results with a lower score than the rejected k-mers. In our benchmarks this had no measurable effect. You could test whether this is the cause of the disparity by providing the same `-s` as used for the search to `createindex`.
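A minimal sketch of that test, assuming placeholder names (`query.db`, `target.db`, `tmp_idx`, `tmp_search` and `result` are hypothetical) and reusing only flags that appear in this thread. By default the commands are only printed; nothing is executed unless `RUN=1` is set:

```shell
#!/usr/bin/env bash
# Sketch: use one sensitivity value for both createindex and search.
# All file and directory names below are placeholders, not taken from the thread.
SENS="${SENS:-7.5}"   # sensitivity used for the search in this issue

# Print a command; also execute it only when RUN=1.
run() {
  echo "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

index_and_search() {
  run mmseqs createindex target.db tmp_idx -s "$SENS" --search-type 1
  run mmseqs search query.db target.db result tmp_search -s "$SENS" --search-type 1
}

index_and_search
```

Keeping the sensitivity in a single variable makes it harder for the index and the search to drift apart when the value is changed later.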
Do you have a small example that I could use to reproduce this issue?
One remark: MMseqs2 ignores indexes on the query side.
@martin-steinegger I have tested passing the same `-s` when creating the indexes. The results are as follows:
Using the same `-s` as `search` in `createindex`:
Alignment | count | seconds |
---|---|---|
a-a | 10107 | 23.81 |
b-b | 23206 | 42.43 |
a-b | 8155 | 26.46 |
b-a | 8390 | 37.12 |
They are just slightly different.
> If you index a database with MMseqs2 then all k-mers are stored if no sensitivity -s is provided to createindex
Actually I thought that could have been the problem.
In early versions of MMseqs I had noticed a difference when running without indexed DBs, but it was small, and the only side effect was a slight increase in the overall execution time (maybe 10~20% slower). Now, however, it runs faster and finds fewer matches, which I guess is caused by what you explained.
> Do you have a small example that I could use to reproduce this issue?
Yes, get the 2 small proteomes I used from the following link
https://send.firefox.com/download/8d4ac7f72e90671b/#ioryCshD4vIZCAPxd30CCw
I will do another couple of tests to see if I can increase the accuracy when no indexing is performed.
UPDATE
@martin-steinegger I think I just found the problem...
When running `search` without selecting the matrix for prefiltering, the number of hits, as well as the running times, go back to what I expect. The differences are caused in this case by the use of the default VTML in the prefiltering step.
As you can see from the following table, the results are much more reasonable.
Without `--seed-sub-mat nucl:nucleotide.out,aa:blosum62.out` in `search`:
Alignment | count | seconds |
---|---|---|
a-a | 10209 | 29.87 |
b-b | 23523 | 52.05 |
a-b | 8281 | 32.13 |
b-a | 8533 | 45.62 |
I confirm this only happens when using blosum62 in the prefilter step.
One more thing to try (sorry, I lost track of this issue): you should pass the same substitution matrix to both `search` and `createindex`. Does that also result in something weird?
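A sketch of what passing the same matrix to both steps might look like. The `--seed-sub-mat` value is the one used for `search` in this issue; that `createindex` accepts the same flag is an assumption here and is worth confirming with `mmseqs createindex -h`. Database and directory names are placeholders. The functions only build the command lines so they can be inspected before running anything:

```shell
#!/usr/bin/env bash
# Sketch: pass one seed substitution matrix to both createindex and search.
# Database and directory names are placeholders; MATRIX reuses the value from this issue.
# Assumption: createindex accepts --seed-sub-mat in the same form as search.
MATRIX="${MATRIX:-nucl:nucleotide.out,aa:blosum62.out}"

# Build (but do not run) the two command lines.
createindex_cmd() {
  echo "mmseqs createindex target.db tmp_idx --seed-sub-mat $MATRIX --search-type 1"
}
search_cmd() {
  echo "mmseqs search query.db target.db result tmp_search --seed-sub-mat $MATRIX --search-type 1 -s 7.5"
}

createindex_cmd
search_cmd
```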
Hello @milot-mirdita and @martin-steinegger.
Sorry that it took me some time to do the extra testing.
As Milot was suggesting, the problem happens when `createindex` and `search` are not set to use the same matrix.
Below are the results of aligning a proteome against itself, using different combinations of VTML80 and blosum62 for `createindex` and `search`.
Pair | createindex | search | count | seconds |
---|---|---|---|---|
a-a | blosum62 | blosum62 | 10205 | 17.11 |
a-a | VTML80 | blosum62 | 13962 | 91.36 |
a-a | VTML80 | VTML80 | 14268 | 98.56 |
a-a | blosum62 | VTML80 | 10709 | 16.5 |
a-a | VTML40 | VTML40 | 14032 | 105.10 |
Pair | createindex | search | count | seconds |
---|---|---|---|---|
a-a | none | VTML80 | 14268 | 69.96 |
a-a | none | blosum62 | 10205 | 13.66 |
As you can see from the second line, the results are the same as in the first line of the first table (in which only blosum62 was used).
I guess this solves the issue, and I am happy we found the problem :)
Nevertheless, it would be very useful to have some kind of warning or, even better, an error message to prevent this from happening (unless the mismatch is the user's decision, in which case a "--force-submat" parameter might be handy).
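Until something like that exists, a check of this kind could be approximated on the user side. The sketch below is not an MMseqs2 feature: the `.seedmat` sidecar file and both function names are made up for illustration. It records the matrix chosen at indexing time and refuses a search that requests a different one:

```shell
#!/usr/bin/env bash
# Sketch of a user-side guard; MMseqs2 itself does not provide this.
# The ".seedmat" sidecar file is a made-up convention for this example.

# Record the matrix used when indexing a database.
record_matrix() {   # usage: record_matrix <db> <matrix>
  printf '%s\n' "$2" > "$1.seedmat"
}

# Succeed only if the matrix requested for the search matches the recorded one.
check_matrix() {    # usage: check_matrix <db> <matrix>
  local recorded
  if [ ! -f "$1.seedmat" ]; then
    echo "warning: no recorded matrix for $1" >&2
    return 0
  fi
  recorded="$(cat "$1.seedmat")"
  if [ "$recorded" != "$2" ]; then
    echo "error: index built with '$recorded' but search requests '$2'" >&2
    return 1
  fi
}
```

A pipeline would call `record_matrix` right after `createindex` and `check_matrix ... || exit 1` right before `search`, with whatever matrix names it actually uses.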
Also, as I understand it, among the BLOSUM matrices only `blosum62` can be set at present, while different VTML matrices can be set.
Could you please point me to where I can see which matrices can be used?
Most matrix files are under the `data` directory, but many did not work in my tests.
I have noticed that in recent versions (from v10) the number of hits generated by `search` can be substantially different when the input databases (especially the target) are not indexed. It should be noted that the execution times are also substantially different.
I tested most of the possible combinations using a single CPU, and summarized the results in the table below:
In the above test I am using `-s 7.5`, and it should be noted that the differences are much larger when decreasing the sensitivity. It should also be noted that I am using the `blosum62` matrix in the filtering step.
I need to run an experiment with thousands of proteomes, and it would be impossible to store all the indexes in advance.
It would be great if you could help in mitigating the effect of not indexing the target DBs.
Expected Behavior
Same results (or at least not too different)
Current Behavior
The number of hits in the output is sometimes half (at the highest sensitivity), and it gets worse at lower sensitivities.
Steps to Reproduce (for bugs)
Template commands:
mmseqs createdb <in_name> <in_name.db> -v 0
mmseqs createindex <in_name.db> <tmp_dir_in_name> --threads 2 --search-type 1 -v 0
mmseqs search <in_name1.db> <in_name2.db> <raw_out_1-2> <tmp_1-2> -s 7.5 --threads 1 -v 0 --search-type 1 --seed-sub-mat nucl:nucleotide.out,aa:blosum62.out --min-ungapped-score 15 --alignment-mode 2 --alt-ali 10
mmseqs convertalis <in_name1.db> <in_name2.db> <raw_out_1-2> <blast_out_1-2> -v 0 --format-mode 0 --search-type 1 --format-output query,target,qstart,qend,tstart,tend,bits