soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.36k stars 190 forks source link

Very different different results with and without indexing the target DBs #318

Open salvoc81 opened 4 years ago

salvoc81 commented 4 years ago

I have noticed that in recent versions (from v10) the number of hits generated by search can be substantially different when not indexing the input databases (especially the target).

It should be noted that also the execution times are substantially different.

I tested most of the possibles combinations using a single CPUs, and symmarized the results in the table below:

Alignment count seconds
a-a 7199 7.43
b-b 16937 12.30
a-b 4844 7.89
b-a 4963 10.20
a-a 10392 24.73
b-b 24450 43.13
a-b 8621 27.14
b-a 8710 38.44
a*-b 4848 7.86
b-a* 8710 37.89
a-b* 8621 26.79
b*-a 4963 10.30
(single) a-b* 8155 26.38

*: indexed DB single: test in which the --alt-ali option was not used (it seem ok)

In the above test I am using -s 7.5, and should be noted that the difference ar much higher when decreasing the sensitivity. It should also be noticed that I am using the blosom62 matrix in the filtering step.

I need to run an experiment with thousands of proteomes and would impossible to store all the indexes in advance.
It would great if you could help in mitigating the effect of non-indexing the target DBs

Expected Behavior

Same results (or at least not too different)

Current Behavior

The number of hits in output are sometimes the half (at the highest sensitivity), and it gets worse at lower sensitivities

Steps to Reproduce (for bugs)

Template commands:

createdb

mmseqs createdb <in_name> <in_name.db> -v 0

indexdb

mmseqs createindex <in_name.db> <tmp_dir_in_name> --threads 2 --search-type 1 -v 0

search

mmseqs search <in_name1.db> <in_name2.db> <raw_out_1-2> <tmp_1-2> -s 7.5 --threads 1 -v 0 --search-type 1 --seed-sub-mat nucl:nucleotide.out,aa:blosum62.out --min-ungapped-score 15 --alignment-mode 2 --alt-ali 10

convertalis

mmseqs convertalis <in_name1.db> <in_name2.db> <raw_out_1-2> <blast_out_1-2> -v 0 --format-mode 0 --search-type 1 --format-output query,target,qstart,qend,tstart,tend,bits

Your Environment

Include as many relevant details about the environment you experienced the bug in.

martin-steinegger commented 4 years ago

@salvoc81 thank you for sharing this results. Very interesting! If you index a database with MMseqs2 then all k-mers are stored if no sensitivity -s is provided to createindex. However, if you search without an index then only k-mers above a certain blosom62 threshold, defined by -s, are indexed. But it might be possible that these rejected k-mers might can be useful since compositional bias correction (--comp-bias-corr) can produce results with lower score than the reject k-mers. In our benchmarks this had no measurable effect. You could test if this is the causes of the disparity by providing the same -s as used for the search to createindex.

Do you have a small example that I could use the reproduce this issue?

Some remark: MMseqs2 ignores indexes on the query site.

salvoc81 commented 4 years ago

@martin-steinegger I have tested by passing the same -s when creating the indexes, following are the results:

Using same -s as search in createindex

Alignment count seconds
a-a 10107 23.81
b-b 23206 42.43
a-b 8155 26.46
b-a 8390 37.12

They are just slightly different.

If you index a database with MMseqs2 then all k-mers are stored if no sensitivity -s is provided to createindex

Actually I thought that could have been the problem.

In the early versions of MMseqs I had noticed the difference when running without the indexed DBs, but it was not that much, and the only side-effect was a slight increase in the overall execution time (maybe 10~20% slower). Nevertheless, now it runs faster and matches less, which is caused from what you explained about I guess.

Do you have a small example that I could use the reproduce this issue?

Yes, get the 2 small proteomes I used from the following link
https://send.firefox.com/download/8d4ac7f72e90671b/#ioryCshD4vIZCAPxd30CCw

I will do another couple of tests to see if I can increase the accuracy when no indexing is performed.

UPDATE @martin-steinegger I think just found the problem...
When running search without selecting the matrix for pre-filtering the number of hits, as well as the running times, go back to what expect. The differences are caused in this case by the use of the default VTML in the prefiltering step.

As you can see from the following table, the results are much more reasonable.

Without --seed-sub-mat nucl:nucleotide.out,aa:blosum62.ou in search

Alignment count seconds
a-a 10209 29.87
b-b 23523 52.05
a-b 8281 32.13
b-a 8533 45.62

I confirm this is only happening when using blosum62 in the prefilter step

milot-mirdita commented 4 years ago

One more thing to try (sorry I lost track of this issue). You should pass the same substitution matrix to both search and createindex. Does that also result in something weird?

salvoc81 commented 4 years ago

Hello @milot-mirdita and @martin-steinegger .
Sorry if it took me some time to extra testing.

As Milot was suggesting the problem happens when createindex and search are not set to use the Matrix. Following I am showing the results alignments of a proteome against itself, using different combinations of of VTML80 and blosum62 for createindex and search.

Pair createindex search count seconds
a-a blosum62 blosum62 10205 17.11
a-a VTML80 blosum62 13962 91.36
a-a VTML80 VTML80 14268 98.56
a-a blosum62 VTML80 10709 16.5
a-a VTML40 VTML40 14032 105.10

Same alignments without indexing

Pair createindex search count seconds
a-a none VTML80 14268 69.96
a-a none blosum62 10205 13.66

As you can see from the second line, the results are same as in the first line of the fist table (in which only blosum62 was used).

I guess this solves the issue, and I am happy we found the problem :)
Nevertheless, it would be very useful to have some kind of warning or even better, error message to avoid such things to happen (unless it is not the user's decision, in which case a "--force-submat" parameter might be handy).

Also, as I understand, among the BLOSUM matrixes only blosum62 can be set at present, while different VTML matrixes can be set.

Could you please point me to somewhere I can see which MATRIXES can be used?
Most matrixes files are under the data directory, but many did not work in my tests.