salvoc81 commented 4 years ago

I have noticed that in recent versions (from v10) the number of hits generated by search can be substantially different when not indexing the input databases (especially the target).

It should be noted that also the execution times are substantially different.

I tested most of the possibles combinations using a single CPUs, and symmarized the results in the table below:

Alignment	count	seconds
a-a	7199	7.43
b-b	16937	12.30
a-b	4844	7.89
b-a	4963	10.20
a-a	10392	24.73
b-b	24450	43.13
a-b	8621	27.14
b-a	8710	38.44
a*-b	4848	7.86
b-a*	8710	37.89
a-b*	8621	26.79
b*-a	4963	10.30
(single) a-b*	8155	26.38

*: indexed DB single: test in which the --alt-ali option was not used (it seem ok)

In the above test I am using -s 7.5, and should be noted that the difference ar much higher when decreasing the sensitivity. It should also be noticed that I am using the blosom62 matrix in the filtering step.

I need to run an experiment with thousands of proteomes and would impossible to store all the indexes in advance.
It would great if you could help in mitigating the effect of non-indexing the target DBs

Expected Behavior

Same results (or at least not too different)

Current Behavior

The number of hits in output are sometimes the half (at the highest sensitivity), and it gets worse at lower sensitivities

Steps to Reproduce (for bugs)

Template commands:

createdb

mmseqs createdb <in_name> <in_name.db> -v 0

indexdb

mmseqs createindex <in_name.db> <tmp_dir_in_name> --threads 2 --search-type 1 -v 0

search

mmseqs search <in_name1.db> <in_name2.db> <raw_out_1-2> <tmp_1-2> -s 7.5 --threads 1 -v 0 --search-type 1 --seed-sub-mat nucl:nucleotide.out,aa:blosum62.out --min-ungapped-score 15 --alignment-mode 2 --alt-ali 10

convertalis

mmseqs convertalis <in_name1.db> <in_name2.db> <raw_out_1-2> <blast_out_1-2> -v 0 --format-mode 0 --search-type 1 --format-output query,target,qstart,qend,tstart,tend,bits

Your Environment

Include as many relevant details about the environment you experienced the bug in.

MMseqs Version: e1a1c1226ef22ac3d0da8e8f71adb8fd2388a249 (but also happens with version 10)
X86 AVX2 Linux binaries
Operating system and version: tested on different Linux distributions

martin-steinegger commented 4 years ago

@salvoc81 thank you for sharing this results. Very interesting! If you index a database with MMseqs2 then all k-mers are stored if no sensitivity -s is provided to createindex. However, if you search without an index then only k-mers above a certain blosom62 threshold, defined by -s, are indexed. But it might be possible that these rejected k-mers might can be useful since compositional bias correction (--comp-bias-corr) can produce results with lower score than the reject k-mers. In our benchmarks this had no measurable effect. You could test if this is the causes of the disparity by providing the same -s as used for the search to createindex.

Do you have a small example that I could use the reproduce this issue?

Some remark: MMseqs2 ignores indexes on the query site.

salvoc81 commented 4 years ago

@martin-steinegger I have tested by passing the same -s when creating the indexes, following are the results:

Using same -s as search in createindex

Alignment	count	seconds
a-a	10107	23.81
b-b	23206	42.43
a-b	8155	26.46
b-a	8390	37.12

They are just slightly different.

If you index a database with MMseqs2 then all k-mers are stored if no sensitivity -s is provided to createindex

Actually I thought that could have been the problem.

In the early versions of MMseqs I had noticed the difference when running without the indexed DBs, but it was not that much, and the only side-effect was a slight increase in the overall execution time (maybe 10~20% slower). Nevertheless, now it runs faster and matches less, which is caused from what you explained about I guess.

Do you have a small example that I could use the reproduce this issue?

Yes, get the 2 small proteomes I used from the following link
https://send.firefox.com/download/8d4ac7f72e90671b/#ioryCshD4vIZCAPxd30CCw

I will do another couple of tests to see if I can increase the accuracy when no indexing is performed.

UPDATE @martin-steinegger I think just found the problem...
When running search without selecting the matrix for pre-filtering the number of hits, as well as the running times, go back to what expect. The differences are caused in this case by the use of the default VTML in the prefiltering step.

As you can see from the following table, the results are much more reasonable.

Without --seed-sub-mat nucl:nucleotide.out,aa:blosum62.ou in search

Alignment	count	seconds
a-a	10209	29.87
b-b	23523	52.05
a-b	8281	32.13
b-a	8533	45.62

I confirm this is only happening when using blosum62 in the prefilter step

milot-mirdita commented 4 years ago

One more thing to try (sorry I lost track of this issue). You should pass the same substitution matrix to both search and createindex. Does that also result in something weird?

salvoc81 commented 4 years ago

Hello @milot-mirdita and @martin-steinegger .
Sorry if it took me some time to extra testing.

As Milot was suggesting the problem happens when createindex and search are not set to use the Matrix. Following I am showing the results alignments of a proteome against itself, using different combinations of of VTML80 and blosum62 for createindex and search.

Pair	createindex	search	count	seconds
a-a	blosum62	blosum62	10205	17.11
a-a	VTML80	blosum62	13962	91.36
a-a	VTML80	VTML80	14268	98.56
a-a	blosum62	VTML80	10709	16.5
a-a	VTML40	VTML40	14032	105.10

Same alignments without indexing

Pair	createindex	search	count	seconds
a-a	none	VTML80	14268	69.96
a-a	none	blosum62	10205	13.66

As you can see from the second line, the results are same as in the first line of the fist table (in which only blosum62 was used).

I guess this solves the issue, and I am happy we found the problem :)
Nevertheless, it would be very useful to have some kind of warning or even better, error message to avoid such things to happen (unless it is not the user's decision, in which case a "--force-submat" parameter might be handy).

Also, as I understand, among the BLOSUM matrixes only blosum62 can be set at present, while different VTML matrixes can be set.

Could you please point me to somewhere I can see which MATRIXES can be used?
Most matrixes files are under the data directory, but many did not work in my tests.

soedinglab / MMseqs2

Very different different results with and without indexing the target DBs #318

Expected Behavior

Current Behavior

Steps to Reproduce (for bugs)

Template commands:

Your Environment

Same alignments without indexing