soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License
1.42k stars 195 forks source link

BLAST tab format does not report gap openings (potentially wrong bit score) #792

Open sarahet opened 10 months ago

sarahet commented 10 months ago

Hi - I ran the easy-search command (newest release, 6f452, self-compiled with cmake 3.25.2 and gcc 12.2.0) on a Linux system with the example query and database file from the MMseqs2 repository and encountered an unexpected behaviour. The BLAST tab output format never includes gap openings (even though many alignments should contain them) and the bit score deviates from what e.g. BLAST reports for the same alignments. To give an example, the pretty html output contains the following alignment for query C0QTH6 and subject A0A059IV01:

Q 0 IGAGVAIGAAAGGGAAGLGNAIRGVLEGMARNPNMGGKLLTTMFIGMALIETFVLYGLLIAIIFLF
T 0 IAIGLAVFGMLGAG-LGIANIFSAMLNGIARNPESEGKMKSYVYIGAAMVEIMGLLAFVLAMLLIF

However, when running the same search but producing a BLAST tab output, the gap is not present (i.e. 0 gap openings):

C0QTH6  A0A059IV01  0.333  66  43   0   43  108 9   73  3.416E-07  48

BLAST produces basically the same alignment (the gap is moved by one but it should create the same score due to the pairing):

Query  43   IGAGVAIGAAAGGGAAGLGNAIRGVLEGMARNPNMGGKLLTTMFIGMALIETFVLYGLLIAIIFLF  108
            I  G+A+    G G  G+ N    +L G+ARNP   GK+ + ++IG A++E   L   ++A++ +F
Sbjct  9    IAIGLAVFGMLGAGL-GIANIFSAMLNGIARNPESEGKMKSYVYIGAAMVEIMGLLAFVLAMLLIF  73

Here, the BLAST tab format also includes the gap and it leads to a different bit score compared to MMseqs2.

C0QTH6  A0A059IV01  33.333  66  43  1   43  108 9   73  4.79e-08  46.2

My exact calls (for the BLAST tab output) are:

mmseqs createdb DB.fasta db_mmseqs2

mmseqs easy-search \
QUERY.fasta \
db_mmseqs2 \
out.m8 \
tmp \
-s 7.5 \
--search-type 3 \
--threads 4 \
--max-seqs 10000 \
--comp-bias-corr 0 \
--comp-bias-corr-scale 0 \
-e 1
milot-mirdita commented 10 months ago

This is a bit of a common footgun with MMseqs2: https://github.com/soedinglab/MMseqs2/wiki#how-does-mmseqs2-compute-the-sequence-identity

MMseqs2 by default only determines accurate bit-scores and E-values and does not compute the full alignment, instead of just estimates sequence identity from the bit-score.

You have to either pass -a or --alignment-mode 3 to compute full backtraces and therefore accurate sequence-identiteis and gap-opens etc.

sarahet commented 10 months ago

Thanks a lot for your reply. I did not set this option because according to the help message, I thought the --alignment-mode 3 is set by default:

 --alignment-mode INT             How to compute the alignment:
                                  0: automatic
                                  1: only score and end_pos
                                  2: also start_pos and cov
                                  3: also seq.id
                                  4: only ungapped alignment [3]

When I run the command as I did (without explicitly setting it), I also get the following output:

MMseqs Version:                         6f45232ac8daca14e354ae320a4359056ec524c2
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Add backtrace                           false
Alignment mode                          3
Alignment mode                          0

This is the same when I manually set --alignment-mode 3. Can you maybe help me understand why there are 2 alignment mode parameters in the output that differ (i.e. are there 2 different parameters and I forgot to set one)? I also get the same results and bit scores (such as the example above) when I run the command as I reported above, when I manually set --alignment-mode 3 or when I additionally set -a. I would appreciate any help!

milot-mirdita commented 10 months ago

I think the important part is to set -a. Me mentioning --alignment-mode was a bit misleading.

The gap open count is computed based on the presence of a backtrace, which we don't compute by default.

sarahet commented 10 months ago

I see, thanks. When I run the command with -a I get the following verbose output:

MMseqs Version:                         6f45232ac8daca14e354ae320a4359056ec524c2
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Add backtrace                           true
Alignment mode                          3
Alignment mode                          0

so the backtrace seems to be enabled. However, I still get exactly the same results and bit scores as before unfortunately. Am I missing another parameter here?