steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

`alntmscore` output is wrong #312

Open ekiefl opened 2 months ago

ekiefl commented 2 months ago

Expected Behavior

I expect alntmscore to be different from ttmscore.

Current Behavior

All alntmscore values are equal to ttmscore.

Steps to Reproduce (for bugs)

Here is a zipped directory of 75 structures in pdb format:

structures.zip

Unzip this directory. Then perform a search and convert the alignment.

rm -rf tmp
foldseek createdb ./structures/ targetDB
foldseek search targetDB targetDB alnDB tmp -a --exhaustive-search
foldseek convertalis targetDB targetDB alnDB output.txt --format-mode 4 --format-output query,target,qlen,tlen,alnlen,qtmscore,ttmscore,alntmscore,cigar,qseq,tseq

In a Python environment with pandas, confirm that every alntmscore equals the corresponding ttmscore, then print 10 randomly sampled hits:

import pandas as pd

df = pd.read_csv("output.txt", sep="\t")

# This assert passes, meaning that in every row ttmscore equals alntmscore
assert (df.ttmscore == df.alntmscore).all()

print(
    df[["query", "target", "qlen", "tlen", "alnlen", "qtmscore", "ttmscore", "alntmscore"]]
    .sample(20)
    .tail(10)
    .to_markdown()
)

Output of script:

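|      | query  | target |   qlen |   tlen |   alnlen |   qtmscore |   ttmscore |   alntmscore |
|-----:|:-------|:-------|-------:|-------:|---------:|-----------:|-----------:|-------------:|
| 1187 | B0RXV1 | V7BU96 |    227 |    563 |      206 |     0.5245 |     0.2354 |       0.2354 |
| 5406 | U5U2L0 | V4KAC2 |    550 |    599 |      560 |     0.6095 |     0.5638 |       0.5638 |
| 3335 | Q5F9Z5 | Q9XZT6 |    206 |    250 |      234 |     0.6303 |     0.5362 |       0.5362 |
|  861 | B0BUU8 | Q20230 |    203 |    191 |      210 |     0.5333 |     0.5591 |       0.5591 |
|  842 | B0BUU8 | B0RXV1 |    203 |    227 |      199 |     0.8489 |     0.7633 |       0.7633 |
| 2810 | K0F1X4 | B1JTS0 |    178 |    206 |      211 |     0.6423 |     0.5692 |       0.5692 |
| 2927 | L8IGY7 | P48769 |    256 |    260 |      254 |     0.9265 |     0.9125 |       0.9125 |
| 4925 | Q8R9S6 | Q834T6 |    206 |    226 |      245 |     0.4625 |     0.4305 |       0.4305 |
| 3861 | W5N438 | B8F7G0 |    264 |    208 |      255 |     0.5412 |     0.6642 |       0.6642 |
|  960 | B0K119 | C5A558 |    203 |    190 |      218 |     0.5143 |     0.5417 |       0.5417 |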

Foldseek Output (for bugs)

Full Foldseek standard out:

targetDB exists and will be overwritten
createdb ./structures/ targetDB

MMseqs Version:         9.427df8a
Path to ProstT5
Chain name mode         0
Write mapping file      0
Mask b-factor threshold 0
Coord store mode        2
Write lookup file       1
Input format            0
File Inclusion Regex    .*
File Exclusion Regex    ^$
Threads                 14
Verbosity               3

Output file: targetDB
[=================================================================] 100.00% 75 0s 11ms
Time for merging to targetDB_ss: 0h 0m 0s 1ms
Time for merging to targetDB_h: 0h 0m 0s 1ms
Time for merging to targetDB_ca: 0h 0m 0s 1ms
Time for merging to targetDB: 0h 0m 0s 1ms
Ignore 0 out of 75.
Too short: 0, incorrect: 0, not proteins: 0.
Time for processing: 0h 0m 0s 65ms
Create directory tmp
search targetDB targetDB alnDB tmp -a --exhaustive-search

MMseqs Version:                 9.427df8a
Seq. id. threshold              0
Coverage threshold              0
Coverage mode                   0
Max reject                      2147483647
Max accept                      2147483647
Add backtrace                   true
TMscore threshold               0
TMalign hit order               0
TMalign fast                    1
Preload mode                    0
Threads                         14
Verbosity                       3
LDDT threshold                  0
Sort by structure bit score     1
Alignment type                  2
Exact TMscore                   0
Substitution matrix             aa:3di.out,nucl:3di.out
Alignment mode                  3
Alignment mode                  0
E-value threshold               10
Min alignment length            0
Seq. id. mode                   0
Alternative alignments          0
Max sequence length             65535
Compositional bias              1
Compositional bias              1
Gap open cost                   aa:10,nucl:10
Gap extension cost              aa:1,nucl:1
Compressed                      0
Seed substitution matrix        aa:3di.out,nucl:3di.out
Sensitivity                     9.5
k-mer length                    0
Target search mode              0
k-score                         seq:2147483647,prof:2147483647
Max results per query           1000
Split database                  0
Split mode                      2
Split memory limit              0
Diagonal scoring                true
Exact k-mer matching            0
Mask residues                   0
Mask residues probability       0.99995
Mask lower case residues        1
Minimum diagonal score          30
Selected taxa
Spaced k-mers                   1
Spaced k-mer pattern
Local temporary path
Exhaustive search mode          true
Prefilter mode                  0
Search iterations               1
Remove temporary files          true
MPI runner
Force restart with latest tmp   false
Cluster search                  0

structurealign targetDB targetDB tmp/15707625884678452062/pref tmp/15707625884678452062/strualn --tmscore-threshold 0 --lddt-threshold 0 --sort-by-structure-bits 1 --alignment-type 2 --exact-tmscore 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 1 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 0.5 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 14 --compressed 0 -v 3

[=================================================================] 100.00% 75 0s 223ms
Time for merging to strualn: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 282ms
mvdb tmp/15707625884678452062/strualn tmp/15707625884678452062/aln

Time for processing: 0h 0m 0s 2ms
mvdb tmp/15707625884678452062/aln alnDB -v 3

Time for processing: 0h 0m 0s 3ms
Removing temporary files
rmdb tmp/15707625884678452062/pref -v 3

Time for processing: 0h 0m 0s 1ms
output.txt exists and will be overwritten
convertalis targetDB targetDB alnDB output.txt --format-mode 4 --format-output query,target,qlen,tlen,alnlen,qtmscore,ttmscore,alntmscore,cigar,qseq,tseq

MMseqs Version:         9.427df8a
Substitution matrix     aa:3di.out,nucl:3di.out
Alignment format        4
Format alignment output query,target,qlen,tlen,alnlen,qtmscore,ttmscore,alntmscore,cigar,qseq,tseq
Gap open cost           aa:10,nucl:10
Gap extension cost      aa:1,nucl:1
Database output         false
Preload mode            0
Threads                 14
Compressed              0
Verbosity               3
Exact TMscore           0

[=================================================================] 100.00% 75 0s 574ms
Time for merging to output.txt: 0h 0m 0s 3ms
Time for processing: 0h 0m 0s 715ms

Your Environment


EDIT: The behavior is observed in the following versions (all installed via conda):

The script fails in 5.53465f0 because qtmscore has not been implemented.

Related issues

https://github.com/steineggerlab/foldseek/issues/221

ekiefl commented 2 months ago

The behavior is also observed with easy-search:

rm -rf tmp
foldseek easy-search ./structures/ ./structures/ output.txt tmp -a --exhaustive-search --format-mode 4 --format-output query,target,qlen,tlen,alnlen,qtmscore,ttmscore,alntmscore,cigar,qseq,tseq

austinhpatton commented 2 months ago

I wanted to dig into this error a bit more, as I've found that in some cases we don't observe the equivalence between ttmscore and alntmscore. I suspected it had something to do with the structure of the --format-output specification. So I tried six combinations, changing the order in which alntmscore, qtmscore, and ttmscore were requested.

The commands used are shown below:

foldseek easy-search ./structures/ ./structures/ v1.m8 tmp_dir_1 --exhaustive-search 1 --alignment-type 2 --format-mode 4 --format-output query,target,qlen,tlen,alnlen,qtmscore,ttmscore,alntmscore
foldseek easy-search ./structures/ ./structures/ v2.m8 tmp_dir_2 --exhaustive-search 1 --alignment-type 2 --format-mode 4 --format-output query,target,qlen,alnlen,tlen,qtmscore,alntmscore,ttmscore
foldseek easy-search ./structures/ ./structures/ v3.m8 tmp_dir_2 --exhaustive-search 1 --alignment-type 2 --format-mode 4 --format-output query,target,tlen,alnlen,qlen,ttmscore,alntmscore,qtmscore
foldseek easy-search ./structures/ ./structures/ v4.m8 tmp_dir_3 --exhaustive-search 1 --alignment-type 2 --format-mode 4 --format-output query,target,tlen,qlen,alnlen,ttmscore,qtmscore,alntmscore
foldseek easy-search ./structures/ ./structures/ v5.m8 tmp_dir_4 --exhaustive-search 1 --alignment-type 2 --format-mode 4 --format-output query,target,alnlen,qlen,tlen,alntmscore,qtmscore,ttmscore
foldseek easy-search ./structures/ ./structures/ v6.m8 tmp_dir_5 --exhaustive-search 1 --alignment-type 2 --format-mode 4 --format-output query,target,alnlen,tlen,qlen,alntmscore,ttmscore,qtmscore
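
The six orderings can also be generated programmatically rather than typed by hand. The following is a minimal sketch that only builds the command strings and does not run Foldseek. It reuses the flags from the commands above; the helper name `build_commands`, the fixed order of the length columns, and the one-temporary-directory-per-run naming are assumptions for illustration, and the v1 to v6 numbering does not necessarily match the files above.

```python
from itertools import permutations

# Flags copied from the commands above; only the TM-score column order varies.
BASE = (
    "foldseek easy-search ./structures/ ./structures/ {out} {tmp} "
    "--exhaustive-search 1 --alignment-type 2 --format-mode 4 "
    "--format-output {cols}"
)

FIXED_COLUMNS = ["query", "target", "qlen", "tlen", "alnlen"]
TM_COLUMNS = ["qtmscore", "ttmscore", "alntmscore"]


def build_commands():
    """Return one easy-search command per ordering of the three TM-score columns."""
    commands = []
    for i, order in enumerate(permutations(TM_COLUMNS), start=1):
        cols = ",".join(FIXED_COLUMNS + list(order))
        # Hypothetical naming: a separate output file and tmp directory per run.
        commands.append(BASE.format(out=f"v{i}.m8", tmp=f"tmp_dir_{i}", cols=cols))
    return commands


for command in build_commands():
    print(command)
```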

I modified @ekiefl's python script to check whether alntmscore is always equal to either ttmscore or qtmscore in an output file given on the command line. The script is shown below:

import pandas as pd
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("input_file", type=str)
args = parser.parse_args()

df = pd.read_csv(args.input_file, sep="\t")

# Check if alntmscore equals ttmscore or qtmscore
if (df.ttmscore == df.alntmscore).all():
    print("All alntmscore values equal ttmscore.")
elif (df.qtmscore == df.alntmscore).all():
    print("All alntmscore values equal qtmscore.")
else:
    print("alntmscore values do not match ttmscore or qtmscore consistently.")

print(
    df[["query", "target", "qlen", "tlen", "alnlen", "qtmscore", "ttmscore", "alntmscore"]]
    .sample(20)
    .tail(10)
    .to_markdown()
)

And using this script, we find the following:

python foldseek_debug.py v1.m8
All alntmscore values equal ttmscore.
|      | query      | target     |   qlen |   tlen |   alnlen |   qtmscore |   ttmscore |   alntmscore |
|-----:|:-----------|:-----------|-------:|-------:|---------:|-----------:|-----------:|-------------:|
| 1694 | Q9PRC5.pdb | V4TDT2.pdb |    230 |    539 |      286 |     0.6175 |     0.2986 |       0.2986 |
|  317 | B0K7K7.pdb | B1AI06.pdb |    203 |    230 |      209 |     0.88   |     0.7825 |       0.7825 |
| 2455 | B1L821.pdb | U5U2L0.pdb |    197 |    550 |      182 |     0.5044 |     0.2132 |       0.2132 |
| 4026 | Q20230.pdb | W5PRH4.pdb |    191 |    277 |      245 |     0.6217 |     0.4638 |       0.4638 |
| 5581 | L1J5L1.pdb | Q54UT2.pdb |    254 |    285 |      256 |     0.6551 |     0.5885 |       0.5885 |
| 2561 | U7PJT1.pdb | W5N0Q0.pdb |    257 |    263 |      288 |     0.3744 |     0.368  |       0.368  |
| 1031 | Q8R9S6.pdb | I3MMU0.pdb |    206 |    277 |      276 |     0.3828 |     0.3006 |       0.3006 |
| 1800 | Q54UT2.pdb | Q54UT2.pdb |    285 |    285 |      285 |     1      |     1      |       1      |
| 1023 | Q8R9S6.pdb | L8IGY7.pdb |    206 |    256 |      252 |     0.4647 |     0.3913 |       0.3913 |
| 4843 | Q4DE18.pdb | K7GFB3.pdb |    351 |    242 |      240 |     0.4184 |     0.5662 |       0.5662 |

python foldseek_debug.py v2.m8
All alntmscore values equal qtmscore.
|      | query      | target     |   qlen |   tlen |   alnlen |   qtmscore |   ttmscore |   alntmscore |
|-----:|:-----------|:-----------|-------:|-------:|---------:|-----------:|-----------:|-------------:|
| 3434 | B1AIY1.pdb | B0U447.pdb |    224 |    217 |      260 |     0.3748 |     0.3864 |       0.3748 |
|  369 | W5N0Q0.pdb | P38493.pdb |    263 |    224 |      274 |     0.3627 |     0.4111 |       0.3627 |
| 1324 | P38493.pdb | B7MTM8.pdb |    224 |    213 |      263 |     0.39   |     0.4051 |       0.39   |
| 4291 | B0U447.pdb | B1L821.pdb |    217 |    197 |      196 |     0.7853 |     0.8619 |       0.7853 |
| 1887 | B0W8G8.pdb | W5N438.pdb |    246 |    264 |      258 |     0.8034 |     0.7519 |       0.8034 |
| 4467 | P48769.pdb | B7MTM8.pdb |    260 |    213 |      258 |     0.5212 |     0.6163 |       0.5212 |
| 5507 | P27707.pdb | W0R9N0.pdb |    260 |    221 |      246 |     0.676  |     0.7849 |       0.676  |
| 4474 | P48769.pdb | Q7VKH4.pdb |    260 |    214 |      265 |     0.5115 |     0.6031 |       0.5115 |
| 4617 | B1L821.pdb | Q20230.pdb |    197 |    191 |      211 |     0.4935 |     0.5054 |       0.4935 |
| 1502 | V4TDT2.pdb | V7BU96.pdb |    539 |    563 |      564 |     0.622  |     0.5973 |       0.622  |

python foldseek_debug.py v3.m8
All alntmscore values equal ttmscore.
|      | query      | target     |   qlen |   tlen |   alnlen |   qtmscore |   ttmscore |   alntmscore |
|-----:|:-----------|:-----------|-------:|-------:|---------:|-----------:|-----------:|-------------:|
| 1729 | B7MTM8.pdb | Q7VKH4.pdb |    213 |    214 |      209 |     0.9559 |     0.9515 |       0.9515 |
| 4020 | B1I165.pdb | L8IGY7.pdb |    221 |    256 |      249 |     0.5575 |     0.4923 |       0.4923 |
| 2513 | B0W8G8.pdb | B8F7G0.pdb |    246 |    208 |      235 |     0.5625 |     0.6493 |       0.6493 |
| 4939 | V4T3S7.pdb | Q1QEQ3.pdb |    470 |    230 |      205 |     0.2217 |     0.4064 |       0.4064 |
| 4871 | Q20230.pdb | V4KAC2.pdb |    191 |    599 |      182 |     0.511  |     0.1944 |       0.1944 |
| 3962 | W5JIH9.pdb | P63807.pdb |    275 |    219 |      265 |     0.3686 |     0.4393 |       0.4393 |
| 4892 | V4T3S7.pdb | I3MMU0.pdb |    470 |    277 |      278 |     0.3968 |     0.6615 |       0.6615 |
| 2351 | Q6FZE3.pdb | B1JI38.pdb |    192 |    212 |      231 |     0.4975 |     0.4619 |       0.4619 |
| 2572 | I3MMU0.pdb | B0W8G8.pdb |    277 |    246 |      256 |     0.701  |     0.7851 |       0.7851 |
| 4191 | Q9PRC5.pdb | Q8Y5W6.pdb |    230 |    224 |      257 |     0.4048 |     0.4136 |       0.4136 |

python foldseek_debug.py v4.m8
All alntmscore values equal qtmscore.
|      | query      | target     |   qlen |   tlen |   alnlen |   qtmscore |   ttmscore |   alntmscore |
|-----:|:-----------|:-----------|-------:|-------:|---------:|-----------:|-----------:|-------------:|
| 1507 | W5PR96.pdb | Q58EI2.pdb |    260 |    264 |      264 |     0.9114 |     0.898  |       0.9114 |
| 3144 | B6U6Y9.pdb | Q3AFE0.pdb |    489 |    289 |      181 |     0.1788 |     0.2795 |       0.1788 |
| 3583 | W5JIH9.pdb | Q20230.pdb |    275 |    191 |      224 |     0.4478 |     0.6059 |       0.4478 |
|  690 | O83373.pdb | B7MTM8.pdb |    208 |    213 |      217 |     0.7442 |     0.7285 |       0.7442 |
| 5293 | U7PMD6.pdb | B2UVL9.pdb |    307 |    191 |      279 |     0.3281 |     0.4828 |       0.3281 |
| 3994 | B4TFH5.pdb | Q9PRC5.pdb |    213 |    230 |      216 |     0.8171 |     0.7617 |       0.8171 |
| 4220 | V4TDT2.pdb | V6DIT2.pdb |    539 |    234 |      275 |     0.4053 |     0.9165 |       0.4053 |
| 2414 | V4T3S7.pdb | M3ZT88.pdb |    470 |    263 |      279 |     0.4116 |     0.7195 |       0.4116 |
| 2393 | Q6FZE3.pdb | B6U6Y9.pdb |    192 |    489 |      200 |     0.4129 |     0.1927 |       0.4129 |
| 2929 | W1NL77.pdb | I1HTV7.pdb |    586 |    517 |      521 |     0.5987 |     0.6735 |       0.5987 |

python foldseek_debug.py v5.m8
alntmscore values do not match ttmscore or qtmscore consistently.
|      | query      | target     |   qlen |   tlen |   alnlen |   qtmscore |   ttmscore |   alntmscore |
|-----:|:-----------|:-----------|-------:|-------:|---------:|-----------:|-----------:|-------------:|
|   22 | B0K7K7.pdb | O83373.pdb |    203 |    208 |      210 |     0.8133 |     0.7951 |       0.8133 |
| 1205 | M8CV53.pdb | U5U2L0.pdb |    526 |    550 |      520 |     0.6098 |     0.5853 |       0.6161 |
| 1249 | M8CV53.pdb | B0K7K7.pdb |    526 |    203 |      198 |     0.211  |     0.479  |       0.4891 |
| 2314 | B1JI38.pdb | Q3AFE0.pdb |    212 |    289 |      209 |     0.3783 |     0.3027 |       0.3827 |
| 4951 | I1HTV7.pdb | M8CV53.pdb |    517 |    526 |      468 |     0.6866 |     0.6754 |       0.7547 |
| 1175 | V6DIT2.pdb | B1J4Z5.pdb |    234 |    207 |      243 |     0.51   |     0.5661 |       0.5661 |
| 5024 | I1HTV7.pdb | Q8R9S6.pdb |    517 |    206 |      305 |     0.2084 |     0.4336 |       0.4336 |
| 3224 | P48769.pdb | B1AIY1.pdb |    260 |    224 |      292 |     0.3047 |     0.3421 |       0.3421 |
| 1199 | V6DIT2.pdb | B1J5H0.pdb |    234 |    228 |      272 |     0.4114 |     0.4199 |       0.4199 |
| 2686 | L8IGY7.pdb | C5A558.pdb |    256 |    190 |      235 |     0.4193 |     0.5343 |       0.5343 |

python foldseek_debug.py v6.m8
alntmscore values do not match ttmscore or qtmscore consistently.
|      | query      | target         |   qlen |   tlen |   alnlen |   qtmscore |   ttmscore |   alntmscore |
|-----:|:-----------|:---------------|-------:|-------:|---------:|-----------:|-----------:|-------------:|
| 3852 | B8F7G0.pdb | W5PR96.pdb     |    208 |    260 |      259 |     0.6604 |     0.545  |       0.6604 |
| 5475 | B1J4Z5.pdb | B1J4Z5.pdb     |    207 |    207 |      207 |     1      |     1      |       1      |
| 2670 | Q9XZT6.pdb | B1J4Z5.pdb     |    250 |    207 |      230 |     0.5031 |     0.5887 |       0.5887 |
| 3688 | I3MMU0.pdb | A0A818UTT8.pdb |    277 |    249 |      243 |     0.6877 |     0.7622 |       0.7804 |
| 3365 | Q4DE18.pdb | B1AIY1.pdb     |    351 |    224 |      267 |     0.2955 |     0.422  |       0.422  |
| 3928 | Q8Y5W6.pdb | I3MMU0.pdb     |    224 |    277 |      278 |     0.4254 |     0.3598 |       0.4254 |
| 4085 | W6ULK1.pdb | Q9UXG7.pdb     |    186 |    189 |      173 |     0.5381 |     0.5309 |       0.5719 |
| 3140 | B0K7K7.pdb | Q8R9S6.pdb     |    203 |    206 |      230 |     0.4541 |     0.4497 |       0.4541 |
| 4465 | Q834T6.pdb | P48769.pdb     |    226 |    260 |      265 |     0.4025 |     0.3607 |       0.4025 |
| 5571 | O83373.pdb | B1AI06.pdb     |    208 |    230 |      219 |     0.7591 |     0.6935 |       0.7591 |
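
Because the script above only reports whether all rows match, a per-row tally can make the v5 and v6 behaviour easier to see. This is a minimal sketch that uses only the standard library instead of pandas. The function name `classify_alntmscore` is hypothetical, and the four sample rows are copied from the v5 and v6 tables above purely for illustration.

```python
import csv
import io
from collections import Counter


def classify_alntmscore(handle):
    """Tally, row by row, which TM-score column alntmscore coincides with."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        aln = float(row["alntmscore"])
        same_q = aln == float(row["qtmscore"])
        same_t = aln == float(row["ttmscore"])
        if same_q and same_t:
            counts["both"] += 1
        elif same_q:
            counts["qtmscore"] += 1
        elif same_t:
            counts["ttmscore"] += 1
        else:
            counts["neither"] += 1
    return counts


# Four rows copied from the v5 and v6 tables above (illustrative sample only).
SAMPLE = (
    "query\ttarget\tqtmscore\tttmscore\talntmscore\n"
    "B0K7K7.pdb\tO83373.pdb\t0.8133\t0.7951\t0.8133\n"
    "M8CV53.pdb\tU5U2L0.pdb\t0.6098\t0.5853\t0.6161\n"
    "V6DIT2.pdb\tB1J4Z5.pdb\t0.51\t0.5661\t0.5661\n"
    "B1J4Z5.pdb\tB1J4Z5.pdb\t1\t1\t1\n"
)

counts = classify_alntmscore(io.StringIO(SAMPLE))
print(dict(counts))
```

For a real output file written with `--format-mode 4` (which includes a header row), the same function can be given an open file handle, for example `open("v5.m8", newline="")`.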

In brief:

  1. When alntmscore is not the first of the three TM-scores requested by --format-output, it always equals either ttmscore or qtmscore
    • Which score it equals depends on the order in which the columns are requested
  2. When alntmscore is the first of the three TM-scores requested by --format-output, it does not consistently equal either of the other two. It often (though not always) equals one of them, but which one is not consistently predicted by qlen, tlen, or alnlen

ekiefl commented 3 weeks ago

Bumping this issue, @milot-mirdita and @martin-steinegger.

Below is an example of how detrimental this bug is.

I performed a one-versus-many calculation of TM-score with both Foldseek and TMalign. For each TMalign result, the "alntmscore" is calculated manually with the following prescription:


[image: equations defining the manual alntmscore calculation]


Then, the scores are ranked from high to low and the result is shown as the monotonically increasing trace (blue). Using the same ordering, the foldseek alntmscore results are shown as red dots using the following settings:

prefilter_mode=2
alignment_type=1
tmalign_fast=0
exact_tmscore=1

[image: plot of the ranked TMalign alntmscore trace (blue) with Foldseek alntmscore values (red dots)]

If instead I manually calculate alntmscore from the Foldseek values qlen, qtmscore, tlen and ttmscore according to the above prescription, the TMalign and Foldseek results converge:

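[image: plot comparing TMalign alntmscore with alntmscore recalculated manually from the Foldseek output]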

As far as I can tell, everyone who uses the alntmscore output by Foldseek gets results akin to the first plot, as is demonstrated in the MRE I presented above.

martin-steinegger commented 3 weeks ago

Thank you for the analysis. How do you compute the alntmscore? I checked the code: we use std::min(static_cast(res.backtrace.size()), std::min(res.dbLen, res.qLen)) as the normalization factor.

Normalizing by res.backtrace.size() might be better, maybe we should change this.
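
For context, here is a minimal sketch of how the normalization length enters the score. It assumes the standard TM-score definition (Zhang and Skolnick) and is not Foldseek's actual code. The function names and the 0.5 Å floor on d0 are assumptions for illustration.

```python
def d0(norm_len):
    """Distance scale d0 of the standard TM-score definition."""
    # Published formula: 1.24 * (L - 15)^(1/3) - 1.8, meaningful for L > 15.
    if norm_len <= 15:
        return 0.5  # assumed floor for very short lengths
    # Assumed floor of 0.5 so that d0 stays positive just above L = 15.
    return max(0.5, 1.24 * (norm_len - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, norm_len):
    """TM-score of aligned residue pairs, normalized by norm_len.

    distances: distances (in Angstrom) between aligned residue pairs
               after superposition.
    norm_len:  normalization length, e.g. qlen, tlen, or the alignment length.
    """
    scale = d0(norm_len)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / norm_len


# 100 perfectly superposed pairs, normalized by two different lengths.
perfect = [0.0] * 100
print(tm_score(perfect, 100))
print(tm_score(perfect, 200))
```

Under this definition the normalization length also sets d0, so a score normalized by one length is in general not a simple rescaling of a score normalized by another.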

martin-steinegger commented 3 weeks ago

Okay, I think I know what is going on. If you print out qtmscore,ttmscore,alntmscore then alntmscore = ttmscore. If you print ttmscore,qtmscore,alntmscore then alntmscore = qtmscore. However, if you print alntmscore,qtmscore,ttmscore then it should work. This is a bug, which I will fix soon.

ekiefl commented 3 weeks ago

Thanks for the response.

> How do you compute the alntmscore?

I provided some equations above, perhaps they didn't render. Or is there something more specific you are curious about?

> In case of TMalign, our ranking is done by (qTM+tTM)/2. Might this explain why you see this kind of ranking?

Each comparison, whether calculated by TMalign or Foldseek, is ordered according to TMalign's alntmscore. That's why the TMalign curve increases monotonically. So Foldseek's ranking is irrelevant given how the data has been presented.

> Normalizing by res.backtrace.size() might be better, maybe we should change this.

Given that ttmscore is normalized by tlen, and qtmscore is normalized by qlen, I think I agree that alntmscore should be calculated by normalizing by alnlen.


However, a bug persists even if this were changed, as illustrated in this table output from the MRE. Allow me to explain.

Basically, in the table alntmscore always equals ttmscore. Given that the normalization is min(alnlen, qlen, tlen), this would only be possible if tlen were always the smallest of the three lengths; however, the other columns show that isn't the case.

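|      | query  | target |   qlen |   tlen |   alnlen |   qtmscore |   ttmscore |   alntmscore |
|-----:|:-------|:-------|-------:|-------:|---------:|-----------:|-----------:|-------------:|
| 1187 | B0RXV1 | V7BU96 |    227 |    563 |      206 |     0.5245 |     0.2354 |       0.2354 |
| 5406 | U5U2L0 | V4KAC2 |    550 |    599 |      560 |     0.6095 |     0.5638 |       0.5638 |
| 3335 | Q5F9Z5 | Q9XZT6 |    206 |    250 |      234 |     0.6303 |     0.5362 |       0.5362 |
|  861 | B0BUU8 | Q20230 |    203 |    191 |      210 |     0.5333 |     0.5591 |       0.5591 |
|  842 | B0BUU8 | B0RXV1 |    203 |    227 |      199 |     0.8489 |     0.7633 |       0.7633 |
| 2810 | K0F1X4 | B1JTS0 |    178 |    206 |      211 |     0.6423 |     0.5692 |       0.5692 |
| 2927 | L8IGY7 | P48769 |    256 |    260 |      254 |     0.9265 |     0.9125 |       0.9125 |
| 4925 | Q8R9S6 | Q834T6 |    206 |    226 |      245 |     0.4625 |     0.4305 |       0.4305 |
| 3861 | W5N438 | B8F7G0 |    264 |    208 |      255 |     0.5412 |     0.6642 |       0.6642 |
|  960 | B0K119 | C5A558 |    203 |    190 |      218 |     0.5143 |     0.5417 |       0.5417 |

martin-steinegger commented 3 weeks ago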

@ekiefl thank you so much. Please see my comment above. Could you please try to print out the tmscores in this order: alntmscore,qtmscore,ttmscore. Does this change anything?

ekiefl commented 3 weeks ago

> Okay, I think I know what is going on. If you print out qtmscore,ttmscore,alntmscore then alntmscore = ttmscore. If you print ttmscore,qtmscore,alntmscore then alntmscore = qtmscore. However, if you print alntmscore,qtmscore,ttmscore then it should work. This is a bug, which I will fix soon.

Exactly. It has something to do with this variable: https://github.com/steineggerlab/foldseek/blob/bc212bc8602ef426c7b58368c65dd744443f802c/src/strucclustutils/structureconvertalis.cpp#L891

Given that the normalization is currently min(alnlen, qlen, tlen), @austinhpatton's comment makes sense:

> When alntmscore is the first of the three TM-scores requested by --format-output, it does not consistently equal one or the other, but often (though not always) is equal to one of them, but not in a manner that is consistently predicted by qlen, tlen, or alnlen


> Could you try to print out the tmscores in this order: alntmscore,qtmscore,ttmscore. Does this change anything?

The specific example stems from a subset of a larger all-vs-all, so it's not easy to re-run the results. But, I have just confirmed the effect that the order alntmscore,qtmscore,ttmscore has on the MRE, which now produces this table:

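|      | query  | target |   qlen |   tlen |   alnlen |   qtmscore |   ttmscore |   alntmscore |
|-----:|:-------|:-------|-------:|-------:|---------:|-----------:|-----------:|-------------:|
| 3599 | B1I165 | U7PMD6 |    221 |    307 |      303 |     0.4044 |     0.315  |       0.4044 |
| 3772 | B1JI38 | Q9UXG7 |    212 |    189 |      209 |     0.7307 |     0.8108 |       0.8108 |
| 2365 | I3MMU0 | B4TFH5 |    277 |    213 |      254 |     0.4645 |     0.5776 |       0.5776 |
| 4929 | Q20230 | P0C1G0 |    191 |    228 |      230 |     0.6095 |     0.5279 |       0.6095 |
| 1568 | Q9UXG7 | Q8R9S6 |    189 |    206 |      217 |     0.4942 |     0.4614 |       0.4942 |
| 5620 | Q7VKH4 | U7PMD6 |    214 |    307 |      299 |     0.4353 |     0.3304 |       0.4353 |
|  760 | W5N0Q0 | V4CH82 |    263 |    229 |      262 |     0.7477 |     0.8545 |       0.8545 |
| 2397 | I3MMU0 | Q3AFE0 |    277 |    289 |      234 |     0.3628 |     0.3507 |       0.4174 |
| 5389 | Q5F9Z5 | B1AIY1 |    206 |    224 |      262 |     0.4457 |     0.4171 |       0.4457 |
| 1115 | W6ULK1 | P63807 |    186 |    219 |      221 |     0.2924 |     0.2573 |       0.2924 |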

This table matches what one expects if alntmscore is normalized by the minimum of the lengths.
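
One way to check this claim against the ten rows above is sketched below. It assumes the min(alnlen, qlen, tlen) normalization discussed in this thread: when qlen is the shortest length alntmscore should equal qtmscore, when tlen is the shortest it should equal ttmscore, and when alnlen is the shortest there is no column to compare against. The helper names `expected_match` and `is_consistent` are hypothetical.

```python
# Rows copied from the table above:
# (qlen, tlen, alnlen, qtmscore, ttmscore, alntmscore)
ROWS = [
    (221, 307, 303, 0.4044, 0.3150, 0.4044),
    (212, 189, 209, 0.7307, 0.8108, 0.8108),
    (277, 213, 254, 0.4645, 0.5776, 0.5776),
    (191, 228, 230, 0.6095, 0.5279, 0.6095),
    (189, 206, 217, 0.4942, 0.4614, 0.4942),
    (214, 307, 299, 0.4353, 0.3304, 0.4353),
    (263, 229, 262, 0.7477, 0.8545, 0.8545),
    (277, 289, 234, 0.3628, 0.3507, 0.4174),
    (206, 224, 262, 0.4457, 0.4171, 0.4457),
    (186, 219, 221, 0.2924, 0.2573, 0.2924),
]


def expected_match(qlen, tlen, alnlen):
    """Name the column alntmscore should equal under min-length normalization."""
    shortest = min(qlen, tlen, alnlen)
    if shortest == qlen:
        return "qtmscore"
    if shortest == tlen:
        return "ttmscore"
    return None  # alnlen is the shortest: no existing column to compare with


def is_consistent(row):
    """Check one row against the min-length normalization hypothesis."""
    qlen, tlen, alnlen, qtm, ttm, alntm = row
    expected = expected_match(qlen, tlen, alnlen)
    if expected == "qtmscore":
        return alntm == qtm
    if expected == "ttmscore":
        return alntm == ttm
    return True  # nothing to compare against in this case


for row in ROWS:
    print(row[:3], expected_match(*row[:3]), is_consistent(row))
```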


When you get around to fixing the bug, may I suggest reporting a rawtmscore that one can normalize how they see fit?

martin-steinegger commented 3 weeks ago

Thank you Evan. I pushed a fix. Could you retry it with the newest version? It should fix the order issue and I changed the normalization to the backtrace size as well.

ekiefl commented 3 weeks ago

Great. I can test it out if you point me to some instructions for building from source.