steineggerlab / foldmason

Multiple Protein Structure Alignment at Scale with FoldMason
https://search.foldseek.com/foldmason
GNU General Public License v3.0
128 stars 13 forks source link

MSA LDDT averaging #6

Closed tjmier closed 2 months ago

tjmier commented 2 months ago

When calculating the MSA LDDT score foldmason sums the per-column average LDDT score then normalizes to the total number of columns.

When a column contains only gaps other than one sequence it cannot have an LDDT score (hence the value -1 you get from grep-ing the scores in the html #4), however this column contributes to the total number of columns being normalized to so the average LDDT being reported smaller than you would expect.

Hypothetical Alignment

---AAA
CCCAAA
---AAA

Hypothetical average LDDT scores per column [-1,-1,-1,0.9,0.9,0.9]

Expected MSA LDDT: 0.9 Reported MSA LDDT: 0.15

I am able to replicate this from my own analysis with foldmason when I pull the per-column LDDT scores from the html file. The reported msa LDDT score is normalized to the total number of columns despite whether or not they have an LDDT score.

Is this intentional?

tjmier commented 2 months ago

After thinking it over, I can see the merit in treating the single character columns as an LDDT score of 0. However, I think it could be useful to display the average of average LDDT score for all the columns as well as a score for the common core columns (gapless columns). A one off multidomain protein could drop the LDDT score significantly despite the rest of the alignment performing highly.

gamcil commented 2 months ago

Thank you for the feedback. Yeah, the idea was to still count non-scoring columns in the average LDDT in order to penalize super gappy MSAs, but reporting the common core score too would definitely have merit. Currently you can achieve this by adjusting --pair-threshold to exclude columns with a certain % of gap characters, but you have to run msa2lddt multiple times which is annoying - I think reporting multiple scores at different thresholds makes sense.

tjmier commented 2 months ago

One strategy that I've have been experimenting with is scoring the columns by taking the harmonic mean of the column occupancy and lddt score. This ensures that structurally homologous residues that are also present in the majority of structures are scored the highest. This ensure that columns that only contain a small amount of sequences with structurally homologous residues do not pollute the overall score.

You can also include a beta value to adjust the weight between lddt and occupancy, like what is done with F-score for precision and recall. To weight LDDT and occupancy equally beta would be set to 1.

CodeCogsEqn

I wonder if the alignments would improve if this was used as the scoring function while constructing the alignments.

image

gamcil commented 2 months ago

How did you calculate the scores here (0.74 vs 0.77)? Using msa2lddt on an alignment from easy-msa with 1aab and 1j46 I get 0.6 regardless of order in the file. The LDDT calculation is asymmetric but we pre-sort all of the sequences by their database index to remove the order dependency (as long as the same database is used).