mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

Base Roundtrip Accuracy #4

Closed lhallee closed 7 months ago

lhallee commented 7 months ago

Hello,

Did you guys try / record the base roundtrip accuracy for the test set? Best, Logan

mheinzinger commented 7 months ago

Hi, yes, I recorded this and tried to recover the file you are asking for (I assume "base" roundtrip accuracy means roundtrip accuracy for generating only a single candidate sequence, i.e., without generating/filtering until a roundtrip accuracy of, e.g., >70% was reached). Hope it helps: roundtrip_statistics.csv

Best, Michael

P.S.: There is also a column called "PPL" because I tried to find some correlation with perplexity, but the roundtrip-accuracy direction appeared more promising/easier to interpret to me. That said, the PPL field always says -666 because I did not compute it for this run.

lhallee commented 7 months ago

Thanks so much for sharing, this is perfect. I assume "similarity" is the accuracy %, i.e., correct if the tokens match, incorrect if not, correct/total?

lhallee commented 7 months ago

Also, is it okay to report this as SOTA roundtrip accuracy? The average was 70.6% from what you sent. We have a BERT-like model that is getting 70+% on the same data, so I think this is a great comparison. SAProt can't do this task because they didn't train on filling in their structure tokens.

mheinzinger commented 7 months ago

I assume "similarity" is the accuracy %, i.e., correct if the tokens match, incorrect if not, correct/total?

Ah, no, sorry, I should have added this: I used the Foldseek substitution matrix together with a global alignment.
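For reference, a minimal sketch of how such an alignment-based similarity could be computed with Biopython is below. The matrix file path, the gap penalties, and the "positive substitution score counts as similar" definition are assumptions for illustration, not necessarily the exact procedure used to produce the CSV:

```python
# Sketch: alignment-based similarity between a native 3Di string and a
# back-translated 3Di string, using a substitution matrix plus global alignment.
# Assumes the Foldseek 3Di matrix is available in Biopython's matrix format.
from Bio import Align
from Bio.Align import substitution_matrices

def threedi_similarity(native_3di: str, predicted_3di: str, matrix_path: str) -> float:
    with open(matrix_path) as fh:
        matrix = substitution_matrices.read(fh)

    aligner = Align.PairwiseAligner()
    aligner.mode = "global"
    aligner.substitution_matrix = matrix
    aligner.open_gap_score = -10.0   # assumed gap penalties, not from the paper
    aligner.extend_gap_score = -0.5

    alignment = aligner.align(native_3di.upper(), predicted_3di.upper())[0]
    aligned_native, aligned_pred = alignment[0], alignment[1]  # aligned rows with "-" gaps

    # Count aligned (non-gap) positions with a positive substitution score.
    similar = 0
    for a, b in zip(aligned_native, aligned_pred):
        if a == "-" or b == "-":
            continue
        if matrix[a, b] > 0:
            similar += 1
    return similar / len(aligned_native)
```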

Also, is it okay to report this as SOTA roundtrip accuracy?

Oh, thanks for asking :) - If I had to redo this right now, I would go for our dedicated 3Di predictor. It is essentially just a 2-layer CNN trained on top of embeddings from ProstT5's encoder. So if you do not need a distribution over solutions but only a single solution (as you usually do if you want to use 3Di for searching remote homologs, or if you want to use it for roundtrip-based filtering), I would go for this one. The rationale is that it is MUCH faster: you do not need to decode token-by-token, but rather have a single forward pass through the encoder that translates all amino acids in the input to 3Di tokens (probably similar to your BERT-like model). Sorry for not having it in the paper yet, but this is ongoing work, and in the next iteration of the paper we will also describe this CNN.
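To make the idea concrete, here is a minimal PyTorch sketch of a 2-layer CNN head that maps per-residue encoder embeddings to 3Di logits. The hyperparameters (kernel size, hidden width, embedding dimension) are assumptions for illustration; the actual predictor and its checkpoint live in the ProstT5 repo's scripts:

```python
# Sketch of a 2-layer CNN head over per-residue encoder embeddings.
# Hyperparameters are illustrative assumptions, not the published architecture.
import torch
import torch.nn as nn

class CNN3DiHead(nn.Module):
    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 32, num_3di_states: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(embed_dim, hidden_dim, kernel_size=7, padding=3),   # mixes a local window of residues
            nn.ReLU(),
            nn.Conv1d(hidden_dim, num_3di_states, kernel_size=7, padding=3),  # per-residue 3Di logits
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, embed_dim) from the ProstT5 encoder
        x = embeddings.transpose(1, 2)        # -> (batch, embed_dim, seq_len)
        logits = self.net(x)                  # -> (batch, num_3di_states, seq_len)
        return logits.transpose(1, 2)         # -> (batch, seq_len, num_3di_states)

# Usage: one encoder forward pass, then argmax per residue gives the 3Di string indices.
head = CNN3DiHead()
dummy_embeddings = torch.randn(1, 150, 1024)  # stand-in for real encoder output
predicted_states = head(dummy_embeddings).argmax(dim=-1)  # shape (1, 150)
```

This is also why it is so much faster than decoding: the cost is one encoder pass plus a tiny convolutional head, with no token-by-token generation.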

For SAProt: maybe I would have to re-read the paper, but didn't they have these mixed tokens "Aa" (upper case amino acids, lower case 3Di, or vice versa, not sure), where you could mask out either the 3Di ("A?") or the amino acids ("?a") and reconstruct it (so you could mask out all 3Di tokens and ask the model to reconstruct them)?

lhallee commented 7 months ago

Thanks for the info! Yes, I think the 3Di predictor is a great comparison to our BERT-like model. Is there a checkpoint available for it yet? If not, I will await the next iteration of the paper.

You are correct about the mixed tokens; it is the same approach we took. However, they chose to mask only the amino acid portions during training, so their model cannot recover masked 3Di tokens at anything beyond random chance. Our approach was to mask the amino acid portion, the 3Di portion, or both, so it is quite good at this roundtrip task.
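Purely as an illustration of that masking scheme (not the actual training code of either model, and with a hypothetical `mask_mixed_tokens` helper and "#" mask symbol), the idea looks roughly like this:

```python
# Hypothetical sketch of masking either half of mixed AA/3Di tokens.
# "A" = amino acid letter, "a" = 3Di state letter, "#" = mask placeholder.
import random

def mask_mixed_tokens(aa_seq: str, tdi_seq: str, mode: str = "both", mask: str = "#"):
    """Return mixed tokens like 'Md', 'Kv', ... with one or both halves masked."""
    assert len(aa_seq) == len(tdi_seq)
    if mode == "random":
        mode = random.choice(["aa", "3di", "both"])
    mixed = []
    for a, d in zip(aa_seq.upper(), tdi_seq.lower()):
        a_out = mask if mode in ("aa", "both") else a
        d_out = mask if mode in ("3di", "both") else d
        mixed.append(a_out + d_out)
    return mixed

# Masking all 3Di halves ("A#") asks the model to reconstruct structure from sequence,
# which is the roundtrip-style task discussed above.
print(mask_mixed_tokens("MKV", "dvl", mode="3di"))  # ['M#', 'K#', 'V#']
```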

mheinzinger commented 7 months ago

Yes, you can run predict_3Di_encoderOnly.py to get predictions from our dedicated 3Di predictor (it downloads the checkpoint automatically); you just have to provide an input FASTA file, as explained here: https://github.com/mheinzinger/ProstT5/tree/main/scripts

Cool, great to hear that you got promising results from this approach :) - let me know once you have a pre-print up

lhallee commented 7 months ago

Thanks for the help! Will do :)