mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

How well does this scale with length of amino acids? #23

Open Andy-B-123 opened 1 week ago

Andy-B-123 commented 1 week ago

Hi, thank you for this approach, absolutely fascinating! I'm hoping to use this in a workflow to assess and triage eukaryotic genome annotations using Foldseek. The protein sequences can range from small (<50 amino acids) to very large (many kbp). I see a figure in the manuscript where the sequence lengths evaluated went up to 500 amino acids, but wanted to check if you had any information or details for larger genes, or if the results might be similar to the input database?

Anecdotally, for a small handful of genes up to 1.5 kbp that I've checked manually, the results from ProstT5 -> Foldseek are similar to the ones I get from a Foldseek search with the corresponding structure predicted by AlphaFold3.

Thank you for any thoughts on this!

mheinzinger commented 1 week ago

Hi, we do not have enough data points from our benchmarks to reliably assess performance for such long proteins. In general: yes, our model can produce output for proteins of arbitrary length. The major limitation will be VRAM at some point. T5 models use a relative/learnt positional encoding, which was shown to extrapolate reasonably well (though not perfectly) to sequences longer than those seen during training. So long story short: I am sorry, but I cannot give you any more detail than this: yes, it should work for longer proteins, but performance might get a bit worse the longer the sequence gets. Unfortunately, we have no proper benchmarks on this yet. If you measure something at some point, I would be happy to read an update here :) (maybe it then also helps people in the future who wonder the same)
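Since VRAM is the likely bottleneck for very long proteins, one common workaround for mixed-length inputs is to sort sequences by length and cap the padded size of each batch. The sketch below is plain Python and not part of ProstT5. The function name `length_sorted_batches` and the defaults for `max_residues` and `max_batch` are hypothetical and would need tuning to the GPU at hand.

```python
from typing import Dict, Iterator, List, Tuple


def length_sorted_batches(
    seqs: Dict[str, str],
    max_residues: int = 4000,  # assumed budget: padded length x batch size
    max_batch: int = 32,       # assumed cap on sequences per batch
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches of (id, sequence), longest first, under a padded-size budget."""
    # Longest first, so a sequence that does not fit in memory fails early.
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    batch: List[Tuple[str, str]] = []
    for seq_id, seq in ordered:
        # The first entry of a batch is its longest and sets the padded length.
        longest = len(batch[0][1]) if batch else len(seq)
        padded_size = longest * (len(batch) + 1)
        if batch and (padded_size > max_residues or len(batch) >= max_batch):
            yield batch
            batch = []
        batch.append((seq_id, seq))
    if batch:
        yield batch
```

A sequence longer than `max_residues` still ends up in a batch of its own, so very long proteins are processed one at a time rather than dropped. Whether such a single sequence fits in memory depends on the hardware and has to be checked empirically.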