tbepler / protein-sequence-embedding-iclr2019

Source code for "Learning protein sequence embeddings using information from structure" - ICLR 2019

IMP topology great but secondary structure not so much? #12

Closed: smsaladi closed this issue 4 years ago

smsaladi commented 4 years ago

Cool work! And thanks for putting the code + weights + datasets out there.

I'm sort of surprised that the topology prediction is competitive with TOPCONS, yet the secondary structure prediction seems pretty far behind other methods.

Johansen et al. use a biRNN-CRF (https://dl.acm.org/citation.cfm?doid=3107411.3107489), which is similar to the model you use, right? Yet they see much better predictive performance. Do you have a feel for why this is the case? Any chance you've evaluated against the datasets they use (CullPDB, CB513, CASP10, CASP11)?

tbepler commented 4 years ago

The secondary structure prediction benchmark is strictly an internal comparison of the quality of the embeddings given by various ablated models. The secondary structure prediction model is extremely simple: it predicts the secondary structure at each position only from the embedding at that position. There is no CRF, CNN, RNN, etc. component that would incorporate additional sequence context or secondary structure transition probabilities.
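To make "only from the embedding at that position" concrete, here is a minimal NumPy sketch of a per-position classifier. It is not the repository's code: the function name `predict_per_position`, the linear form of the classifier, the sizes, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

def predict_per_position(embeddings, weights, bias):
    """Label each residue from its own embedding only (no sequence context)."""
    logits = embeddings @ weights + bias   # shape (L, K)
    return logits.argmax(axis=1)           # shape (L,)

rng = np.random.default_rng(0)
L, D, K = 12, 8, 3  # hypothetical sequence length, embedding size, class count

embeddings = rng.normal(size=(L, D))  # stand-in for real per-residue embeddings
weights = rng.normal(size=(D, K))     # stand-in for trained classifier weights
bias = rng.normal(size=K)

labels = predict_per_position(embeddings, weights, bias)

# Shuffling the residues just shuffles the labels: the classifier never
# looks at neighbouring positions.
perm = rng.permutation(L)
shuffled_labels = predict_per_position(embeddings[perm], weights, bias)
```

Because each row is scored independently, any sequence context has to come from the embedding itself, which is why a benchmark of this kind compares embeddings rather than decoders.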

I haven't run any benchmarks with more sophisticated decoders, but using one is likely to give competitive performance, for example by using the embeddings as features for the model you cite.
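As a rough illustration of what a decoder with transition scores adds, the sketch below runs generic Viterbi decoding over per-position scores. This is neither the model from the cited paper nor anything in this repository: `viterbi_decode`, the two-class setup, and the hand-picked emission and transition scores are assumptions made up for the example.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Best label path for per-position scores (L, K) and transition
    scores (K, K), where transitions[i, j] scores moving from i to j."""
    L, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((L, K), dtype=int)
    for t in range(1, L):
        # candidate[i, j]: best path ending in label i at t-1, then moving to j
        candidate = score[:, None] + transitions
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0) + emissions[t]
    # Trace the best final label back to the start of the sequence.
    path = [int(score.argmax())]
    for t in range(L - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Hypothetical scores: the middle position weakly prefers label 1.
emissions = np.array([[2.0, 0.0],
                      [0.0, 0.5],
                      [2.0, 0.0]])
# Hypothetical transition scores that penalise switching labels.
transitions = np.array([[0.0, -2.0],
                        [-2.0, 0.0]])

independent = emissions.argmax(axis=1).tolist()    # per-position decision
smoothed = viterbi_decode(emissions, transitions)  # decision with transitions
```

Here the per-position decision flips to label 1 in the middle, while the decoder with transition scores keeps label 0 throughout because switching twice costs more than the weak preference gains.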

smsaladi commented 4 years ago

Ah I see. Thanks for the clarification!

tbepler commented 4 years ago

My pleasure!