tbepler / protein-sequence-embedding-iclr2019

Source code for "Learning protein sequence embeddings using information from structure" - ICLR 2019

Language Model Accuracy on Pfam? #4

Closed by nickbhat 5 years ago

nickbhat commented 5 years ago

I couldn't find the next-token prediction accuracy of the biLM in the paper. Would you mind sharing what you got after one epoch on Pfam as described in supplement A?

tbepler commented 5 years ago

It has about 29% next-token accuracy on the held-out sequences (and roughly the same on the training sequences). This will increase if the model is trained longer, but I didn't find that longer training made any difference when just using the hidden states as features for the structure-based embedding model.
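
For readers who want to reproduce a number like this, here is a minimal sketch of how per-token next-token accuracy could be computed. It is not the code from this repository: the function names and the input layout (one row of scores per predicted position, tokens encoded as integer indices) are assumptions made for illustration.

```python
def argmax(scores):
    """Index of the largest score in a list (first index wins ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def next_token_accuracy(examples):
    """Fraction of positions where the top-scoring token is the true next token.

    `examples` is an iterable of (scores, tokens) pairs (hypothetical layout):
      - tokens: list of integer token indices for one sequence
      - scores: list of len(tokens) - 1 rows, where scores[i] holds the
        model's scores for tokens[i + 1] given tokens[: i + 1]
    Accuracy is pooled over all positions of all sequences.
    """
    correct = 0
    total = 0
    for scores, tokens in examples:
        targets = tokens[1:]  # the first token has no preceding context
        if len(scores) != len(targets):
            raise ValueError("expected one row of scores per predicted position")
        for row, target in zip(scores, targets):
            correct += int(argmax(row) == target)
            total += 1
    return correct / total if total else 0.0


# Toy example with a 3-token alphabet; two of the three predictions are right.
toy_tokens = [0, 1, 2, 1]
toy_scores = [
    [0.1, 0.8, 0.1],  # predicts 1, target is 1
    [0.2, 0.3, 0.5],  # predicts 2, target is 2
    [0.6, 0.3, 0.1],  # predicts 0, target is 1
]
toy_accuracy = next_token_accuracy([(toy_scores, toy_tokens)])
```

Running the same function on the training and held-out sequences separately gives the two numbers being compared above.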

nickbhat commented 5 years ago

What was the train/test split? Something like 90/10?

tbepler commented 5 years ago

I used an 80/20 split. You can download the exact split from the link in the README.
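
To compare against the reported accuracy, use the downloadable split. For anyone who only needs a comparable split on their own data, a seeded random 80/20 partition can be sketched as below. This is an assumption-laden illustration, not the procedure used for the paper: the function name, the seed, and the choice to split uniformly at random over sequence IDs are all hypothetical.

```python
import random


def split_sequences(sequence_ids, train_fraction=0.8, seed=0):
    """Randomly partition sequence IDs into (train, held_out) lists.

    Hypothetical helper: sorts first so the result depends only on the
    seed, then shuffles and cuts at `train_fraction`.
    """
    ids = sorted(sequence_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# 100 placeholder IDs give 80 training and 20 held-out sequences.
train_ids, held_out_ids = split_sequences(range(100))
```

A uniform random split like this can place closely related sequences on both sides of the cut, so held-out accuracy from it may not be comparable to numbers obtained with a different splitting scheme.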