tbepler / protein-sequence-embedding-iclr2019

Source code for "Learning protein sequence embeddings using information from structure" - ICLR 2019

Pfam preprocessing? #2

Closed by nickbhat 5 years ago

nickbhat commented 5 years ago

I'd appreciate details on how you subset Pfam (e.g., threw out small families, long sequences, etc.) when initially training the LMs. I couldn't find many details in the paper or the repo.

tbepler commented 5 years ago

No such preprocessing was done. You can download the exact dataset used by following the link in the README.