songlab-cal / tape

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.
https://www.biorxiv.org/content/10.1101/676825v1
BSD 3-Clause "New" or "Revised" License

Data splits for stability task #134

Closed by agitter 1 year ago

agitter commented 1 year ago

The TAPE manuscript describes creating the splits for the stability task as follows:

"We create training, validation, and test splits ourselves, partitioning the data so that training and validation sets come from four rounds of experimental data measuring stability for many candidate proteins, while our test set consists of seventeen 1-Hamming distance neighborhoods around promising proteins observed in the four rounds of experimentation."

In stability_train.json from the raw data, there are a few villin HP35 sequences, such as the one with "id": "villin_K48M". I expected all the villin sequences to be in the test split because villin HP35 was the subject of one of the saturation mutagenesis experiments, not one of the four design rounds.

Were some of the saturation mutagenesis data included in the training or validation splits?
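For anyone who wants to reproduce this check, here is a minimal sketch in Python. It assumes that a split file such as stability_train.json holds a JSON list of record objects, each with an "id" field like the one quoted above. That layout is an assumption and has not been checked against the TAPE data format. The helper names `find_ids_with_prefix` and `load_records` are hypothetical, and the second example record is made up for illustration.

```python
import json
from pathlib import Path


def find_ids_with_prefix(records, prefix):
    """Return the "id" values that start with `prefix`, in input order."""
    return [r["id"] for r in records if str(r.get("id", "")).startswith(prefix)]


def load_records(path):
    """Load a split file, ASSUMING it holds a JSON list of record objects."""
    with Path(path).open() as handle:
        return json.load(handle)


# Tiny in-memory stand-in for the real file. The first id is the one quoted
# in this issue; the second id is invented purely for illustration.
example_records = [
    {"id": "villin_K48M"},
    {"id": "hypothetical_design_0001"},
]
villin_ids = find_ids_with_prefix(example_records, "villin")
print(villin_ids)
```

With the real data, one would call something like `find_ids_with_prefix(load_records("stability_train.json"), "villin")`, adjusting the path and the record layout to whatever the downloaded files actually contain.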

agitter commented 1 year ago

I went back to the Rocklin 2017 supplementary files and saw that these villin sequences are from the file rd4_stability_scores, i.e. the data from the fourth design round. That answers my question about the data splitting.