Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.
The TAPE manuscript describes creating the splits for the stability task as:
"We create training, validation, and test splits ourselves, partitioning the data so that training and validation sets come from four rounds of experimental data measuring stability for many candidate proteins, while our test set consists of seventeen 1-Hamming distance neighborhoods around promising proteins observed in the four rounds of experimentation."
The raw-data file stability_train.json contains a few villin HP35 sequences, for example the record with "id": "villin_K48M". I expected all the villin sequences to be in the test split, because villin was one of the saturation mutagenesis experiments, not one of the four design rounds.
Were some of the saturation mutagenesis data included in the training or validation splits?
I went back to the Rocklin 2017 supplementary files and saw that these villin sequences come from the file rd4_stability_scores, so they were measured as part of the fourth round of experiments. That answers my question about the data splitting.
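For anyone who wants to repeat the check, here is a minimal sketch of how the villin records can be listed. It assumes that stability_train.json holds a JSON list of records, each with an "id" field as in the example above; the helper names `load_records` and `find_records_by_id_prefix` are made up for illustration and are not part of TAPE.

```python
import json
from pathlib import Path


def load_records(path):
    """Load a JSON file that is assumed to contain a list of record dicts."""
    with Path(path).open() as handle:
        return json.load(handle)


def find_records_by_id_prefix(records, prefix):
    """Return the records whose "id" field starts with the given prefix.

    Records without an "id" field are skipped rather than raising an error.
    """
    return [r for r in records if str(r.get("id", "")).startswith(prefix)]


if __name__ == "__main__":
    # Assumed location of the raw data file; adjust to your local copy.
    path = Path("stability_train.json")
    if path.exists():
        villin = find_records_by_id_prefix(load_records(path), "villin")
        print(f"{len(villin)} villin records in {path.name}")
        for record in villin:
            print(record["id"])
```

Running the same function over the validation and test files would show how the villin sequences are distributed across the three splits.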