Data splits for stability task

The TAPE manuscript describes creating the splits for the stability task as follows:

> We create training, validation, and test splits ourselves, partitioning the data so that training and validation sets come from four rounds of experimental data measuring stability for many candidate proteins, while our test set consists of seventeen 1-Hamming distance neighborhoods around promising proteins observed in the four rounds of experimentation.

The stability_train.json file in the raw data contains a few villin HP35 sequences, for example the record with "id": "villin_K48M". I expected all the villin sequences to be in the test split, because villin was one of the saturation mutagenesis experiments, not one of the four design rounds.
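A minimal sketch of how this can be checked, assuming the split file is a single JSON array of records that each carry an "id" field, and that the text before the first underscore names the parent protein. Both are assumptions based on the one ID quoted above, not a documented format. The helper names `count_id_prefixes` and `load_records` are hypothetical.

```python
import json
from collections import Counter


def count_id_prefixes(records):
    """Count records by the part of "id" that precedes the first underscore."""
    return Counter(str(r.get("id", "")).split("_", 1)[0] for r in records)


def load_records(path):
    # Assumption: the file holds one JSON array of objects.
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    # Inline stand-in data; "example_0001" is a made-up ID for illustration.
    # With the real file: records = load_records("stability_train.json")
    records = [{"id": "villin_K48M"}, {"id": "example_0001"}]
    print(count_id_prefixes(records))
```

Running the same count on each split file would show whether the "villin" prefix appears only in the test split or also in training and validation.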

Were some of the saturation mutagenesis data included in the training or validation splits?
