mims-harvard / TDC

Therapeutics Commons (TDC-2): Multimodal Foundation for Therapeutic Science
https://tdcommons.ai
MIT License

Why have cross validation split? #58

Closed: lilleswing closed this issue 3 years ago

lilleswing commented 3 years ago

First, amazing work. Having these datasets and official challenges is very helpful to the community.

Having an explicit cross-validation split seems weird to me. Shouldn't it be up to the modeler to figure out what their model needs to generalize to the test set? A specifically chosen cross-validation set would be useful if it were drawn from the same distribution as the test set, but that does not seem to be the case.

[Screenshot, 2021-02-26: distributions of the training, cross-validation, and test sets]

As you can see, the distribution is different from the training set's, but also very different from the test set's! What is the point of cross-validating against this? It could be argued that in a time-series split one would not know the future distribution of the test set, but that brings us back to why this benchmark should define a cross-validation set at all.

Are there rules for how to use the cross-validation set? Is your model only allowed to "peek" at these compounds so many times? Are ensemble models that do k-fold or bootstrap cross validation discouraged from the competition?

lilleswing commented 3 years ago

Here is a more quantitative depiction of how the distributions of the validation set and the test set don't align.

[Screenshot, 2021-02-26: violin plots comparing the validation and test set distributions]

If the CV set and the test set were drawn from the same distribution, the violins in the bottom-right four boxes would look the same or very similar; this is not the case.
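
For readers who want to reproduce this kind of check, here is a minimal sketch using RDKit descriptors and seaborn. The dataset name (`Caco2_Wang`) and the descriptor (molecular weight) are illustrative assumptions; the plots above may have used a different dataset and different properties.

```python
# Sketch: compare a simple descriptor's distribution across TDC splits.
# The dataset and descriptor are illustrative choices, not necessarily
# what the plots above used.
import pandas as pd
import seaborn as sns
from rdkit import Chem
from rdkit.Chem import Descriptors
from tdc.single_pred import ADME

data = ADME(name='Caco2_Wang')               # any TDC dataset with SMILES
splits = data.get_split(method='scaffold')   # dict of train/valid/test DataFrames

frames = []
for split_name, df in splits.items():
    weights = [Descriptors.MolWt(Chem.MolFromSmiles(s)) for s in df['Drug']]
    frames.append(pd.DataFrame({'split': split_name, 'MolWt': weights}))

# One violin per split; aligned distributions would give similar shapes.
sns.violinplot(data=pd.concat(frames, ignore_index=True), x='split', y='MolWt')
```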

kexinhuang12345 commented 3 years ago

Hey Karl! Thanks for the issue and the plots! Yes, we use the scaffold split to simulate a time-based split, since we don't have timestamp info. The intent is that the model should be able to generalize to a structurally diverse set of drugs, so there is at least some level of structural diversity among the train/valid/test sets, which is what the violin plots you showed suggest. Would love to chat more about the caveats of using this split, especially on a relatively small dataset!
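
For context, a scaffold split groups molecules by their Bemis-Murcko scaffold and assigns whole groups to a split, so test compounds are structurally dissimilar to the training set. TDC exposes this via `get_split(method='scaffold')`; below is a minimal sketch of the idea. The greedy largest-group-first heuristic is an assumption here, and TDC's actual implementation may differ.

```python
# Sketch of how a scaffold split works: group molecules by Bemis-Murcko
# scaffold, then assign whole scaffold groups to train/valid/test. The
# greedy largest-group-first heuristic is a common choice, assumed here.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(i)

    train, valid, test = [], [], []
    n = len(smiles_list)
    # Assign the largest scaffold groups to train first, spilling the
    # rarer scaffolds into valid and test.
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test  # index lists into smiles_list
```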

lilleswing commented 3 years ago

So my comment is both philosophical and laced with my own biases from designing these models for industry. First, other major competition datasets (MNIST: http://yann.lecun.com/exdb/mnist/, CIFAR-10, CIFAR-100) generally do not ship with "cross validation" labels; instead, cross validation is left to the user to determine, so that they can verify their models are not over-fitting the training set.

Second is a personal opinion: given the chemical distinction between the cross-validation set and the test set, I think using the valid set for traditional cross validation is a bad idea from a model-performance standpoint. Showing that you can transfer to the specific distribution of the cross-validation set is not the same as transferring to the distribution of the test set. Validation sets exist, to a certain extent, so that you can somewhat over-fit to the validation set and still expect good performance on the test set; I do not think that is the case here.

I think a better option would be one of the following:

  1. Just join the cross-validation and training sets together and let individual modelers choose how to add regularization to their models.

  2. Shuffle the existing cross-validation and test sets together, then randomly split them back into cross-validation and test sets. This is more representative of the lead-optimization stage of a drug discovery project: the general chemistry is figured out, and the team is zeroing in on small changes to get a molecule into the clinic. (A sketch of both options follows below.)
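
A minimal sketch of both proposals, assuming each split is a pandas DataFrame as TDC returns; `train`, `valid`, and `test` are placeholders for the benchmark splits.

```python
# Sketch of the two proposals; `train`, `valid`, and `test` stand in for
# the benchmark's split DataFrames.
import pandas as pd
from sklearn.model_selection import train_test_split

# Option 1: fold the validation set back into training and leave
# regularization / internal validation to the modeler.
train_full = pd.concat([train, valid], ignore_index=True)

# Option 2: pool validation and test, then re-split at random so both
# sets come from the same distribution.
pool = pd.concat([valid, test], ignore_index=True)
valid_new, test_new = train_test_split(pool, test_size=len(test), random_state=0)
```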

kexinhuang12345 commented 3 years ago

Thanks for the comment! That makes a lot of sense; I will go ahead and merge the training and validation sets together and allow users to choose the split they like. Will update with a PR here later.

kexinhuang12345 commented 3 years ago

See https://github.com/mims-harvard/TDC/pull/60 and the updated website: https://tdcommons.ai/benchmark/overview/. Closing for now!

skalyanb commented 3 years ago

Just to confirm my understanding: we can split the training set into train and validation sets however we like for hyperparameter tuning. Is that correct?

kexinhuang12345 commented 3 years ago

Yes, that is correct.
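
In other words, a typical workflow would look something like the sketch below; the 10% validation fraction and the seed are arbitrary choices, and `train` stands in for the benchmark's training DataFrame.

```python
# Sketch: carve a validation set out of the provided training data for
# hyperparameter tuning; the fraction and seed are arbitrary choices.
from sklearn.model_selection import train_test_split

train_sub, valid = train_test_split(train, test_size=0.1, random_state=42)
# Tune hyperparameters on (train_sub, valid), then retrain on all of
# `train` and evaluate once on the held-out test set.
```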

skalyanb commented 3 years ago

Thank you for the clarification.