polaris-hub / polaris

Foster the development of impactful AI models in drug discovery.
https://polaris-hub.github.io/polaris/
Apache License 2.0

Specification of validation set in Benchmark object #27

Closed 11 months ago by zhu0619

zhu0619 commented 1 year ago

I am not sure if this has been discussed, but why not include the validation set in the benchmark specification? Don't we want to make sure users also use the same validation set? @cwognum @hadim Just want to make sure before I make any changes.

cwognum commented 1 year ago

Hey @zhu0619, this is intentional! How you split the dataset into a train and a validation set should be considered a design choice in method development (e.g. this is also one of the tools we benchmarked in MOOD). Methods that use different validation sets can still be compared as long as the test set is the same.

I actually think there are some interesting, open questions as to what makes for a good validation set. E.g. should the validation set be very different from the train set to evaluate generalization, or should you rather use a random split so that the model sees more diverse chemistry during training? Another example could be around data pruning. And finally, some methods might not even use a validation set (e.g. physics-informed methods).
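To make the two strategies concrete, here is a minimal sketch (not polaris-specific; it uses RDKit and scikit-learn, and the SMILES are toy placeholders):

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Toy SMILES standing in for a training pool (placeholders).
smiles = ["c1ccccc1O", "c1ccccc1N", "c1ccccc1C", "C1CCCCC1O", "C1CCCCC1N", "CCO", "CCN", "CCC"]

# Strategy 1: random split. Validation chemistry resembles the train
# chemistry, and the model trains on more diverse structures.
train_idx, val_idx = train_test_split(list(range(len(smiles))), test_size=0.25, random_state=0)

# Strategy 2: scaffold split. Whole Bemis-Murcko scaffolds are held out,
# so the validation set is structurally dissimilar from the train set
# and probes generalization.
scaffolds = [MurckoScaffold.MurckoScaffoldSmiles(mol=Chem.MolFromSmiles(s)) for s in smiles]
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx_s, val_idx_s = next(splitter.split(smiles, groups=scaffolds))
```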

Hope that helps and clarifies my thinking. Happy to discuss further!

zhu0619 commented 1 year ago

Thanks for the clarification @cwognum. "What makes for a good validation set" is an important question, and I agree that the participants should make that decision. But if we have a specific "goal" for a benchmark that requires a specific validation set, then we should define the validation set on our side. To me, whether or how to specify a validation set is also part of benchmark design, depending on the context.

Currently, I am trying to build some baselines. I think I will keep things simple and use a random split for validation for now.

cwognum commented 1 year ago

But if we have a specific “goal” on that benchmark which requires a specific validation set

Could you give an example of one such goal? I can't think of any!

zhu0619 commented 1 year ago

Here is a typical example from drug discovery. Let's say a project started with series A and then expanded to series B and series C. The number of data points follows A > B > C. C is the series to explore further and has the fewest data points. The scaffolds of B and C are relatively close to each other. Our goal is to transfer knowledge from B to C and obtain a model that generalizes well to C. In this case, we might want to bias the validation set toward B on purpose.
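To make that concrete, a minimal sketch (the series labels and sizes below are made up for illustration):

```python
# Hypothetical series labels for the scenario above (sizes A > B > C).
series = ["A"] * 500 + ["B"] * 200 + ["C"] * 50

train_idx = [i for i, s in enumerate(series) if s == "A"]  # bulk of the data
val_idx = [i for i, s in enumerate(series) if s == "B"]    # validation biased to B on purpose
test_idx = [i for i, s in enumerate(series) if s == "C"]   # the generalization target
```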

I don't want to disturb you too much during your vacation. We can continue the discussion next week. :)

cwognum commented 1 year ago

In the example you gave, I think using B as a validation set could be a good strategy, but should not be enforced. E.g. maybe using a random split ultimately results in a better model on the test set (i.e. series C), since the model sees more diverse chemistry (i.e. part of series B) during training. Ultimately, I don't think we care about the intermediate results on the validation set. We care about the results on the test set. I am not yet convinced a fixed validation set is needed.

Having said that, the current API does support multiple (named) test sets, so we can add two "test" sets, one named "validation" and one named "test". This could be used to specify a validation set!
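For the record, a rough sketch of how that could look. The class, argument names, and metric string below follow the current polaris API as I understand it, and the dataset slug and column names are placeholders:

```python
import polaris as po
from polaris.benchmark import SingleTaskBenchmarkSpecification

# Placeholder dataset slug; not a real Polaris Hub artifact.
dataset = po.load_dataset("my-org/my-dataset")

benchmark = SingleTaskBenchmarkSpecification(
    name="my-benchmark-with-validation",
    dataset=dataset,
    target_cols=["activity"],  # placeholder target column
    input_cols=["smiles"],     # placeholder input column
    metrics=["mean_absolute_error"],
    # One train split plus two named "test" sets: the first acts as a
    # shared validation set, the second is the actual held-out test set.
    split=(
        list(range(0, 800)),
        {
            "validation": list(range(800, 900)),
            "test": list(range(900, 1000)),
        },
    ),
)
```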