polaris-hub / polaris

Foster the development of impactful AI models in drug discovery.
https://polaris-hub.github.io/polaris/
Apache License 2.0

How to provide validation set partitions #139

Open · fteufel opened this issue 2 months ago

fteufel commented 2 months ago

Context

Datasets for deep learning typically include a validation/dev partition in addition to train and test, used to monitor model training. Right now, Polaris only accepts a train and a test partition in the split when defining a benchmark.

Description

It will be necessary to also provide the validation partition to the user. How should this be done? By extending the API, or by 'abusing' the optional dictionary structure for multiple test sets?
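
To make the second option concrete, this is roughly the structure I have in mind (a sketch only; the exact Polaris field names may differ):

# Rough sketch, illustrative indices only
split = (
    [0, 1, 2, 3],                  # train indices
    {                              # optional dict of named test sets
        "test": [4, 5],
        "validation": [6, 7],      # 'abusing' a second test set as the validation set
    },
)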

cwognum commented 2 months ago

Hey @fteufel, thanks again! 😄

Great question! We deliberately don't include the specification of a validation set. To us, the choice of validation set is a design decision during method development that we don't want to constrain. Take hyper-parameter search, for example: some may argue that a random split for the validation set is best because it results in a more diverse training set, whereas others may argue that some sort of OOD split is best because it selects hyper-parameters that are better suited for generalization. From Polaris' perspective, both are valid design decisions one can make during method development, and it should be up to the user to decide.

What do you think of this reasoning? Do you agree? Since this is turning into more of a philosophical discussion than a feature request, and because I expect this question to come up more often, I suggest we move this conversation to a GitHub Discussion.


Having said all that, I do think we can do a better job of making it easy to split off a validation set as part of the Polaris API. I'm very much open to suggestions on how you think that could or should work! Maybe something like:

import polaris as po

benchmark = po.load_benchmark(...)

train, test = benchmark.get_train_test_split(...)
train, val = train.split(indices=([0, 1, 2], [3]))  # proposed Subset.split(), does not exist yet

fteufel commented 2 months ago

I definitely agree that this can be left as a design decision - but if we already have some recommended OOD scheme defined on the data, it would be helpful to have a way to provide it to the user in case they want to stick with it.

Right now I don't see a good way to provide it - if I understand correctly, when you access the data via the benchmark only, you are only ever exposed to target_cols and input_cols, and no additional info that might be helpful for splitting the validation set.

cwognum commented 2 months ago

and no additional info that might be helpful for splitting the validation set.

To make sure I understand: What type of information are you thinking of here?

fteufel commented 2 months ago

E.g. some structural/sequence-identity clustering of the samples that was used for splitting - something that would take quite some extra effort to reproduce ad hoc on the train data.

cwognum commented 2 months ago

Thank you for sharing some additional information.

Given the reasons outlined above, I still don't think we want to support specifying a fixed validation split. I do think, however, that we should work on features that make it easier to split the train set into a train and a validation portion (like adding a Subset.split() method, as I suggested in the API sketch above). With such enhancements, replicating a recommended validation split should be easy enough.

some structural/sequence identity clustering of the samples that was used for splitting.

You could add a column to your dataset that contains that information if it's important! Building on the above API example:

import numpy as np
import polaris as po

benchmark = po.load_benchmark(...)

train, test = benchmark.get_train_test_split(...)

# Access the underlying dataset
df = benchmark.dataset.table
train_df = df.iloc[train.indices]

# Look up the precomputed cluster assignment stored in a column of that dataset
clustering = train_df["clustering"]
train_ind = np.where(clustering == "train")[0]
val_ind = np.where(clustering == "val")[0]

# Get the train and validation split (using the proposed Subset.split())
train, val = train.split(indices=(train_ind, val_ind))

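Note that the snippet above leans on the proposed Subset.split(), which doesn't exist yet. A minimal sketch of what it could look like, assuming a Subset simply wraps a dataset plus a list of row indices (names illustrative, not the actual implementation):

class Subset:
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def split(self, indices):
        # `indices` are positions *within* this subset; map them back to
        # the underlying dataset's row indices.
        first, second = indices
        return (
            Subset(self.dataset, [self.indices[i] for i in first]),
            Subset(self.dataset, [self.indices[i] for i in second]),
        )
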
fteufel commented 2 months ago

Sure, I can do that. The reason I asked in the first place is that the whole API gave me the impression that you wanted to discourage users from interacting with the actual dataframe directly - it's buried two levels down in the benchmark object.

If that's not the case, I guess I will just put it there and mention it in the README.

cwognum commented 2 months ago

You're not wrong... This would definitely be a use case for advanced Polaris users, but that may be okay for a workflow we don't want to officially support (at least not yet).

An alternative could be to specify it as an additional input column, but that feels a bit hacky. Maybe we could introduce "metadata columns" (in addition to target and input columns)? These would hold data you may want to use during training, but don't have to. Not sure - that feels a little convoluted and confusing as well! What do you think?
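
Purely to make that idea concrete - nothing below exists in Polaris today, and the class and field names are hypothetical:

from dataclasses import dataclass, field

@dataclass
class HypotheticalBenchmark:
    input_cols: list       # columns the model consumes
    target_cols: list      # columns the model is scored on
    metadata_cols: list = field(default_factory=list)  # hypothetical: e.g. ["clustering"], usable during training, never scored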

cwognum commented 2 months ago

Final thought: as you suggested, specifying a second "test" set and naming it validation could work too! I do think that would make it mandatory for every submission to also include results on the validation set.
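
Roughly, consuming that could then look like this (hedged - I'd have to double-check whether get_train_test_split() returns the test subsets keyed by name when there are several of them):

import polaris as po

benchmark = po.load_benchmark(...)
train, test_sets = benchmark.get_train_test_split()

val = test_sets["validation"]  # for early stopping / model selection
test = test_sets["test"]       # the split whose metrics are actually reported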

fteufel commented 2 months ago

Considering everything, providing it as a second test set does sound the cleanest :)

Is there anything to consider to prevent people from then pushing results from the "wrong" test set back to the hub?

cwognum commented 2 months ago

You'll see results for both test sets on the leaderboard. You can select which results to view with the dropdown above the leaderboard.

fteufel commented 2 months ago

Ah yes. I see in the source that it would then be mandatory to also evaluate on the validation set. Not really useful, but I guess we can live with that.
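
For my own reference, a submission would then presumably look roughly like this (hedged - I'm assuming predictions for multi-test-set benchmarks are keyed by test-set name, which I haven't verified):

results = benchmark.evaluate({
    "test": y_pred_test,
    "validation": y_pred_val,  # mandatory as well, even though we only care about "test"
})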