polaris-hub / polaris

Foster the development of impactful AI models in drug discovery.
https://polaris-hub.github.io/polaris/
Apache License 2.0

Simple format downloads of the datasets (especially the small ones) #209

Open PaulC61 opened 3 weeks ago

PaulC61 commented 3 weeks ago

Is your feature request related to a problem? Please describe.

The API you're developing is nice: you can load and split the dataset for each benchmark. However, it limits flexibility and control when people want to apply their own splitting regime, for example to integrate with previous studies. This matters especially for the (currently) smaller datasets (kinases), where LOO-CV is really needed.

Describe the solution you'd like

It might not be a very "techy" way of doing things, but having the option to just download the datasets in a simple format (JSON, HDF5, .txt, .csv) would be a nice feature for integrating the curated datasets into existing workflows without a fuss. The flexibility and control would then be up to us.

cwognum commented 3 weeks ago

Hey @PaulC61, thanks for the kind words and thank you for the feedback!

To answer your question first, you can download all data associated with a dataset using

dataset.to_json(destination=...)

(Side note: We should probably update the name of that function... Something like download or save seems clearer.)

As described in the docs, this will download multiple files. In most cases, you will likely only be interested in the table.parquet file, which stores the Pandas DataFrame that holds the actual data. FYI: you can also access this same table using dataset.table.
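To make the LOO-CV use case concrete, here is a minimal sketch of running a custom leave-one-out split on an exported table. It assumes the table has already been written to CSV, for example with the standard pandas call `dataset.table.to_csv(...)`, and uses only the Python standard library from there on. The CSV contents, the column names (`smiles`, `activity`), and the helper names `load_rows` and `leave_one_out` are hypothetical placeholders, not part of Polaris or of any real dataset.

```python
import csv
import io

# Hypothetical stand-in for a file written with dataset.table.to_csv(...).
# The column names and values are made up for illustration only.
CSV_TEXT = """smiles,activity
CCO,0.1
CCN,0.4
CCC,0.9
"""


def load_rows(handle):
    """Read a CSV file object into a list of dicts, one per row."""
    return list(csv.DictReader(handle))


def leave_one_out(rows):
    """Yield (train_rows, test_row) pairs, holding out each row exactly once."""
    for i, held_out in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        yield train, held_out


# For a real export, replace io.StringIO(CSV_TEXT) with open(path, newline="").
rows = load_rows(io.StringIO(CSV_TEXT))
splits = list(leave_one_out(rows))

for train, test in splits:
    print(test["smiles"], "held out;", len(train), "rows for training")
```

With the data in a plain list of rows, any other splitting regime (scaffold-based, time-based, and so on) can be swapped in for `leave_one_out` without touching the loading step.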

Now, having said that, we have something called "pointer columns" (see the docs) and are working on a new dataset implementation (to be announced soon!) that builds on that. This new implementation will actually forgo the need for the table attribute in favor of a Zarr-only implementation. These cases may make it more difficult to do what you're interested in!

So why do we do this? We have good reason to! We are purposefully giving up some flexibility in favor of standardization and ease of use.

Does that make sense?

A final word: If you're interested in LOO-CV, you may be interested in our paper on method comparison which is coming out very soon! See here for more context.