Simple format downloads of the datasets (especially the small ones)

Is your feature request related to a problem? Please describe.

The API you're developing is nice in that you can load and split the dataset for each benchmark. However, flexibility and control become an issue when people want to apply their own splitting regime to integrate with previous studies. This is especially true for the (currently) smaller datasets (kinases), where LOO-CV is really needed.

Describe the solution you'd like

It might not be a very "techy" way of doing things, but having the option to just download the datasets in a simple format (JSON, HDF5, .txt, .csv) would be a nice feature for getting up and running and integrating the curated datasets into existing workflows without a fuss. The flexibility and control would then be up to us.

Hey @PaulC61, thanks for the kind words and thank you for the feedback!

To answer your question first, you can download all data associated with a dataset using

dataset.to_json(destination=...)

(Side note: We should probably rename that function. Something like download or save seems clearer.)

As described in the docs, this will download multiple files. In most cases, you will likely only be interested in the table.parquet file, which stores the Pandas DataFrame that holds the actual data. FYI: you can also access this same table using dataset.table.
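As a rough sketch of what a custom split could look like once the table is in hand: the snippet below exports a DataFrame to CSV and builds leave-one-out splits by hand. It does not call Polaris at all. The small DataFrame is a made-up stand-in for dataset.table, the column names and values are invented for illustration, and leave_one_out is a hypothetical helper, not part of any library.

```python
import io

import pandas as pd

# Stand-in for `dataset.table`. The columns and values here are made up;
# a real dataset will have its own schema.
table = pd.DataFrame(
    {
        "smiles": ["CCO", "CCN", "CCC", "CCCl"],
        "activity": [0.1, 0.7, 0.3, 0.9],
    }
)

# Export to a simple format. An in-memory buffer keeps the sketch
# self-contained; with real data you would pass a file path instead.
buffer = io.StringIO()
table.to_csv(buffer, index=False)
csv_text = buffer.getvalue()


def leave_one_out(df):
    """Yield (train, test) DataFrame pairs, holding out one row at a time.

    Rows are selected by position, so this works regardless of the index.
    """
    n_rows = len(df)
    for i in range(n_rows):
        test = df.iloc[[i]]
        train = df.iloc[[j for j in range(n_rows) if j != i]]
        yield train, test


# One split per row: each row is the test set exactly once.
splits = list(leave_one_out(table))
```

Each element of splits is a (train, test) pair that can be fed into an existing workflow. For larger tables, an established implementation such as scikit-learn's LeaveOneOut would likely be a better fit than this hand-rolled loop.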

Having said that, we have something called "pointer columns" (see the docs) and are working on a new dataset implementation (to be announced soon!) that builds on that. This new implementation will actually forgo the need for the table attribute in favor of a Zarr-only implementation. These cases may make it more difficult to do what you're interested in!

So why do we do so? We have good reason to! We are purposefully giving up some flexibility in favor of standardization and ease of use.

Standardization: If everyone were to have easy access to the raw dataset, we would have little control over how people use the data. Different people may preprocess the data differently, leading to incomparable results.
Ease of use: We are trying to build a universal data format for drug discovery, so that different subcommunities (e.g. phenomics, small-molecules, ...) don't have to reinvent the wheel for data wrangling, but can all use the same Polaris API. Some more context in this PyData talk.

Does that make sense?

A final word: If you're interested in LOO-CV, you may be interested in our paper on method comparison which is coming out very soon! See here for more context.

polaris-hub / polaris

Simple format downloads of the datasets (especially the small ones) #209
