openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Internal Data Storage Format #388

Open janvanrijn opened 7 years ago

janvanrijn commented 7 years ago

It would be good to store tabular data internally in a structured way. This would give us the option to serve data in any way we see fit, and would not restrict us (and workbenches) to just ARFF.

The challenge is to do this in a general way, so that we are not restricted to the tabular data we currently focus on but can also support other kinds of data.

janvanrijn commented 7 years ago

Most prominently, this would give us the opportunity to serve datasets tailored to the task they are used for.

Currently, workbench programmers are responsible for removing 'ignore' and 'row_id' attributes, which is error-prone. With this feature, that responsibility could (optionally) be carried by the OpenML servers, taking a burden, code complexity, and a source of bugs away from workbench developers.
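
For illustration, here is a minimal sketch of what such server-side filtering might look like, assuming the dataset metadata lists the 'ignore' attributes and the 'row_id' attribute by name. The helper `strip_attributes`, its parameters, and the example data are hypothetical and not part of any existing OpenML API.

```python
from typing import Iterable, Optional, Sequence


def strip_attributes(
    header: Sequence[str],
    rows: Iterable[Sequence[object]],
    ignore_attributes: Iterable[str] = (),
    row_id_attribute: Optional[str] = None,
):
    """Return (header, rows) without the 'ignore' and 'row_id' columns.

    Hypothetical helper: the name and signature are assumptions.
    """
    drop = set(ignore_attributes)
    if row_id_attribute is not None:
        drop.add(row_id_attribute)

    # Fail loudly on unknown names instead of silently serving a dataset
    # that still contains columns the task should not see.
    unknown = drop - set(header)
    if unknown:
        raise KeyError(f"unknown attributes: {sorted(unknown)}")

    # Keep the original column order for the remaining attributes.
    keep = [i for i, name in enumerate(header) if name not in drop]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Made-up example data, for illustration only.
header = ["id", "sepal_length", "source_file", "class"]
rows = [[1, 5.1, "a.csv", "setosa"], [2, 6.2, "b.csv", "virginica"]]
clean_header, clean_rows = strip_attributes(
    header, rows, ignore_attributes=["source_file"], row_id_attribute="id"
)
```

With the filtering done once on the server, each workbench would receive only the columns relevant to the task and would not need its own copy of this logic.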

jnothman commented 6 years ago

Desiderata:

amueller commented 5 years ago

Also see: https://github.com/openml/OpenML/issues/218

amueller commented 5 years ago

I would argue that moving to Parquet for dense data, and possibly leaving the sparse data as ARFF, would already be an improvement. I don't think there are that many sparse data formats around.
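
To make the proposed split concrete, here is a rough sketch of how a storage layer might pick a format per dataset. The function names, the density measure, and the 0.1 threshold are all assumptions chosen for illustration; nothing here reflects how OpenML actually decides.

```python
from typing import Sequence


def density(rows: Sequence[Sequence[float]]) -> float:
    """Fraction of non-zero cells in a table given as a list of rows."""
    total = sum(len(row) for row in rows)
    if total == 0:
        # Assumption: treat an empty table as dense.
        return 1.0
    nonzero = sum(1 for row in rows for value in row if value != 0)
    return nonzero / total


def choose_storage_format(
    rows: Sequence[Sequence[float]], sparse_threshold: float = 0.1
) -> str:
    """Pick 'arff' for sparse tables and 'parquet' for dense ones.

    Hypothetical policy: the threshold value is arbitrary.
    """
    return "arff" if density(rows) < sparse_threshold else "parquet"


# Made-up example tables.
dense_rows = [[1.0, 2.0], [3.0, 4.0]]
sparse_rows = [[0.0] * 11 + [1.0]]

dense_choice = choose_storage_format(dense_rows)
sparse_choice = choose_storage_format(sparse_rows)
```

In practice the decision could just as well come from dataset metadata (for example, whether the upload was a sparse ARFF file) rather than from counting zeros.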