Internal Data Storage Format

openml / OpenML

Open Machine Learning

https://openml.org

BSD 3-Clause "New" or "Revised" License


Internal Data Storage Format #388

Open janvanrijn opened 7 years ago

janvanrijn commented 7 years ago

It would be good to store tabular data internally in a structured way. This would give us the option to serve data in any way we see fit, and would not restrict us (and workbenches) to just ARFF.

The challenge is to do this in a general way, so that we are not restricted to the tabular data we currently focus on but can also cover other kinds of data.

janvanrijn commented 7 years ago

Most prominently, this would give us the opportunity to serve datasets tailored to the task they are being used for.

Currently, workbench programmers are responsible for removing 'ignore' and 'row_id' attributes, which is error-prone. With the feature requested here, this responsibility could (optionally) be carried by the OpenML servers, taking a burden, some code complexity, and a source of bugs away from workbench developers.
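To make the idea concrete, here is a minimal sketch of what such server-side filtering could look like. It is not OpenML code: the function name `task_view`, its parameters, and the column-oriented dict layout are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def task_view(
    columns: Dict[str, List],
    ignore: Sequence[str] = (),
    row_id: Optional[str] = None,
) -> Dict[str, List]:
    """Return a copy of a column-oriented table without the excluded columns.

    `columns`, `ignore` and `row_id` are hypothetical names; the real
    dataset metadata may be organised differently.
    """
    excluded = set(ignore)
    if row_id is not None:
        excluded.add(row_id)

    # Fail loudly if the metadata names a column the data does not have.
    missing = excluded.difference(columns)
    if missing:
        raise KeyError(f"unknown attributes: {sorted(missing)}")

    # Copy the remaining columns so the stored dataset is left untouched.
    return {
        name: list(values)
        for name, values in columns.items()
        if name not in excluded
    }


dataset = {
    "id": [1, 2, 3],
    "timestamp": ["a", "b", "c"],
    "sepal_length": [5.1, 4.9, 4.7],
    "species": ["setosa", "setosa", "virginica"],
}

view = task_view(dataset, ignore=["timestamp"], row_id="id")
print(list(view))
```

With something like this on the server, a workbench would receive only the columns relevant to the task and would not need its own filtering logic.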

jnothman commented 6 years ago

Desiderata:

It should handle the sparse and dense cases.
It should be easy to convert to/from ARFF.
It would be best to follow an existing standard (Apache Arrow? NetCDF?) and use existing tools.
It might be nice if it is a format that users can download / ingest directly, and that maps more efficiently to a memory model than ARFF does (i.e. fixed-width storage of each attribute).
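As a rough illustration of the last point, the sketch below stores one numeric attribute as fixed-width little-endian doubles using only the Python standard library. The helper names `pack_column` and `read_value` are made up for this example, and a real format such as Arrow does considerably more (null handling, typed metadata, and so on).

```python
import struct


def pack_column(values, fmt="<d"):
    """Pack one attribute into a contiguous buffer of fixed-width items."""
    item = struct.Struct(fmt)
    return b"".join(item.pack(v) for v in values)


def read_value(buffer, index, fmt="<d"):
    """Read the value at `index` directly by its byte offset, without parsing."""
    item = struct.Struct(fmt)
    return item.unpack_from(buffer, index * item.size)[0]


sepal_length = [5.1, 4.9, 4.7]
buffer = pack_column(sepal_length)

# Every value occupies exactly 8 bytes, so row i starts at byte 8 * i.
print(len(buffer))
print(read_value(buffer, 2))
```

Because each value has the same width, the position of any row is a simple multiplication, whereas a text format like ARFF has to be parsed line by line to find it.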

amueller commented 5 years ago

Also see: https://github.com/openml/OpenML/issues/218

amueller commented 5 years ago

I would argue that moving to Parquet for dense data and possibly leaving the sparse data as ARFF would already be an improvement. I don't think there are that many sparse data formats around.
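For reference, sparse ARFF writes each instance as a brace-enclosed list of `index value` pairs with zero-based attribute indices, omitting zero values. The sketch below encodes a numeric row that way and computes a simple density figure; the function names are hypothetical, and where exactly a dataset stops counting as "dense" is left open here.

```python
def density(rows):
    """Fraction of non-zero cells in a list of equally long numeric rows."""
    total = sum(len(row) for row in rows)
    nonzero = sum(1 for row in rows for value in row if value != 0)
    return nonzero / total if total else 0.0


def to_sparse_arff_row(row):
    """Encode one numeric row in sparse ARFF notation, omitting zeros."""
    pairs = [f"{index} {value}" for index, value in enumerate(row) if value != 0]
    return "{" + ", ".join(pairs) + "}"


rows = [
    [0, 2.5, 0, 1],
    [0, 0, 0, 3],
]

print(density(rows))
for row in rows:
    print(to_sparse_arff_row(row))
```

This only covers numeric attributes; nominal or string values would additionally need ARFF quoting rules.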