Open janvanrijn opened 7 years ago
Most prominently, this would give us the opportunity to serve datasets tailored to the task they are being used for.
Currently, workbench programmers are responsible for removing 'ignore' and 'row_id' attributes, which is error-prone. With the mentioned feature request in place, this responsibility could (optionally) be carried by the OpenML servers, taking away a burden, code complexity and a source of bugs from workbench developers.
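To make the burden concrete, here is a minimal sketch of the filtering step that each workbench currently has to reimplement. The function name and its parameters (`ignore_attributes`, `row_id_attribute`) are hypothetical and chosen only to mirror the 'ignore' and 'row_id' attributes mentioned above; this is not taken from any existing OpenML client.

```python
def strip_non_feature_columns(header, rows, ignore_attributes=(), row_id_attribute=None):
    """Return (header, rows) without the ignored and row-id columns.

    All names here are illustrative assumptions, not an OpenML API.
    """
    drop = set(ignore_attributes)
    if row_id_attribute is not None:
        drop.add(row_id_attribute)
    # Indices of the columns that should be kept, in their original order.
    keep = [i for i, name in enumerate(header) if name not in drop]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Toy data: "id" is the row identifier, "internal_note" should be ignored.
header = ["id", "sepal_length", "internal_note", "class"]
rows = [[0, 5.1, "x", "setosa"], [1, 4.9, "y", "setosa"]]
features, data = strip_non_feature_columns(
    header, rows, ignore_attributes=["internal_note"], row_id_attribute="id"
)
```

If the server applied this step before serving a dataset for a task, forgetting it on the client side would no longer silently leak identifier columns into a model.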
Desiderata:
I would argue moving to parquet for dense data and possibly leaving the sparse data as arff would be an improvement already. I don't think there are that many sparse data formats around.
It would be good to internally store tabular data in a structured way. This would give us the option to serve data in any way we see fit, and not restrict us (and workbenches) to just arff.
The challenge is to do this in a general way, such that we are not restricted to the tabular data we are now focussed on, but can also handle other sorts of data.
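As a rough illustration of the "store once, serve in any format" idea for the tabular case, the sketch below keeps a table in one internal structure and picks a serializer on request. The `Table` class, the `serve` function and the format names are all hypothetical, and the ARFF writer is deliberately simplified (no quoting, escaping, nominal attributes or sparse rows).

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical internal representation of a tabular dataset."""
    name: str
    columns: list  # list of (column name, "numeric" or "string") pairs
    rows: list     # list of rows, each a list of values


def to_csv(table):
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([name for name, _ in table.columns])
    writer.writerows(table.rows)
    return buf.getvalue()


def to_json(table):
    names = [name for name, _ in table.columns]
    return json.dumps([dict(zip(names, row)) for row in table.rows])


def to_arff(table):
    # Simplified ARFF: only NUMERIC and STRING attributes, no escaping.
    lines = [f"@RELATION {table.name}", ""]
    for name, kind in table.columns:
        arff_type = "NUMERIC" if kind == "numeric" else "STRING"
        lines.append(f"@ATTRIBUTE {name} {arff_type}")
    lines += ["", "@DATA"]
    lines += [",".join(str(v) for v in row) for row in table.rows]
    return "\n".join(lines) + "\n"


# Adding a new output format only means registering one more function.
SERIALIZERS = {"csv": to_csv, "json": to_json, "arff": to_arff}


def serve(table, fmt):
    return SERIALIZERS[fmt](table)


demo = Table("demo", [("x", "numeric"), ("label", "string")], [[1.5, "a"], [2.0, "b"]])
```

Non-tabular data would need a different internal structure, which is exactly the generality problem raised above; this sketch does not attempt to solve it.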