openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Export OpenML data to data package, Import data package to OpenML #482

Open HeidiSeibold opened 6 years ago

HeidiSeibold commented 6 years ago

It would be nice if we could do that. It would make it much easier for people to upload data to OpenML and to work with data from OpenML.

Data packages are defined by:

Why?

To improve user friendliness. See also: https://docs.google.com/document/d/1c_RhDiXTK5bEsY5gGRuQwaF6fKilt4jKq2c_BRqyEDc/edit?usp=sharing

How is metadata specified in data packages?

https://specs.frictionlessdata.io/data-package/#metadata
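For readers unfamiliar with the format, here is a minimal sketch of what such metadata looks like, assuming the `datapackage.json` descriptor layout from the Frictionless Data specs (`name`, `title`, `description`, `licenses`, and `resources`, each resource with a `path` and a Table Schema). The dataset name, file path and fields are made up for illustration, and `field_names` is a hypothetical helper, not part of any library.

```python
import json

# Hypothetical descriptor for illustration only: the dataset name,
# file path and field list are invented, not taken from OpenML or DataHub.
descriptor = {
    "name": "example-dataset",
    "title": "Example dataset",
    "description": "Toy table used to illustrate data package metadata.",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "data",
            "path": "data.csv",
            # Table Schema: one entry per column, with a name and a type.
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "height", "type": "number"},
                    {"name": "label", "type": "string"},
                ]
            },
        }
    ],
}


def field_names(descriptor, resource_index=0):
    """Return the column names declared for one resource (hypothetical helper)."""
    schema = descriptor["resources"][resource_index]["schema"]
    return [field["name"] for field in schema["fields"]]


# This string is what would be written to datapackage.json.
datapackage_json = json.dumps(descriptor, indent=2)
```

The column-level information lives in the per-resource `schema`, while dataset-level information (title, description, license) sits at the top level of the descriptor.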

This is related to #457

HeidiSeibold commented 6 years ago

To be able to import data packages into OpenML I think we need to first do the following steps:

Any help on this issue would be very much appreciated :clap: :cake:

pwalsh commented 6 years ago

Copying my response from gitter.im at the request of @HeidiSeibold


I note there are a few libs in Python for ARFF

And we have a documented way to convert to/from data backends here:

https://github.com/frictionlessdata/tableschema-py#storage

And some example implementations of the storage API at:

So writing an ARFF backend would be great!

HeidiSeibold commented 6 years ago

The issue https://github.com/datahubio/qa/issues/33 is in principle the same, just the other way around. Both are equally helpful and important :smiley:

joaquinvanschoren commented 6 years ago

Interesting dataset: Maybe this one is a good place to start: https://datahub.io/anuveyatsu/farm-survey-simple

A nice thing is that they have all the attribute types, offered as a JSON file, so it should be easy to convert to ARFF. What is still missing is the task, i.e. what you want to predict. There is also no description of what the dataset is about.
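To make the "should be easy to convert" point concrete, here is a rough sketch of turning a Table Schema (the `fields` list from a data package's JSON) into an ARFF header. It uses only the standard library, not an existing ARFF or frictionlessdata package. The type mapping is an assumption (numbers become NUMERIC, fields with an `enum` constraint become nominal, everything else STRING), and `arff_header` and its helpers are hypothetical names. As noted above, the schema says nothing about the prediction target, so that would still have to be supplied separately.

```python
def _quote(value):
    """Quote an ARFF name or nominal value if it contains special characters."""
    text = str(value)
    if text == "" or any(ch.isspace() or ch in ",{}%'\"" for ch in text):
        escaped = text.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return text


def arff_attribute(field):
    """Map one Table Schema field to an @ATTRIBUTE line (assumed mapping)."""
    field_type = field.get("type", "string")
    enum = field.get("constraints", {}).get("enum")
    if enum:
        # A closed set of allowed values maps naturally to a nominal attribute.
        arff_type = "{" + ",".join(_quote(v) for v in enum) + "}"
    elif field_type in ("integer", "number"):
        arff_type = "NUMERIC"
    elif field_type == "boolean":
        arff_type = "{true,false}"
    else:
        # Fallback for strings and any type this sketch does not handle.
        arff_type = "STRING"
    return "@ATTRIBUTE " + _quote(field["name"]) + " " + arff_type


def arff_header(relation, schema):
    """Build the ARFF header (relation, attributes, @DATA marker) for a schema."""
    lines = ["@RELATION " + _quote(relation), ""]
    lines += [arff_attribute(field) for field in schema["fields"]]
    lines += ["", "@DATA"]
    return "\n".join(lines)


# Made-up schema for illustration; not the farm-survey-simple schema.
example_schema = {
    "fields": [
        {"name": "depth", "type": "number"},
        {"name": "my col", "type": "string"},
        {"name": "class", "type": "string", "constraints": {"enum": ["0", "1"]}},
    ]
}
header = arff_header("demo", example_schema)
```

Date fields and missing-value conventions are left out here; a real converter would need to decide how to handle both.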

I also couldn't figure out how to navigate DataHub. There are apparently 200+ datasets but I can only see a few of them on the website.

HeidiSeibold commented 6 years ago

> I also couldn't figure out how to navigate DataHub. There are apparently 200+ datasets but I can only see a few of them on the website.

That's a bug https://github.com/datahubio/qa/issues/32

joaquinvanschoren commented 6 years ago

Feedback from the frictionlessdata gitter:

HeidiSeibold commented 6 years ago

There are now some machine learning data sets available as data packages: http://datahub.io/machine-learning

Example: http://datahub.io/machine-learning/seismic-bumps Also available on OpenML: https://www.openml.org/d/1500

I guess a first step now would be to check:

See also discussion https://github.com/datahq/datahub-qa/issues/33#issuecomment-356368024

HeidiSeibold commented 6 years ago

We decided to wait until https://github.com/frictionlessdata/datapackage-r/issues/13 is properly solved.