openml / openml-data

For tracking issues related to OpenML datasets

MNIST is called mnist_784 #13

Open amueller opened 5 years ago

amueller commented 5 years ago

It's weird for an sklearn user to have fetch_openml("MNIST") return "no dataset MNIST". There is an MNIST dataset on OpenML, but it is "in preparation".

joaquinvanschoren commented 5 years ago

MNIST (41063) should probably be deactivated. It only has the training data, not the test set. If necessary we can rename mnist_784.

amueller commented 5 years ago

I think that would be good. Or there could be aliases ;) We had discussed having more common identifiers, right? I just don't want users to have to remember that instead of fetch_mldata("MNIST original") they now have to call fetch_openml("mnist_784").

joaquinvanschoren commented 5 years ago

We currently don't have aliases for datasets. @janvanrijn what do you think?

janvanrijn commented 5 years ago

What we could do is adjust the functionality of the data listing function such that it uses a wildcard search through the database, rather than an EQUALS search. That would solve this particular MNIST problem.
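To make the difference concrete, here is a minimal sketch of the two lookup styles. It is not OpenML's actual listing code: the catalogue rows, the status values, and the function names are all hypothetical, and only the names "MNIST" and "mnist_784" come from this thread.

```python
from fnmatch import fnmatchcase

# Hypothetical catalogue rows: (name, status). Only "MNIST" and "mnist_784"
# are taken from this thread; the other rows and the status strings are
# made up for illustration.
CATALOGUE = [
    ("MNIST", "in_preparation"),
    ("mnist_784", "active"),
    ("Fashion-MNIST", "active"),
    ("iris", "active"),
]


def active_names():
    """Names of datasets a listing call would consider (active ones only)."""
    return [name for name, status in CATALOGUE if status == "active"]


def exact_search(query):
    """EQUALS-style lookup: the name must be identical to the query."""
    return [name for name in active_names() if name == query]


def wildcard_search(query):
    """Wildcard lookup: case-insensitive match of the query anywhere in the name."""
    pattern = f"*{query.lower()}*"
    return [name for name in active_names() if fnmatchcase(name.lower(), pattern)]


print(exact_search("MNIST"))     # []
print(wildcard_search("MNIST"))  # ['mnist_784', 'Fashion-MNIST']
```

Note that the wildcard variant finds mnist_784 but also returns every other name containing the query, so the caller still has to pick one.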

I am hesitant to add an alias functionality, as I currently do not see a way to add that without extending the maintenance load. If you have a proposal that takes this into account that would change matters, I would be glad to hear it.

amueller commented 5 years ago

Sorry, no proposal that wouldn't cause significant overhead. Not sure if doing a wildcard search by default is wise. I guess it could work if we warn accordingly and return the actual dataset? There are so many variants of the same data on OpenML that it's hard for users to ensure they have the "right" version. There's usually a canonical version of each dataset (there's a canonical MNIST, there's a canonical iris, there's a canonical titanic etc.) but these are hard to find on OpenML.
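A rough sketch of what "warn accordingly" could look like on the client side, assuming a hypothetical resolve_name helper and a made-up list of active names (only "mnist_784" is from this thread). This is neither sklearn nor OpenML code. It falls back to a substring match only when exactly one candidate exists, and refuses to guess otherwise:

```python
import warnings

# Hypothetical list of active dataset names; only "mnist_784" comes from
# this thread.
ACTIVE_NAMES = ["mnist_784", "iris", "titanic"]


def resolve_name(query, names=ACTIVE_NAMES):
    """Return a dataset name for `query`, warning if it was not an exact match."""
    if query in names:
        return query

    # Fallback: case-insensitive substring match.
    candidates = [n for n in names if query.lower() in n.lower()]

    if len(candidates) == 1:
        warnings.warn(
            f"No dataset named {query!r}; using {candidates[0]!r} instead.",
            UserWarning,
        )
        return candidates[0]
    if not candidates:
        raise ValueError(f"No dataset matches {query!r}.")
    # Several variants match: make the user choose explicitly.
    raise ValueError(f"{query!r} is ambiguous; candidates: {candidates}")
```

With this list, resolve_name("MNIST") returns "mnist_784" and emits a UserWarning, while a query that matches several variants raises instead of silently picking one.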

janvanrijn commented 5 years ago

We will discuss this issue tomorrow and will probably enable dataset renaming for a group of editors.