Open amueller opened 5 years ago
MNIST (41063) should probably be deactivated. It only has the training data, not the test set. If necessary we can rename mnist_784.
I think that would be good. Or there could be aliases ;)
We had discussed having more common identifiers, right?
I just don't want users to have to remember that instead of fetch_mldata("MNIST original")
they now have to call fetch_openml("mnist_784")
We currently don't have aliases for datasets. @janvanrijn what do you think?
What we could do is adjust the data listing function so that it uses a wildcard search through the database, rather than an EQUALS search. That would solve this particular MNIST problem.
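To make the difference between the two lookup modes concrete, here is a rough sketch over an in-memory list of names. The catalog and both function names are hypothetical and only for illustration; this is not how the OpenML listing is actually implemented.

```python
from fnmatch import fnmatchcase

# Hypothetical catalog of dataset names (taken from this thread), not the real listing.
CATALOG = ["mnist_784", "MNIST", "iris", "titanic"]


def find_exact(name, catalog=CATALOG):
    """EQUALS search: only names identical to the query match."""
    return [n for n in catalog if n == name]


def find_wildcard(pattern, catalog=CATALOG):
    """Wildcard search: case-insensitive glob match, e.g. 'mnist*'."""
    pattern = pattern.lower()
    return [n for n in catalog if fnmatchcase(n.lower(), pattern)]
```

With an exact search, a query for "mnist" finds nothing, while the wildcard query "mnist*" returns both MNIST variants, which also shows why the user would then have to pick one.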
I am hesitant to add an alias functionality, as I currently do not see a way to add that without extending the maintenance load. If you have a proposal that takes this into account that would change matters, I would be glad to hear it.
Sorry, no proposal that wouldn't cause significant overhead. I'm not sure a wildcard search by default is wise, though it might work if we warn accordingly and return the actual dataset. There are so many variants of the same data on OpenML that it's hard for users to be sure they have the "right" version. There's usually a canonical version of each dataset (a canonical MNIST, a canonical iris, a canonical titanic, etc.), but these are hard to find on OpenML.
We will discuss this issue tomorrow and probably enable dataset renaming for a group of editors.
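A rough sketch of the "warn and provide the actual dataset" idea, assuming a fuzzy fallback that only fires when exactly one name matches. The `ACTIVE_NAMES` list and the `resolve` helper are hypothetical; nothing like this exists in OpenML or scikit-learn as far as this thread shows.

```python
import warnings

# Hypothetical list of active dataset names; illustrative only.
ACTIVE_NAMES = ["mnist_784", "iris", "titanic"]


def resolve(name, names=ACTIVE_NAMES):
    """Return the dataset name to fetch, warning if a fuzzy match was used."""
    if name in names:
        return name
    # Fall back to a case-insensitive substring match.
    matches = [n for n in names if name.lower() in n.lower()]
    if len(matches) == 1:
        warnings.warn(
            f"No dataset named {name!r}; using {matches[0]!r} instead.",
            UserWarning,
        )
        return matches[0]
    # Zero or several candidates: refuse to guess.
    raise KeyError(f"No unique dataset matches {name!r}: {matches}")
```

Here `resolve("MNIST")` would return "mnist_784" with a warning, while an ambiguous or unknown name raises instead of silently picking a variant.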
It's weird for an sklearn user to have
fetch_openml("MNIST")
return "no dataset MNIST". There is an MNIST dataset on OpenML, which is "in preparation".