Open amueller opened 6 years ago
It looks like "sensorless" is also missing, even though it's on UCI: https://archive.ics.uci.edu/ml/datasets/dataset+for+sensorless+drive+diagnosis
[done]
I kinda thought we had all UCI datasets?
Both sector and sensorless seem like interesting benchmark datasets for something like CC18 (though sector probably has too many classes?)
RCV1 is also missing? https://archive.ics.uci.edu/ml/datasets/Reuters+RCV1+RCV2+Multilingual,+Multiview+Text+Categorization+Test+collection
There's only a binarized version, it should have 53 classes...
No wonder we have a hard time coming up with 100 datasets if all the standard datasets are missing ;)
And USPS is missing? https://www.otexts.org/1577
@joaquinvanschoren did something ever happen to the talks with MLData? They have all of these.
We should at least make an effort to get the datasets that are very popular on mldata: http://mldata.org/repository/data/by_downloads/
@janvanrijn maybe you can work on the python uploading interface to make this easy?
I’ve emailed Mikio again this morning. Waiting for his reply. He has a database dump but still needs to extract the dataset info.
On Tue, 3 Apr 2018 at 19:42, Andreas Mueller wrote:
> @joaquinvanschoren did something ever happen to the talks with MLData? They have all of these.
> We should at least make an effort to get the datasets that are very popular on mldata: http://mldata.org/repository/data/by_downloads/
> @janvanrijn maybe you can work on the python uploading interface to make this easy?
-- Thank you, Joaquin
I don't want to delay the stuff on CC18, but don't you think adding these would be helpful? I feel a bit odd about not including standard datasets simply because no-one has uploaded them (because uploading is too hard ;)
Actually MLData provides ARFF files for most datasets, so uploading wouldn't even be that hard.
The problem is that they have no index: you have to scrape the website to get the dataset names and info, and the website is not very responsive right now.
On Tue, 3 Apr 2018 at 20:35, Andreas Mueller wrote:
> Actually MLData provides ARFF files for most datasets, so uploading wouldn't even be that hard.
-- Thank you, Joaquin
Depends on whether you want all of them or only the popular ones. From the site I linked above you can easily get the datasets I mentioned and a couple more that look interesting.
Doesn't seem so hard: https://www.openml.org/d/41070
but I feel like maybe you can find a student to do that?
The ARFF listed the class as "continuous" and called it "int0", though. Is there a way to fix that in OpenML or do I need to manually edit the ARFF?
Right now you still need to edit the ARFF files manually.
On Tue, 3 Apr 2018 at 21:36, Andreas Mueller wrote:
> The ARFF listed the class as "continuous" and called it "int0", though. Is there a way to fix that in OpenML or do I need to manually edit the ARFF?
-- Thank you, Joaquin
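For anyone facing the same manual ARFF edit, here is a minimal sketch of how it could be scripted with the Python standard library only. It assumes the class column is declared in the header as something like `@attribute int0 numeric` and that the set of class labels is already known. The function name `retype_class_attribute` and the toy header are hypothetical, not part of OpenML or openml-python. The data rows are left untouched, so the labels in the data section must already match the nominal values you declare.

```python
import re


def retype_class_attribute(arff_text, old_name, new_name, class_values):
    """Rename one header attribute and declare it nominal (hypothetical helper)."""
    # ARFF nominal types are written as {value1,value2,...}
    nominal = "{" + ",".join(str(v) for v in class_values) + "}"
    # Matches e.g. "@attribute int0 numeric"; ARFF keywords are case-insensitive.
    pattern = re.compile(
        r"^\s*@attribute\s+['\"]?" + re.escape(old_name) + r"['\"]?\s+\S.*$",
        re.IGNORECASE,
    )
    out, in_data, replaced = [], False, False
    for line in arff_text.splitlines():
        if line.strip().lower().startswith("@data"):
            in_data = True  # never rewrite anything after the header
        if not in_data and pattern.match(line):
            line = f"@attribute {new_name} {nominal}"
            replaced = True
        out.append(line)
    if not replaced:
        raise ValueError(f"attribute {old_name!r} not found in header")
    return "\n".join(out) + "\n"


# Toy input for illustration only; not the real USPS file.
example = """@relation usps
@attribute int0 numeric
@attribute double1 numeric
@data
1,0.5
2,0.1
"""

fixed = retype_class_attribute(example, "int0", "class", [1, 2])
print(fixed)
```

If the data section stores labels as floats (for example `1.0`), they would also need rewriting before the nominal declaration is valid; this sketch does not attempt that.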
A bunch of the datasets I mentioned are not on MLData, though they are on UCI. I had kind of assumed we had most of the UCI datasets; maybe we can have someone look into that?
This dataset seems interesting and is also missing: https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification#
Many of these would qualify for cc-18 btw, or cc-20 or whatever.
Why is USPS deactivated? https://www.openml.org/d/41070 @janvanrijn ?
It was deactivated by an automatic script that, at that time, deactivated all in_preparation datasets (so that we could start the auto-activation). I will remove the deactivation status and leave it up to the evaluation engine to activate it (or not).
wait that means everything that was in preparation was put into deactivated? why? because of the huge flood of datasets? could we do this more selectively? It would be good to make sure we activate as many datasets as possible. It seems really weird to me to just deactivate all datasets before a particular date for no reason...
This was discussed in the appropriate email thread. I appreciate your concern, but personally I won't have time to maintain actual content on OpenML that goes beyond the policy "in case of doubt, deactivate / remove / redo" on top of my current maintenance responsibilities.
I would gladly give you the list of datasets that were deactivated at that moment if you are interested :)
fair. Sure, give it to me ;)
There's no way for me to rerun the evaluation engine on those, right?
Ugh uploaded sensorless under the wrong name: https://www.openml.org/d/42173 (deactivated now)
but then also under the right name: https://www.openml.org/d/42174
Found three bugs in openml/openml-python lol....
also see #9 #2
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#sector
Generally someone should check if openml has all the LibSVM datasets (if appropriate).
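A rough way to do such a check is to compare dataset names after normalizing them. The sketch below is only that: the helper names `normalize` and `missing_from` are hypothetical, and both name lists are hand-written placeholders rather than real listings. In practice the OpenML side might come from the dataset listing in openml-python, and the LibSVM side would have to be collected from the page linked above. Name matching is crude, so anything flagged as missing would still need a manual look.

```python
import re


def normalize(name):
    """Lowercase and drop non-alphanumerics so 'Sensorless' matches 'sensorless'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def missing_from(candidates, existing):
    """Return candidate names with no normalized match among existing names."""
    known = {normalize(n) for n in existing}
    return [c for c in candidates if normalize(c) not in known]


# Placeholder lists for illustration only; not actual repository contents.
libsvm_names = ["sector", "Sensorless", "usps", "rcv1.multiclass"]
openml_names = ["USPS", "iris"]

print(missing_from(libsvm_names, openml_names))
```

Exact-name matching will miss renamed or versioned uploads, which is why the result is best treated as a list of candidates to review.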