openml / openml-data

For tracking issues related to OpenML datasets

add dataset: sector #7

Open amueller opened 6 years ago

amueller commented 6 years ago

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#sector

Generally someone should check if openml has all the LibSVM datasets (if appropriate).
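A minimal sketch of such a check, assuming the two name lists have already been collected by hand or by some other means. The helper `missing_datasets` and both example lists are hypothetical and are not part of any OpenML or LibSVM tooling; the "available" list is made up for illustration and does not reflect what OpenML actually hosts.

```python
def missing_datasets(wanted, available):
    """Return the names in `wanted` with no case-insensitive match in `available`."""
    known = {name.strip().lower() for name in available}
    return [name for name in wanted if name.strip().lower() not in known]


# Hypothetical inputs: a few dataset names mentioned in this thread, and a
# made-up list standing in for names already present in the repository.
libsvm_names = ["sector", "Sensorless", "usps", "rcv1"]
openml_names = ["iris", "USPS", "mnist_784"]

print(missing_datasets(libsvm_names, openml_names))
```

Matching on names alone is only a first pass: the same dataset may be uploaded under a different name, or only in a binarized variant, so any reported hit or miss would still need a manual look.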

amueller commented 6 years ago

It looks like "sensorless" is also missing, even though it's on UCI: https://archive.ics.uci.edu/ml/datasets/dataset+for+sensorless+drive+diagnosis

[done]

I kinda thought we had all UCI datasets?

Both sector and sensorless seem like interesting benchmark datasets for something like CC18 (though sector probably has too many classes?)

amueller commented 6 years ago

RCV1 is also missing? https://archive.ics.uci.edu/ml/datasets/Reuters+RCV1+RCV2+Multilingual,+Multiview+Text+Categorization+Test+collection

There's only a binarized version, it should have 53 classes...

No wonder we have a hard time coming up with 100 datasets if all the standard datasets are missing ;)

amueller commented 6 years ago

And USPS is missing? https://www.otexts.org/1577

amueller commented 6 years ago

@joaquinvanschoren did something ever happen to the talks with MLData? They have all of these.

We should at least make an effort to get the datasets that are very popular on mldata: http://mldata.org/repository/data/by_downloads/

@janvanrijn maybe you can work on the python uploading interface to make this easy?

joaquinvanschoren commented 6 years ago

I’ve emailed mikio again this morning. Waiting for his reply. He has a database dump but still needs to extract the dataset info.

(Replying by email to Andreas Mueller's comment of Tue, 3 Apr 2018.)

amueller commented 6 years ago

I don't want to delay the stuff on CC18, but don't you think adding these would be helpful? I feel a bit odd about not including standard datasets simply because no-one has uploaded them (because uploading is too hard ;)

amueller commented 6 years ago

Actually MLData provides ARFF files for most datasets, so uploading wouldn't even be that hard.

joaquinvanschoren commented 6 years ago

The problem is that they have no index, you have to scrape the website to get the dataset names and info, and the website is not very responsive right now.

amueller commented 6 years ago

Depends on whether you want all of them or only the popular ones. From the site I linked above you can easily get the datasets I mentioned and a couple more that look interesting.

amueller commented 6 years ago

Doesn't seem so hard: https://www.openml.org/d/41070

but I feel like maybe you can find a student to do that?

amueller commented 6 years ago

(The ARFF listed the class as "continuous" and called it "int0", though. Is there a way to fix that in OpenML, or do I need to manually edit the ARFF?)

joaquinvanschoren commented 6 years ago

Right now you still need to edit the ARFF files manually.
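For illustration, the manual edit could be scripted roughly as below. This is a sketch under assumptions, not an OpenML feature: it assumes the target is declared in the ARFF header as something like `@attribute int0 numeric`, that the attribute name contains no spaces, and that the listed class values match the strings used in the data rows. The function name `fix_class_attribute` and the example file contents are hypothetical.

```python
def fix_class_attribute(arff_text, old_name, new_name, class_values):
    """Redeclare one header attribute as a nominal attribute.

    Only the first matching @attribute line before @data is rewritten;
    the data rows are left untouched.
    """
    nominal = "{" + ",".join(str(v) for v in class_values) + "}"
    out, in_header, done = [], True, False
    for line in arff_text.splitlines():
        parts = line.split()
        if parts and parts[0].lower() == "@data":
            in_header = False
        if (
            in_header
            and not done
            and len(parts) >= 3
            and parts[0].lower() == "@attribute"
            and parts[1].strip("'\"") == old_name
        ):
            line = f"@attribute {new_name} {nominal}"
            done = True
        out.append(line)
    if not done:
        raise ValueError(f"attribute {old_name!r} not found in header")
    return "\n".join(out) + "\n"


# Hypothetical two-row file, made up for this example.
example = """@relation usps
@attribute pixel1 numeric
@attribute int0 numeric
@data
0.5,1
0.1,2
"""

fixed = fix_class_attribute(example, "int0", "class", [1, 2])
print(fixed)
```

If the data rows store the labels as `1.0` rather than `1`, the nominal values would have to be written the same way, since ARFF nominal values are matched as strings.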

amueller commented 6 years ago

A bunch of the datasets I mentioned are not on MLData, but they are on UCI. I had kind of assumed we had most of the UCI datasets; maybe we can have someone look into that?

amueller commented 5 years ago

This dataset seems interesting and is also missing: https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification#

amueller commented 5 years ago

Many of these would qualify for cc-18 btw, or cc-20 or whatever.

amueller commented 5 years ago

Why is USPS deactivated? https://www.openml.org/d/41070 @janvanrijn ?

janvanrijn commented 5 years ago

It was deactivated by an automatic script that, at that time, deactivated all in_preparation datasets (so that we could start the auto activation). I will remove the deactivation status and leave it up to the evaluation engine to activate it (or not).

amueller commented 5 years ago

Wait, that means everything that was in preparation was put into deactivated? Why? Because of the huge flood of datasets? Could we do this more selectively? It would be good to make sure we activate as many datasets as possible. It seems really weird to me to just deactivate all datasets before a particular date for no reason...

janvanrijn commented 5 years ago

This was discussed in the appropriate email thread. I appreciate your concern, but personally I won't have time to maintain actual content on OpenML that goes beyond the policy "in case of doubt, deactivate / remove / redo" on top of my current maintenance responsibilities.

I would gladly give you the list of deactivated datasets of that moment if you are interested :)

amueller commented 5 years ago

fair. Sure, give it to me ;)

amueller commented 5 years ago

There's no way for me to rerun the evaluation engine on those, right?

amueller commented 5 years ago

Ugh uploaded sensorless under the wrong name: https://www.openml.org/d/42173 (deactivated now)

but then also under the right name: https://www.openml.org/d/42174

Found three bugs in openml/openml-python lol....

amueller commented 5 years ago

also see #9 #2