current, imperfect results are here http://rpubs.com/giuseppec/OpenML100
@berndbischl @joaquinvanschoren Could you please put the definition here which we came up with last time?
I think it was fitting a RandomForest per feature and checking if that has accuracy 100%
@berndbischl @joaquinvanschoren was that the rule?
We agreed: a dataset is too simple if you can build a decision tree on a single feature that reaches perfect accuracy under stratified 10-fold CV.
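For concreteness, a minimal scikit-learn sketch of that rule (a sketch only, not the code actually used; `X` and `y` are placeholders for a preprocessed numeric feature matrix and the labels):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def too_simple(X, y):
    """True if a decision tree on any single feature reaches
    perfect accuracy under stratified 10-fold CV."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for j in range(X.shape[1]):
        tree = DecisionTreeClassifier(random_state=0)
        scores = cross_val_score(tree, X[:, [j]], y, cv=cv)  # accuracy per fold
        if np.mean(scores) == 1.0:
            return True
    return False
```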
> We agreed: a dataset is too simple if you can build a decision tree on a single feature that reaches perfect accuracy under stratified 10-fold CV.
not sure if i find that "enough" but go ahead.
Do you have another practical test we could add?
well, what i would do first is run a few very simple models on all data sets and look at the distribution of results. i find that more explorative.
that would be:
i guess.
a problem with the current rule above might be: what if the simple model reaches 99.9% (and nothing else is better - ok, granted, the 2nd point you only get from a real benchmark)
another thing is that IF openml works properly, writing such a benchmark should be a matter of minutes. so it's also a good litmus test for the project and the suite itself. and i have already written that code, multiple times now. it's just that i always run into some annoying technical problems (which really should not be there anymore)
OK, so if we run a few simple models and one of them gets more than 99% performance, should we exclude that dataset?
But something explorative is not a rule we could state in a paper, right?
@berndbischl "another thing is that IF openml works properly, writing such a benchmark should be a matter of minutes."
Yup :) -> https://github.com/openml/Study-14/blob/master/OpenML%20Benchmark%20generator.ipynb
Well, it actually took me a few hours, and then I needed about half a day to double-check some of the 'new' datasets that it returned. Most checks are instantaneous; only the check for 'too easy' datasets takes a few hours, because it needs to build lots of models.
Let me know if you want to add more tests for 'too easy' datasets.
My proposal for the easiness test would be: the dataset should not be perfectly described by at most X features (X can be set to one).
In practice, the notebook does this. The datasets that you filtered out will also be filtered out by this test. However, the same result can be achieved in another way as well:
1) Replace the Random Forest by a decision stump.
2) Train the decision stump on the whole dataset and score it on the whole dataset.
3) Make the exclusion criterion accuracy == 1.00 (instead of accuracy >= 0.99).

(N.B. this specific change removes the mushroom dataset from the exclusion list, as it does not depend on a single attribute. A sketch of the check follows the list of advantages below.)
In my opinion, this has the following advantages:

1) It does not rely on any form of arbitrary splitting (i.e., no cross-validation splits).
2) There is no arbitrary threshold on the score (1.00 seems much more defensible than 0.99).
3) There is no arbitrary choice of classifier.
4) It is easier to justify in the paper, as there are no arbitrary choices.
5) The datasets that come out of this check are the same as for the RF test (i.e., irish, cjs).
6) (No real argument, but it runs in several minutes.)
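The proposed check could look like this (a sketch, not necessarily the notebook's actual code; the iris data is just a stand-in for an OpenML dataset):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for an OpenML dataset

# 1) a decision stump in scikit-learn is a depth-1 tree (one binary split)
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
# 2) train and score on the whole dataset (no splitting involved)
stump.fit(X, y)
# 3) exclude only on a perfect resubstitution score
too_easy = stump.score(X, y) == 1.00
```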
A decision stump makes only 1 binary split in scikit-learn. So, if the target can be perfectly predicted by a 3-way split (e.g. for a 3-class problem), the decision stump is not sufficient.
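A tiny synthetic example of that limitation (assuming scikit-learn):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# one feature whose three values map to three classes:
# only a 3-way split separates them perfectly
X = np.array([[0], [1], [2]] * 10)
y = X.ravel()

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(stump.score(X, y))  # ~0.67: one binary split cannot isolate 3 classes
deeper = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(deeper.score(X, y))  # 1.0: two nested binary splits emulate the 3-way split
```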
Hmm, that is kind of a problem. I will provide custom code that alleviates this problem
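One way such custom code could look for the X = 1 case (a sketch, not necessarily what ended up in the notebook): fit an unrestricted-depth tree on each feature in isolation, so that stacked binary splits can emulate multi-way splits.

```python
from sklearn.tree import DecisionTreeClassifier

def perfectly_described_by_one_feature(X, y):
    """Train-set check: does any single feature perfectly separate the classes?"""
    for j in range(X.shape[1]):
        tree = DecisionTreeClassifier(random_state=0)  # no depth limit
        tree.fit(X[:, [j]], y)
        if tree.score(X[:, [j]], y) == 1.00:
            return True
    return False
```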
> The datasets that you filtered out will also be filtered out by this test.
I was mistaken: irish and cjs get a significantly lower score with the decision stump on training data. No dataset obtains a 100% score. I will (re)do the decision tree on 1 attribute / all attributes and update tomorrow.
I just ran an additional triviality check. The following datasets get a perfect score from either a decision tree or logistic regression (cross-validation score): mushroom and cardiotocography. A decision stump gets a perfect score on musk (train set), and, as pointed out by Joaquin, irish and cjs can be perfectly classified by a single feature (random forest). That leaves the following list of currently undisputed datasets (82):
{3: 'kr-vs-kp',
6: 'letter',
11: 'balance-scale',
12: 'mfeat-factors',
14: 'mfeat-fourier',
15: 'breast-w',
16: 'mfeat-karhunen',
18: 'mfeat-morphological',
22: 'mfeat-zernike',
23: 'cmc',
28: 'optdigits',
29: 'credit-approval',
31: 'credit-g',
32: 'pendigits',
37: 'diabetes',
38: 'sick',
42: 'soybean',
44: 'spambase',
46: 'splice',
50: 'tic-tac-toe',
54: 'vehicle',
60: 'waveform-5000',
151: 'electricity',
182: 'satimage',
188: 'eucalyptus',
300: 'isolet',
307: 'vowel',
377: 'synthetic_control',
469: 'analcatdata_dmft',
554: 'mnist_784',
1038: 'gina_agnostic',
1049: 'pc4',
1050: 'pc3',
1053: 'jm1',
1063: 'kc2',
1067: 'kc1',
1068: 'pc1',
1120: 'MagicTelescope',
1461: 'bank-marketing',
1462: 'banknote-authentication',
1464: 'blood-transfusion-service-center',
1468: 'cnae-9',
1475: 'first-order-theorem-proving',
1478: 'har',
1480: 'ilpd',
1485: 'madelon',
1486: 'nomao',
1487: 'ozone-level-8hr',
1489: 'phoneme',
1491: 'one-hundred-plants-margin',
1492: 'one-hundred-plants-shape',
1493: 'one-hundred-plants-texture',
1494: 'qsar-biodeg',
1497: 'wall-robot-navigation',
1501: 'semeion',
1510: 'wdbc',
1515: 'micro-mass',
1590: 'adult',
4134: 'Bioresponse',
4534: 'PhishingWebsites',
4538: 'GesturePhaseSegmentationProcessed',
6332: 'cylinder-bands',
23381: 'dresses-sales',
23512: 'higgs',
23517: 'numerai28.6',
40499: 'texture',
40536: 'SpeedDating',
40668: 'connect-4',
40670: 'dna',
40701: 'churn',
40705: 'tokyo1',
40923: 'Devnagari-Script',
40966: 'MiceProtein',
40971: 'collins',
40979: 'mfeat-pixel',
40981: 'Australian',
40982: 'steel-plates-fault',
40983: 'wilt',
40984: 'segment',
40994: 'climate-model-simulation-crashes',
40996: 'Fashion-MNIST',
41027: 'jungle_chess_2pcs_raw_endgame_complete'}
This gives the following changes to our list in the document:
New ones!
Devnagari-Script
Fashion-MNIST
churn
cjs
credit-approval
dna
irish
jungle_chess_2pcs_raw_endgame_complete
numerai28.6
synthetic_control
tokyo1
Dropped ones!
Internet-Advertisements
ada_agnostic
car
cardiotocography
cjs
credit-a
eeg-eye-state
irish
mushroom
sylva_agnostic
Current definition is:
which is now also reflected in the latest version of the paper.
TODO: put rule nr 3 back in notebook
current suggestion:

- run 1-NN, NB, and rpart, normally cross-validated on the tasks, using the splits OML gives us
- missing values imputed by: num --> median, cat --> new level
- measure: balanced error rate (BER)
- if BER <= 0.01 (i.e., >= 99% balanced accuracy) for any classifier --> remove the dataset from the suite
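A rough scikit-learn sketch of this rule (the suggestion reads like mlr/R, so rpart is approximated here by a CART decision tree; `X`, `y`, `num_cols`, and `cat_cols` are hypothetical placeholders for a task's data and column types, and `cv=10` stands in for the task's predefined OpenML splits):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

impute = ColumnTransformer(
    [("num", SimpleImputer(strategy="median"), num_cols),  # num -> median
     ("cat", make_pipeline(
         SimpleImputer(strategy="constant", fill_value="MISSING"),  # cat -> new level
         OneHotEncoder(handle_unknown="ignore")), cat_cols)],
    sparse_threshold=0.0)  # dense output so GaussianNB can consume it

learners = [KNeighborsClassifier(n_neighbors=1),     # 1-NN
            GaussianNB(),                            # NB
            DecisionTreeClassifier(random_state=0)]  # CART as an rpart stand-in

too_easy = False
for clf in learners:
    pred = cross_val_predict(make_pipeline(impute, clf), X, y, cv=10)
    ber = 1.0 - balanced_accuracy_score(y, pred)
    if ber <= 0.01:  # >= 99% balanced accuracy -> remove from the suite
        too_easy = True
```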