openml / benchmark-suites


define rules to filter out data which is too simple #6

Closed berndbischl closed 6 years ago

berndbischl commented 6 years ago

current suggestion

run 1NN, NB and rpart, normally cross-validated on the tasks as OML tells us
missing values imputed by: numeric --> median, categorical --> new level
measure the balanced error rate (BER)

if BER >= 0.99 for any classifier --> remove the dataset from the suite
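A minimal sketch of this filter, assuming scikit-learn stand-ins for the three simple models (KNeighborsClassifier for 1NN, GaussianNB for NB, DecisionTreeClassifier for rpart) and assuming the exclusion trigger is near-perfect performance, i.e. a balanced error rate close to zero, in line with the >= 99% performance threshold discussed later in this thread:

```python
# Sketch only; model choices, imputation details and the 0.01 BER cut-off are
# assumptions, not the notebook's actual code.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier


def too_simple(X: pd.DataFrame, y, max_ber: float = 0.01) -> bool:
    """True if any very simple model reaches a balanced error rate <= max_ber."""
    num_cols = X.select_dtypes(include="number").columns
    cat_cols = X.columns.difference(num_cols)
    preprocess = ColumnTransformer([
        # numeric features: impute missing values with the median
        ("num", SimpleImputer(strategy="median"), num_cols),
        # categorical features: impute with a new level, then one-hot encode
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="constant", fill_value="MISSING")),
            # sparse_output requires scikit-learn >= 1.2 (older versions: sparse=False)
            ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
        ]), cat_cols),
    ])
    simple_models = [
        KNeighborsClassifier(n_neighbors=1),     # 1NN
        GaussianNB(),                            # NB
        DecisionTreeClassifier(random_state=0),  # rpart-like decision tree
    ]
    for model in simple_models:
        pipe = Pipeline([("prep", preprocess), ("clf", model)])
        # balanced error rate = 1 - balanced accuracy, averaged over the CV folds
        bal_acc = cross_val_score(pipe, X, y, cv=10,
                                  scoring="balanced_accuracy").mean()
        if 1.0 - bal_acc <= max_ber:
            return True
    return False
```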

berndbischl commented 6 years ago

current, imperfect results are here http://rpubs.com/giuseppec/OpenML100

mfeurer commented 6 years ago

@berndbischl @joaquinvanschoren Could you please put the definition here which we came up with last time?

mfeurer commented 6 years ago

I think it was fitting a RandomForest per feature and checking whether it reaches 100% accuracy.

mfeurer commented 6 years ago

@berndbischl @joaquinvanschoren was that the rule?

joaquinvanschoren commented 6 years ago

We agreed: a dataset is too simple if you can build a tree on a single feature that has perfect accuracy under stratified 10-fold CV.
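A minimal sketch of that rule, assuming scikit-learn and purely numeric features (the actual notebook code may differ): fit a decision tree on each feature in isolation and check whether any of them reaches perfect accuracy under stratified 10-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


def perfect_single_feature_tree(X: np.ndarray, y: np.ndarray) -> bool:
    """True if a tree built on any single feature has perfect stratified 10-fold CV accuracy."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for j in range(X.shape[1]):
        tree = DecisionTreeClassifier(random_state=0)
        scores = cross_val_score(tree, X[:, [j]], y, cv=cv, scoring="accuracy")
        if scores.mean() == 1.0:  # solvable with one feature -> too simple
            return True
    return False
```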

berndbischl commented 6 years ago

"We agreed: a dataset is too simple if you can build a tree on a single feature that has perfect accuracy under stratified 10-fold CV."

Not sure if I find that "enough", but go ahead.

joaquinvanschoren commented 6 years ago

Do you have another practical test we could add?

berndbischl commented 6 years ago

Well, what I would do first is run a few very simple models on all datasets and look at the distribution of results. I find that more exploratory.

That would be:

I guess.

A problem with the current rule above might be: what if the simple model reaches 99.9% (and nothing else is better)? OK, granted, the second point you only get from a real benchmark.

berndbischl commented 6 years ago

Another thing is that IF OpenML works properly, writing such a benchmark should be a matter of minutes, so it's also a good litmus test for the project and the suite itself. And I already wrote that code, multiple times now; it's just that I always run into some annoying technical problems (that really should not be there anymore).

joaquinvanschoren commented 6 years ago

OK, so if we run a few simple models and one of them gets more than 99% performance, should we exclude that dataset?

mfeurer commented 6 years ago

But something explorative is not a rule we could state in a paper, right?

joaquinvanschoren commented 6 years ago

@berndbischl "another thing is that IF openml works properly, writing such a benchmark should be a matter of minutes."

Yup :) -> https://github.com/openml/Study-14/blob/master/OpenML%20Benchmark%20generator.ipynb

Well, it actually took me a few hours, and then I needed about half a day to double-check some of the 'new' datasets that it returned. Most checks are instantaneous; only the check for 'too easy' datasets takes a few hours because it needs to build lots of models.

joaquinvanschoren commented 6 years ago

Let me know if you want to add more tests for 'too easy' datasets.

janvanrijn commented 6 years ago

My proposal for the easiness test would be: the dataset should not be perfectly described by at most X features (X can be set to one).

In practice, the notebook does this. The datasets that you filtered out will also be filtered out by this test. However, this can be achieved in another way as well:

1) Replace the Random Forest by a decision stump.
2) Train the decision stump on the whole dataset and score it on the whole dataset.
3) Make the exclusion criterion accuracy == 1.00 (instead of accuracy >= 0.99).

(N.B. this specific change removes the mushroom dataset from the exclusion list, as it is not dependent on a single attribute.)

In my opinion, this has the following advantages:
1) It does not rely on any form of arbitrary splitting (i.e., no cross-validation splits).
2) There is no arbitrary threshold on the score (1.00 seems much more defensible than 0.99).
3) There is no arbitrary choice of classifier.
4) It is easier to justify in the paper, as there are no arbitrary choices.
5) The datasets that come out of this check are the same as for the RF test (i.e., irish, cjs).
6) (Not a real argument, but it runs in several minutes.)
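A rough sketch of this stump-based criterion, assuming scikit-learn with max_depth=1 as the decision stump (the notebook's actual implementation may differ):

```python
from sklearn.tree import DecisionTreeClassifier


def trivial_by_stump(X, y) -> bool:
    """Fit a depth-1 tree on the full data and score it on the same data."""
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y)
    # no CV splits and no threshold below 1.00: exclude only on a perfect training score
    return stump.score(X, y) == 1.0
```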

joaquinvanschoren commented 6 years ago

A decision stump makes only 1 binary split in scikit-learn. So, if the target can be perfectly predicted by a 3-way split (e.g. for a 3-class problem), the decision stump is not sufficient.

janvanrijn commented 6 years ago

Hmm, that is kind of a problem. I will provide custom code that alleviates this problem.
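One possible form of that custom code (an assumption, not the actual implementation): instead of a binary stump, fit an unrestricted decision tree on each feature in isolation, so that a single feature that splits the target into three or more classes is still detected.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def trivial_by_single_feature(X: np.ndarray, y: np.ndarray) -> bool:
    """True if an unrestricted tree on any single feature perfectly fits the full data."""
    for j in range(X.shape[1]):
        tree = DecisionTreeClassifier(random_state=0)  # depth not limited to 1
        tree.fit(X[:, [j]], y)
        if tree.score(X[:, [j]], y) == 1.0:  # perfect score on the training data
            return True
    return False
```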

janvanrijn commented 6 years ago

The datasets that you filtered out will also be filtered out by this test.

I was mistaken: irish and cjs get a significantly lower score from the decision stump on the training data. No dataset obtains a 100% score. I will (re)do the decision tree on 1 attribute / all attributes and update tomorrow.

janvanrijn commented 6 years ago

I just ran an additional triviality check: the following datasets get a perfect score from either a decision tree or logistic regression (cross-validation score): mushroom and cardiotocography. A decision stump gets a perfect score on musk (training set), and, as pointed out by Joaquin, irish and cjs can be perfectly classified by a single feature (random forest). That leaves the list of currently undisputed datasets as the following (82):

{3: 'kr-vs-kp',
 6: 'letter',
 11: 'balance-scale',
 12: 'mfeat-factors',
 14: 'mfeat-fourier',
 15: 'breast-w',
 16: 'mfeat-karhunen',
 18: 'mfeat-morphological',
 22: 'mfeat-zernike',
 23: 'cmc',
 28: 'optdigits',
 29: 'credit-approval',
 31: 'credit-g',
 32: 'pendigits',
 37: 'diabetes',
 38: 'sick',
 42: 'soybean',
 44: 'spambase',
 46: 'splice',
 50: 'tic-tac-toe',
 54: 'vehicle',
 60: 'waveform-5000',
 151: 'electricity',
 182: 'satimage',
 188: 'eucalyptus',
 300: 'isolet',
 307: 'vowel',
 377: 'synthetic_control',
 469: 'analcatdata_dmft',
 554: 'mnist_784',
 1038: 'gina_agnostic',
 1049: 'pc4',
 1050: 'pc3',
 1053: 'jm1',
 1063: 'kc2',
 1067: 'kc1',
 1068: 'pc1',
 1120: 'MagicTelescope',
 1461: 'bank-marketing',
 1462: 'banknote-authentication',
 1464: 'blood-transfusion-service-center',
 1468: 'cnae-9',
 1475: 'first-order-theorem-proving',
 1478: 'har',
 1480: 'ilpd',
 1485: 'madelon',
 1486: 'nomao',
 1487: 'ozone-level-8hr',
 1489: 'phoneme',
 1491: 'one-hundred-plants-margin',
 1492: 'one-hundred-plants-shape',
 1493: 'one-hundred-plants-texture',
 1494: 'qsar-biodeg',
 1497: 'wall-robot-navigation',
 1501: 'semeion',
 1510: 'wdbc',
 1515: 'micro-mass',
 1590: 'adult',
 4134: 'Bioresponse',
 4534: 'PhishingWebsites',
 4538: 'GesturePhaseSegmentationProcessed',
 6332: 'cylinder-bands',
 23381: 'dresses-sales',
 23512: 'higgs',
 23517: 'numerai28.6',
 40499: 'texture',
 40536: 'SpeedDating',
 40668: 'connect-4',
 40670: 'dna',
 40701: 'churn',
 40705: 'tokyo1',
 40923: 'Devnagari-Script',
 40966: 'MiceProtein',
 40971: 'collins',
 40979: 'mfeat-pixel',
 40981: 'Australian',
 40982: 'steel-plates-fault',
 40983: 'wilt',
 40984: 'segment',
 40994: 'climate-model-simulation-crashes',
 40996: 'Fashion-MNIST',
 41027: 'jungle_chess_2pcs_raw_endgame_complete'}

This gives the following changes to our list in the document:

New ones!
   Devnagari-Script
   Fashion-MNIST
   churn
   cjs
   credit-approval
   dna
   irish
   jungle_chess_2pcs_raw_endgame_complete
   numerai28.6
   synthetic_control
   tokyo1
Dropped ones!
   Internet-Advertisements
   ada_agnostic
   car
   cardiotocography
   cjs
   credit-a
   eeg-eye-state
   irish
   mushroom
   sylva_agnostic
mfeurer commented 6 years ago

Current definition is:

which is now also reflected in the latest version of the paper.

janvanrijn commented 6 years ago

TODO: put rule nr 3 back in notebook