openml / benchmark-suites


we need to update our inclusion/exclusion rules for data sets #2

Closed berndbischl closed 6 years ago

berndbischl commented 6 years ago

we need to at least include

max nr of levels per cat feature <= 100

minimal class size in abs nr >= 20
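
A minimal sketch of how these two rules could be checked against OpenML's precomputed data qualities (assuming the openml Python package and that the dataset listing exposes the MaxNominalAttDistinctValues and MinorityClassSize columns):

```python
# Sketch: filter the OpenML dataset listing by the two rules above.
# Column names are OpenML data-quality names and assumed to be present.
import openml

datasets = openml.datasets.list_datasets(output_format="dataframe")

candidates = datasets[
    # <= 100 levels per categorical feature (datasets without nominal features pass trivially)
    (datasets["MaxNominalAttDistinctValues"].fillna(0) <= 100)
    # >= 20 observations in the smallest class
    & (datasets["MinorityClassSize"] >= 20)
]
print(len(candidates), "datasets pass both rules")
```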

berndbischl commented 6 years ago

and recheck whether our complete rule set holds at the end for the final set

berndbischl commented 6 years ago

simple data sets are handled here #6

mfeurer commented 6 years ago

data streams / non-i.i.d. data are excluded

mfeurer commented 6 years ago

datasets without a clear description or from a non-public source are excluded

mfeurer commented 6 years ago

will be reflected in the outcome of issue #17.

janvanrijn commented 6 years ago

> max nr of levels per cat feature <= 100

Can you elaborate on this one? Is there a reason to exclude these datasets?

joaquinvanschoren commented 6 years ago

Some algorithms can only handle numeric features. Using a standard one-hot encoding, this means that any such feature is translated into 100+ new features, which sometimes breaks experiments.
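
For illustration, a minimal sketch of that blow-up on made-up data (the column name and level count are invented; pandas.get_dummies stands in for any standard one-hot encoder):

```python
# Made-up example: one categorical column with 150 distinct levels.
import pandas as pd

df = pd.DataFrame({"city": [f"city_{i % 150}" for i in range(1000)]})
encoded = pd.get_dummies(df, columns=["city"])

print(df.shape)       # (1000, 1)   -> a single categorical feature
print(encoded.shape)  # (1000, 150) -> one new column per level
```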

janvanrijn commented 6 years ago

Isn't that something for the algorithm designer to keep in mind, rather than a benchmark criterion?

By adding this criterion we introduce another arbitrary cut-off: why choose 100, and not 97, 73 or 48, which could all just as reasonably be justified as "high numbers" for one-hot encoding?

mfeurer commented 6 years ago

Wasn't the issue that R can't handle some of those, @berndbischl?

joaquinvanschoren commented 6 years ago
mfeurer commented 6 years ago

In scikit-learn we one-hot-encode everything.

janvanrijn commented 6 years ago

> Often, the algorithm designer cannot decide this. There are good reasons why scikit-learn is built on numpy arrays for instance. You can work around this (e.g. do some internal smart preprocessing, or just silently drop the feature inside the algorithm), but often designers don't want to dictate this.

I don't see what numpy has to do with this. From a meta-learning / workflow-designer point of view, you would want to be able to decide, based on the meta-knowledge, whether to drop the feature, skip one-hot encoding, or use an alternative encoding (see the sketch after this comment).

If there is a hard limit on the number of attributes R can handle: i) what is this limit? ii) how do we ensure, with the current set of rules, that this limit is not violated?
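
As an illustration of such a meta-knowledge-driven choice, here is a hedged sketch (the threshold, the column handling, and the fallback to an ordinal encoding are all illustrative assumptions, not a recommendation from this thread); it assumes a pandas DataFrame and scikit-learn:

```python
# Illustrative only: choose an encoding per column based on its cardinality.
# The threshold of 100 mirrors the rule discussed above.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

def make_encoder(X, categorical_columns, max_levels=100):
    low_card = [c for c in categorical_columns if X[c].nunique() <= max_levels]
    high_card = [c for c in categorical_columns if X[c].nunique() > max_levels]
    return ColumnTransformer(
        [
            ("onehot", OneHotEncoder(handle_unknown="ignore"), low_card),
            ("ordinal", OrdinalEncoder(), high_card),  # or drop these columns instead
        ],
        remainder="passthrough",
    )
```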

janvanrijn commented 6 years ago

Proposal (as discussed on Skype): We drop the 'max. number of nominal attribute values' criterion for the following reason:

The proposal is to use one-hot encoding, which has its own issues:

Additionally, we impose the following constraint on the max number of attributes.

mfeurer commented 6 years ago

Current criteria for inclusion are:

Current criteria for exclusion are:

@berndbischl we dropped the exclusion criterion '100 levels' because it seemed arbitrary (arbitrary in the sense that this is not the threshold of any of R's algorithms for the number of levels that can be handled, according to @giuseppec). Instead we introduced a different criterion on the total number of categories in the dataset.
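
A minimal sketch of how that 'total number of categories' quantity could be computed for a dataset loaded as a pandas DataFrame (identifying categorical columns by dtype is an assumption; adjust to how the data is loaded):

```python
# Total number of categories = sum of distinct levels over all categorical columns.
import pandas as pd

def total_number_of_categories(df: pd.DataFrame) -> int:
    categorical = df.select_dtypes(include=["object", "category"])
    return int(categorical.nunique().sum())
```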

giuseppec commented 6 years ago

How many datasets with more than '100 levels' do we have left after applying these filter criteria? I could not see this in the notebook, and, e.g., a data set with 10 features of 500 levels each might be bad.

giuseppec commented 6 years ago

> minimal class size in abs nr >= 20

Currently, tasks 41, 9950, 9954, 9955, 9956 and 146802 have fewer than 20 observations in the minority class.

janvanrijn commented 6 years ago

FFR, this query reproduced your results:

SELECT t.task_id, d.did, d.name, dq.value AS `MinorityClassSize`
FROM dataset d, data_quality dq, task_inputs t
WHERE dq.data = d.did
  AND dq.quality = "MinorityClassSize"
  AND t.value = d.did
  AND t.input = "source_data"
  AND t.task_id IN (SELECT id FROM task_tag WHERE tag = "study_99")
  AND dq.value < 20
LIMIT 100

These have in common that they have many class labels (19-100), and it would be a shame to drop them. The most extreme cases (collins: 6, soybean: 8) are used in many references. My suggestion, given that the affected datasets look like decent benchmark material, is not to enforce this requirement any further.
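
For those without direct database access, roughly the same check can be done through the OpenML Python API; a sketch, assuming the task listing for the study_99 tag exposes the MinorityClassSize quality as a column:

```python
# Sketch of the same check via the OpenML Python API (no DB access needed).
# Assumes the task listing includes the "MinorityClassSize" quality.
import openml

tasks = openml.tasks.list_tasks(tag="study_99", output_format="dataframe")
small = tasks[tasks["MinorityClassSize"].astype(float) < 20]
print(small[["tid", "did", "name", "MinorityClassSize"]])
```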

giuseppec commented 6 years ago

If we keep them, we should check that the train-test splits do not contain any empty classes. We use stratified 10-fold splits here, don't we? And if we use stratified 10-fold CV, that means that for a class with, say, 20 instances we only predict for two instances per fold, right?
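
Roughly, yes; a quick sketch on made-up data (20 minority instances, scikit-learn's StratifiedKFold) shows about 2 minority instances in each test fold and 18 in each training fold:

```python
# Made-up data: 200 majority instances, 20 minority instances.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 200 + [1] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    print((y[train_idx] == 1).sum(), "minority in train,",
          (y[test_idx] == 1).sum(), "minority in test")  # 18 and 2 per fold
```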

janvanrijn commented 6 years ago

Current verdict (Skype call with @frank-hutter @giuseppec @mfeurer @janvanrijn) is to leave these datasets in, as there are (due to stratified CV) always at least 5 observations per class to train on. We also have the ratio requirement.

janvanrijn commented 6 years ago

TODO: check the status of criteria in the paper (also with grouped data removed)