openml / benchmark-suites


we need to update our inclusion/exclusion rules for data sets #2

Closed berndbischl closed 6 years ago

berndbischl commented 6 years ago

we need to at least include

max nr of levels per cat feature <= 100

minimal class size in abs nr >= 20
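
A minimal sketch of how these two rules could be checked against OpenML's precomputed data qualities (assuming the openml Python package and that the dataset listing exposes the MaxNominalAttDistinctValues and MinorityClassSize columns):

```python
# Sketch: filter the OpenML dataset listing by the two rules above.
# Column names are OpenML data-quality names and assumed to be present.
import openml

datasets = openml.datasets.list_datasets(output_format="dataframe")

candidates = datasets[
    # <= 100 levels per categorical feature (datasets without nominal features pass trivially)
    (datasets["MaxNominalAttDistinctValues"].fillna(0) <= 100)
    # >= 20 observations in the smallest class
    & (datasets["MinorityClassSize"] >= 20)
]
print(len(candidates), "datasets pass both rules")
```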

berndbischl commented 6 years ago

and recheck whether our complete rule set holds at the end for the final set

berndbischl commented 6 years ago

simple data sets are handled here #6

mfeurer commented 6 years ago

data streams / non-i.i.d. data are excluded

mfeurer commented 6 years ago

datasets without a clear description or from a non-public source are excluded

mfeurer commented 6 years ago

will be reflected in the outcome of issue #17.

janvanrijn commented 6 years ago

> max nr of levels per cat feature <= 100

Can you elaborate on this one? Is there a reason to exclude these datasets?

joaquinvanschoren commented 6 years ago

Some algorithms can only handle numeric features. Using a standard one-hot encoding, this means that any such feature is translated into 100+ new features, which sometimes breaks experiments.
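
For illustration, a minimal sketch of that blow-up on made-up data (the column name and level count are invented; pandas.get_dummies stands in for any standard one-hot encoder):

```python
# Made-up example: one categorical column with 150 distinct levels.
import pandas as pd

df = pd.DataFrame({"city": [f"city_{i % 150}" for i in range(1000)]})
encoded = pd.get_dummies(df, columns=["city"])

print(df.shape)       # (1000, 1)   -> a single categorical feature
print(encoded.shape)  # (1000, 150) -> one new column per level
```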

janvanrijn commented 6 years ago

Isn't that something for the algorithm designer to keep in mind, rather than a benchmark criterion?

By adding this criterion we introduce another arbitrary cut-off: why choose 100, and not 97, 73 or 48, which could all just as reasonably be justified as "high numbers" for one-hot encoding?

mfeurer commented 6 years ago

Wasn't the issue that R can't handle some of those, @berndbischl?

joaquinvanschoren commented 6 years ago
mfeurer commented 6 years ago

In scikit-learn we one-hot-encode everything.

janvanrijn commented 6 years ago

> Often, the algorithm designer cannot decide this. There are good reasons why scikit-learn is built on numpy arrays for instance. You can work around this (e.g. do some internal smart preprocessing, or just silently drop the feature inside the algorithm), but often designers don't want to dictate this.

I don't see what numpy has to do with this. From a meta-learning / workflow-designer point of view, you would want to be able to decide, based on the meta-knowledge, whether to drop the feature, skip one-hot encoding, or use an alternative encoding (see the sketch after this comment).

If there is a hard limit on the number of attributes R can handle: i) what is this limit? ii) how do we ensure, with the current set of rules, that this limit is not violated?
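
As an illustration of such a meta-knowledge-driven choice, here is a hedged sketch (the threshold, the column handling, and the fallback to an ordinal encoding are all illustrative assumptions, not a recommendation from this thread); it assumes a pandas DataFrame and scikit-learn:

```python
# Illustrative only: choose an encoding per column based on its cardinality.
# The threshold of 100 mirrors the rule discussed above.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

def make_encoder(X, categorical_columns, max_levels=100):
    low_card = [c for c in categorical_columns if X[c].nunique() <= max_levels]
    high_card = [c for c in categorical_columns if X[c].nunique() > max_levels]
    return ColumnTransformer(
        [
            ("onehot", OneHotEncoder(handle_unknown="ignore"), low_card),
            ("ordinal", OrdinalEncoder(), high_card),  # or drop these columns instead
        ],
        remainder="passthrough",
    )
```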

janvanrijn commented 6 years ago

Proposal (as discussed on Skype): We drop the 'max. number of nominal attribute values' criterion for the following reason:

The proposal is to use one-hot encoding, which has its own issues:

Additionally, we impose the following constraint on the max number of attributes.

mfeurer commented 6 years ago

Current criteria for inclusion are:

Current criteria for exclusion are:

@berndbischl we dropped the exclusion criterion '100 levels' because it seemed arbitrary (arbitrary in the sense that this is not the threshold of any of R's algorithms for the number of levels that can be handled, according to @giuseppec). Instead we introduced a different criterion on the total number of categories in the dataset.
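
A minimal sketch of how that 'total number of categories' quantity could be computed for a dataset loaded as a pandas DataFrame (identifying categorical columns by dtype is an assumption; adjust to how the data is loaded):

```python
# Total number of categories = sum of distinct levels over all categorical columns.
import pandas as pd

def total_number_of_categories(df: pd.DataFrame) -> int:
    categorical = df.select_dtypes(include=["object", "category"])
    return int(categorical.nunique().sum())
```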

giuseppec commented 6 years ago

How many datasets with more than '100 levels' do we have left after applying these filter criteria? I could not see this in the notebook, and, e.g., a data set with 10 features of 500 levels each might be bad.

giuseppec commented 6 years ago

> minimal class size in abs nr >= 20

Currently, tasks 41, 9950, 9954, 9955, 9956 and 146802 have fewer than 20 observations in the minority class.

janvanrijn commented 6 years ago

FFR, this query reproduced your results:

SELECT t.task_id, d.did, d.name, dq.value AS `MinorityClassSize`
FROM dataset d, data_quality dq, task_inputs t
WHERE dq.data = d.did
  AND dq.quality = "MinorityClassSize"
  AND t.value = d.did
  AND t.input = "source_data"
  AND t.task_id IN (SELECT id FROM task_tag WHERE tag = "study_99")
  AND dq.value < 20
LIMIT 100

These have in common that they have many class labels (19-100), and it would be a shame to drop them. The most extreme cases (collins: 6, soybean: 8) are used in many references. My suggestion, given that the affected datasets look like decent benchmark material, is not to enforce this requirement any further.
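
For those without direct database access, roughly the same check can be done through the OpenML Python API; a sketch, assuming the task listing for the study_99 tag exposes the MinorityClassSize quality as a column:

```python
# Sketch of the same check via the OpenML Python API (no DB access needed).
# Assumes the task listing includes the "MinorityClassSize" quality.
import openml

tasks = openml.tasks.list_tasks(tag="study_99", output_format="dataframe")
small = tasks[tasks["MinorityClassSize"].astype(float) < 20]
print(small[["tid", "did", "name", "MinorityClassSize"]])
```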

giuseppec commented 6 years ago

If we keep them, we should check that the train-test splits do not contain any empty classes. We use stratified 10-fold splits here, don't we? And if we use stratified 10-fold CV, that means that for a class with, say, 20 instances we only predict for two instances per fold, right?
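
Roughly, yes; a quick sketch on made-up data (20 minority instances, scikit-learn's StratifiedKFold) shows about 2 minority instances in each test fold and 18 in each training fold:

```python
# Made-up data: 200 majority instances, 20 minority instances.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 200 + [1] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    print((y[train_idx] == 1).sum(), "minority in train,",
          (y[test_idx] == 1).sum(), "minority in test")  # 18 and 2 per fold
```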

janvanrijn commented 6 years ago

Current verdict (Skype call with @frank-hutter @giuseppec @mfeurer @janvanrijn) is to leave these datasets in, as there are (due to stratified CV) always at least 5 observations per class to train on. We also have the ratio requirement.

janvanrijn commented 6 years ago

TODO: check the status of criteria in the paper (also with grouped data removed)