berndbischl closed 6 years ago
and recheck whether our complete rule set holds at the end for the final set
simple data sets are handled here #6
data streams/non iid excluded
datasets without a clear description or non-public source
will be reflected in the outcome of issue #17.
max nr of levels per cat feature <= 100
Can you elaborate on this one? Is there a reason to exclude these datasets?
Some algorithms can only handle numeric features. With a standard one-hot encoding, any such feature is translated into 100+ new features, which sometimes breaks experiments.
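To make the blow-up concrete, here is a minimal sketch in plain Python. The `one_hot` helper and the 150-level toy feature are made up for illustration and are not taken from any of the benchmark datasets.

```python
def one_hot(values):
    """Encode categorical values as 0/1 rows, one column per distinct level."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1  # exactly one active column per observation
        rows.append(row)
    return levels, rows


# Hypothetical categorical feature with 150 distinct levels (assumption).
feature = [f"level_{i % 150}" for i in range(1000)]
levels, encoded = one_hot(feature)
n_columns = len(encoded[0])
print(f"1 categorical feature -> {n_columns} numeric columns")
```

A single feature above the proposed threshold already adds more columns than many of the datasets have in total.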
Isn't that something for the algorithm designer to keep in mind, rather than a benchmark criterion?
By adding this criterion we include another arbitrary split: why choose 100 and not 97, 73 or 48? Each of these can reasonably be justified by the fact that it is also a high number for one-hot encoding.
Wasn't the issue that R can't handle some of those, @berndbischl?
In scikit-learn we one-hot-encode everything.
Often, the algorithm designer cannot decide this. There are good reasons why scikit-learn is built on numpy arrays for instance. You can work around this (e.g. do some internal smart preprocessing, or just silently drop the feature inside the algorithm), but often designers don't want to dictate this.
I don't see what numpy has to do with this. From a meta-learning / workflow designer POV, you would want to be able to decide, based on the meta-knowledge, whether to drop the feature, skip one-hot encoding, or use an alternative version of one-hot encoding.
If there is a hard limit on the number of attributes R can handle, i) what is this limit? ii) How do we ensure, with the current set of rules, that this limit is not violated?
Proposal (as discussed on Skype): We drop the 'max. number of nominal attribute values' for the following reason:
Proposed is to use One-Hot-Encoding, which has different issues:
Additionally, we impose the following constraint on the max number of attributes.
Current criteria for inclusion are:
Current criteria for exclusion are:
@berndbischl we dropped the exclusion criterion '100 levels' because it seemed arbitrary (arbitrary in the sense that, according to @giuseppec, it is not the threshold of any of R's algorithms for the number of levels that can be handled). Instead we introduced a different criterion based on the total number of categories in the dataset.
How many datasets with more than '100 levels' do we have after applying these filter criteria? I could not see this in the notebook and, e.g. a data set with 10 features and 500 levels each might be bad.
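A rough sketch of how this count could be obtained, assuming we have, per dataset, a list with the number of levels of each categorical feature. The dataset names and numbers below are invented for illustration; the real values would have to come from the notebook or the OpenML meta-data.

```python
MAX_LEVELS_PER_FEATURE = 100  # the threshold under discussion


def summarize_levels(level_counts):
    """Summarize the per-feature level counts of one dataset."""
    return {
        "max_levels": max(level_counts, default=0),
        "total_levels": sum(level_counts),
        "n_wide_features": sum(
            1 for c in level_counts if c > MAX_LEVELS_PER_FEATURE
        ),
    }


# Hypothetical datasets: name -> number of levels per categorical feature.
datasets = {
    "toy_small": [3, 5, 2],
    "toy_wide": [500] * 10,  # the "10 features, 500 levels each" case
    "toy_mixed": [2, 150, 4],
}

flagged = sorted(
    name
    for name, counts in datasets.items()
    if summarize_levels(counts)["n_wide_features"] > 0
)
print(flagged)
print(summarize_levels(datasets["toy_wide"]))
```

Reporting both the per-feature maximum and the total makes it visible when a dataset passes one criterion but would fail the other.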
minimal class size in abs nr >= 20
Currently, the tasks 41, 9950, 9954, 9955, 9956 and 146802 have fewer than 20 observations in the minority class.
FFR, this query reproduced your results:
SELECT t.task_id, d.did, d.name, dq.value AS `MinorityClassSize`
FROM dataset d, data_quality dq, task_inputs t
WHERE dq.data = d.did
  AND dq.quality = "MinorityClassSize"
  AND t.value = d.did
  AND t.input = "source_data"
  AND t.task_id IN (SELECT id FROM task_tag WHERE tag = "study_99")
  AND dq.value < 20
LIMIT 100
These have in common that they have many class labels (19-100), and it would be a shame to drop them. The most extreme cases (collins - 6, soybean - 8) are used in many references. Given that the affected datasets look like decent benchmark material, my suggestion would be not to enforce this requirement.
If we keep them, we should check that the train-test splits do not contain any empty classes. We should use stratified 10-fold splits here, shouldn't we? And if we use stratified 10-fold CV, this means that for a class with, say, 20 instances we only predict two instances per fold, right?
Current verdict (Skype call @frank-hutter @giuseppec @mfeurer @janvanrijn) is to leave these datasets in, as there are (due to stratified CV) always at least 5 observations per class to train on. We also have the ratio requirement.
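The arithmetic behind this can be checked with a small sketch. It assumes that a stratified splitter spreads the instances of each class over the folds as evenly as possible, which is modelled here by a simple round-robin assignment; the actual splitter used for the tasks may place individual instances differently. `per_class_fold_sizes` is a hypothetical helper, not part of any library.

```python
def per_class_fold_sizes(class_size, n_folds=10):
    """Return (train, test) counts of one class for each fold,
    assuming its instances are dealt round-robin over the test folds."""
    test_sizes = [0] * n_folds
    for i in range(class_size):
        test_sizes[i % n_folds] += 1
    return [(class_size - t, t) for t in test_sizes]


# Minority class sizes mentioned in this thread: 6 (collins), 8 (soybean), 20.
summary = {}
for n in (6, 8, 20):
    sizes = per_class_fold_sizes(n)
    summary[n] = {
        "min_train": min(train for train, _ in sizes),
        "max_test": max(test for _, test in sizes),
        "empty_test_folds": sum(1 for _, test in sizes if test == 0),
    }
    print(n, summary[n])
```

Under this assumption a class with 6 instances always has at least 5 instances in the training part, but 4 of the 10 test folds contain no instance of that class; a class with 20 instances has 18 to train on and 2 to predict in every fold.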
TODO: check the status of criteria in the paper (also with grouped data removed)
we need to at least include
max nr of levels per cat feature <= 100
minimal class size in abs nr >= 20