Hi, I have found a hint in datautils.py that missing values are "masked" by default. But how are they masked?
Thank you in advance
Hi @jernsting, great question! And it's a good occasion to document this aspect. Let me answer the following questions:
How does automlbenchmark (partially) handle missing values?
automlbenchmark assumes that AutoML frameworks will handle missing values by themselves, but not all of them – especially the ones built on top of sklearn – accept "raw" data, e.g. categorical features. For those frameworks, the app encodes the categorical features, using label encoding by default, and that is the only case where the question of missing values comes up on our side. By contrast, other AutoML frameworks (AutoWEKA, H2OAutoML, ...) don't require this data preparation, and the original OpenML ARFF file is passed directly to the framework.
We currently only support OpenML tasks; those provide datasets in ARFF format, where missing values are represented as ? or an empty string. When loading those files, the ARFF Python parser converts them to None. Then, when the framework asks for the "encoded" data (e.g. dataset.train.X_enc), the missing values appear as NaN.
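To make this concrete, here is a minimal sketch of that load step using the liac-arff parser; the actual loading code in automlbenchmark differs in detail, and the toy ARFF content below is made up purely for illustration:

import arff  # liac-arff package

raw = (
    "@RELATION toy\n"
    "@ATTRIBUTE color {red,green,blue}\n"
    "@ATTRIBUTE size NUMERIC\n"
    "@DATA\n"
    "red,1.0\n"
    "?,2.5\n"
    "blue,?\n"
)
parsed = arff.loads(raw)
print(parsed['data'])  # '?' entries come back as None: [['red', 1.0], [None, 2.5], ['blue', None]]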
To be fair with frameworks that handle missing values by themselves, we don't remove missing values on our side, so masking is only an internal operation that occurs during the encoding step. Basically, for a given categorical feature we create a mask of all missing values (= None at that level); those are replaced by a value that can be encoded (an empty string by default), then the encoding takes place, and the mask is applied again to convert the value corresponding to encoded(None) back to NaN, as the result of the encoding is supposed to be numeric.
If the feature is numerical, only type encoding takes place and the None missing value is simply converted to NaN.
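For illustration, here is a rough re-implementation of the masking behaviour described above. It is only a sketch (the real logic lives in datautils.py and differs in detail), using sklearn's LabelEncoder to stand in for the default label encoding:

import numpy as np
from sklearn.preprocessing import LabelEncoder

def encode_categorical(values):
    # 1. mask the missing entries (None at this level)
    values = np.asarray(values, dtype=object)
    mask = np.array([v is None for v in values])
    # 2. replace them with a value that can be encoded (empty string by default)
    filled = np.where(mask, '', values).astype(str)
    encoded = LabelEncoder().fit_transform(filled).astype(float)
    # 3. re-apply the mask so that encoded(None) becomes NaN again
    encoded[mask] = np.nan
    return encoded

def encode_numerical(values):
    # numerical features only need type encoding: None -> NaN
    return np.array([np.nan if v is None else float(v) for v in values])

print(encode_categorical(['red', None, 'blue']))  # -> [2. nan 1.]
print(encode_numerical([1.0, None, 2.5]))         # -> [1. nan 2.5]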
I hope I've been clear enough.
Please also note that some frameworks don't handle missing values at all; for those, imputation takes place in the framework's integration exec.py (compare TPOT vs. autosklearn): by default, missing values are imputed with the mean value.
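As a hedged example (the exact integration code varies per framework and lives in each exec.py), mean imputation with sklearn could look like this:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_imputed = SimpleImputer(strategy='mean').fit_transform(X)
print(X_imputed)  # NaNs replaced by the column means (2.0 and 6.0)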
An example of a dataset containing missing values is eucalyptus, in the validation benchmark.
You may want to play with it and add some logging in your local branch to see how this is handled in detail.
Tip for a quick run (only fold 0, 5 min max):
python runbenchmark.py RandomForest validation -f 0 -t eucalyptus -Xt.max_runtime_seconds=300
Great, thanks a lot! Now I have seen in the exec.py that the data is split into training and test sets. So I have one more question: how big are the sets?
@jernsting sorry for the late reply, but I don't have any short answer to your question. It goes from ~850x20 (instances × features) for Vehicle (https://www.openml.org/d/54) to ~581,000x55 for Covertype (https://www.openml.org/d/1596).
I'm actually surprised we didn't write this information anywhere on the website or paper (https://openml.github.io/automlbenchmark/).
Anyway, before we decide to automatically extract this information from the OpenML datasets and add it to the summaries (https://github.com/openml/automlbenchmark/tree/master/reports/tables), I'm afraid you have to go through the dataset list and check those properties directly on OpenML if you haven't done so already.
The "fastest" way is to check the ___8c4h_ref.csv files under https://github.com/openml/automlbenchmark/tree/master/reports, copy/paste the id into your browser (e.g. openml.org/t/7592), then click on the dataset used by the task (e.g. adult) to obtain those size properties (and much more).
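If you prefer to script it, the openml Python package can fetch the same properties. This is just a suggestion, not something the benchmark does for you; task id 7592 is used here only as an example:

import openml

task = openml.tasks.get_task(7592)  # e.g. the adult task
dataset = task.get_dataset()
print(dataset.name)
print(dataset.qualities['NumberOfInstances'], dataset.qualities['NumberOfFeatures'])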
Ok, maybe I have to clarify my question: do you use the cv-splits defined by the openml datasets?
This question is interesting because you use the AUC score. If I simply use KFold to replicate some of the results, I often run into problems because y only contains one class.
Yes, we run the benchmark against the splits defined by the OpenML task (10-fold CV).
For TunedRandomForest, we also apply 5-fold CV on a given OpenML split, and we have never encountered an issue like yours. As I understand you're using sklearn, so maybe you should try StratifiedKFold instead of KFold.
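For reference, the predefined splits can also be retrieved directly with the openml Python package; this is only a sketch of how to inspect them, not the benchmark's own code:

import openml

task = openml.tasks.get_task(7592)  # any task id from the benchmark definitions
train_idx, test_idx = task.get_train_test_split_indices(repeat=0, fold=0)
print(len(train_idx), len(test_idx))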
"If I am simply using KFold to replicate some of the results, I often run into problems because y only contains one class."
Are you shuffling the data? By default KFold does not shuffle the data, so if the data is ordered by class it seems realistic that it would lead to bad splits.
Using StratifiedKFold like Seb mentioned would be preferred either way (browser is not letting me edit my previous comment).
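A small sketch of the difference, with toy data that is deliberately ordered by class:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)   # data ordered by class

for _, test_idx in KFold(n_splits=2).split(X):
    print('KFold test classes:', np.unique(y[test_idx]))            # a single class per fold
for _, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    print('StratifiedKFold test classes:', np.unique(y[test_idx]))  # both classes in every fold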
Closing this issue as the question is answered.