openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Standard procedures for dealing with missing values #36

Closed: jernsting closed this issue 4 years ago

jernsting commented 5 years ago

Hi, I found a hint in datautils.py that missing values are "masked" by default. But how exactly are they masked?

Thank you in advance

sebhrusen commented 5 years ago

Hi @jernsting, great question! And it's a good occasion to document this aspect. Let me answer the following questions:

When does automlbenchmark handle missing values?

automlbenchmark assumes that AutoML frameworks handle missing values by themselves, but not all of them accept "raw" data such as categorical features, especially the ones built on top of sklearn. For those, the app encodes the categorical features (using label encoding by default), and that is the only case in which the question of missing values arises on our side.

By contrast, some AutoML frameworks (AutoWEKA, H2OAutoML, ...) don't require this data preparation, and the original OpenML ARFF file is passed directly to the framework.

Which values are considered missing?

We currently only support OpenML tasks, which provide datasets in ARFF format where missing values are represented as ? or an empty string. When loading those files, the ARFF Python parser converts them to None. Then, when the framework asks for the "encoded" data (e.g. dataset.train.X_enc), the missing values appear as NaN.
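For illustration, here is a minimal sketch of that pipeline (assuming the liac-arff package; treat the exact parser used in datautils.py as an implementation detail):

import arff  # liac-arff
import numpy as np

arff_text = """@RELATION toy
@ATTRIBUTE height NUMERIC
@ATTRIBUTE color {red,blue}
@DATA
1.70,red
?,blue
1.65,?
"""

decoded = arff.loads(arff_text)
print(decoded['data'])
# [[1.7, 'red'], [None, 'blue'], [1.65, None]]  <- '?' is parsed as None

heights = np.array([row[0] for row in decoded['data']], dtype=float)
print(heights)
# [1.7   nan  1.65]  <- None becomes NaN once cast to a numeric dtype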

What is masking and how does it work?

To be fair to frameworks that handle missing values by themselves, we don't remove missing values on our side, so masking is only an internal operation that occurs during the encoding step. Basically, for a given categorical feature we create a mask of all missing values (= None at that level); those are replaced by a value that can be encoded (an empty string by default), the encoding takes place, and the mask is then applied again to convert the value corresponding to encoded(None) back to NaN, since the result of the encoding is supposed to be numeric. If the feature is numerical, only type conversion takes place and the None missing values are simply converted to NaN.
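To make this concrete, here is a minimal sketch of the masking idea (not the actual datautils.py code; the helper name and the use of sklearn's LabelEncoder are illustration assumptions):

import numpy as np
from sklearn.preprocessing import LabelEncoder

def encode_categorical(values, placeholder=''):
    # 1. remember where the missing values are
    values = np.asarray(values, dtype=object)
    mask = np.array([v is None for v in values])
    # 2. replace None with a value that can be encoded
    filled = np.where(mask, placeholder, values).astype(str)
    # 3. encode, then re-apply the mask so encoded(None) becomes NaN again
    encoded = LabelEncoder().fit_transform(filled).astype(float)
    encoded[mask] = np.nan
    return encoded

print(encode_categorical(['red', None, 'blue', 'red']))
# [ 2. nan  1.  2.]  <- the placeholder gets a code too, but the mask turns it back into NaN

# numerical feature: no masking needed, the type conversion alone turns None into NaN
print(np.array([1.7, None, 1.65], dtype=float))
# [1.7   nan  1.65]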

I hope that's clear enough. Please also note that some frameworks don't handle missing values at all; for those, imputation takes place in the framework's integration exec.py (compare TPOT vs. autosklearn). By default, missing values are imputed with the mean value.
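For frameworks that need it, the imputation itself is straightforward; with a recent sklearn it could look like the sketch below (not necessarily the exact call used in exec.py):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
print(SimpleImputer(strategy='mean').fit_transform(X))
# [[1. 6.]
#  [3. 4.]
#  [2. 8.]]  <- each NaN replaced by its column mean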

An example of a dataset containing missing values is eucalyptus in the validation benchmark. You may want to play with it and add some logging on your local branch to see how this is handled in detail. Tip for a quick run (only fold 0, 5 min max):

python runbenchmark.py RandomForest validation -f 0 -t eucalyptus -Xt.max_runtime_seconds=300

jernsting commented 5 years ago

Great, thanks a lot! Now I have seen in exec.py that the data is split into a training and a test set. So I have one more question: how big are the sets?

sebhrusen commented 5 years ago

@jernsting sorry for the late reply, but I don't have a short answer to your question. It ranges from ~850x20 (instances x features) for Vehicle (https://www.openml.org/d/54) to ~581,000x55 for Covertype (https://www.openml.org/d/1596).

I'm actually surprised we didn't write this information anywhere on the website or in the paper (https://openml.github.io/automlbenchmark/). Anyway, before we decide to automatically extract this information from the OpenML datasets and add it to the summaries (https://github.com/openml/automlbenchmark/tree/master/reports/tables), I'm afraid you'll have to go through the dataset list and check those properties directly on OpenML, if you haven't done so already. The fastest way is to check the ___8c4h_ref.csv files under https://github.com/openml/automlbenchmark/tree/master/reports, paste the task id into your browser (e.g. openml.org/t/7592), then click on the dataset used by the task (e.g. adult) to obtain those size properties (and much more).
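If you prefer to script it, the same size properties can also be pulled with the openml Python package (a sketch; 7592 is the adult task mentioned above):

import openml

task = openml.tasks.get_task(7592)                      # e.g. the adult task
dataset = openml.datasets.get_dataset(task.dataset_id)
print(dataset.name,
      dataset.qualities['NumberOfInstances'],
      dataset.qualities['NumberOfFeatures'])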

jernsting commented 5 years ago

Ok, maybe I have to clarify my question: do you use the CV splits defined by the OpenML datasets?

This question is interesting because you use the AUC score. If I simply use KFold to replicate some of the results, I often run into problems because y only contains one class.

sebhrusen commented 5 years ago

Yes, we run the benchmark against the splits defined by the OpenML task (10-fold CV).

For TunedRandomForest, we additionally apply 5-fold CV on a given OpenML split, and we never encountered an issue like yours. Since, as I understand, you're using sklearn, you may want to try StratifiedKFold instead of KFold.
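If you want to reproduce exactly the outer splits we use, the openml package exposes them per fold (a sketch, again using the adult task 7592 as an example):

import openml

task = openml.tasks.get_task(7592)   # the task defines a 10-fold CV estimation procedure
for fold in range(10):
    train_idx, test_idx = task.get_train_test_split_indices(fold=fold)
    print(fold, len(train_idx), len(test_idx))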

PGijsbers commented 5 years ago

If I simply use KFold to replicate some of the results, I often run into problems because y only contains one class.

Are you shuffling the data? By default KFold does not shuffle, so if the data is ordered by class it is quite plausible that this leads to bad splits.
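A self-contained sketch (synthetic, class-sorted data rather than one of the benchmark datasets) showing the effect:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)    # labels sorted by class

for name, cv in [('KFold', KFold(n_splits=5)),
                 ('KFold shuffled', KFold(n_splits=5, shuffle=True, random_state=0)),
                 ('StratifiedKFold', StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    print(name, [len(np.unique(y[test])) for _, test in cv.split(X, y)])

# KFold            [1, 1, 2, 1, 1]  <- most test folds contain a single class (AUC undefined)
# KFold shuffled   [2, 2, 2, 2, 2]  (almost surely, for this class balance)
# StratifiedKFold  [2, 2, 2, 2, 2]  <- class proportions preserved by construction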

PGijsbers commented 5 years ago

Using StratifiedKFold, as Seb mentioned, would be preferred either way (my browser is not letting me edit my previous comment).

PGijsbers commented 4 years ago

Closing this issue as the question is answered.