openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Question Paper #582

Closed cynthiamaia closed 11 months ago

cynthiamaia commented 11 months ago

Hello, I didn't quite understand this excerpt from the article: "Instead, we impute missing values with the constant predictor, or prior. This baseline returns the empirical class distribution for classification and the empirical mean for regression. This is a very penalizing imputation strategy, as the constant predictor is often much worse than results obtained by the AutoML frameworks that produce predictions for the task or fold. However, we feel this penalty for ill-behaved systems is appropriate and fairer towards the well-behaved frameworks and hope that it encourages a standard of robust, well-behaved AutoML frameworks." Could you explain in more detail how this imputation was performed? I'm analyzing my experiments, and some runs failed and returned missing values, so I would like to understand better how you handled them. Thank you in advance.

PGijsbers commented 11 months ago

When running experiments, sometimes AutoML frameworks experience failures. You may encounter something like this in your results file (simplified for readability):

id, task, framework, constraint, fold, type, result, metric, ..., info, ...
openml.org/t/359991, kick, NaiveAutoML, 1h8c_gp3, 1, binary, , auc, ..., CalledProcessError: Command '/bench/frameworks/NaiveAutoML/venv/bin/python -W ignore /bench/frameworks/NaiveAutoML/exec.py' returned non-zero exit status 137., ...

Here, NaiveAutoML failed for whatever reason to create predictions, so no result is available. However, when we compare results across tasks and folds, such as when creating the critical difference diagrams, we need some performance measure. Concretely, when creating critical difference plots we first rank each framework by its mean score on each task. So we have to decide how to calculate a mean score for NaiveAutoML on kick even though it crashed on one (or more) folds. We decided to first impute such missing results with the score obtained by a constant predictor, and only then calculate the mean for NaiveAutoML on kick.
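For illustration, here is a minimal sketch of that imputation step using pandas. The file name and the flat column layout follow the simplified example above, so treat it as an illustration rather than the benchmark's own code:

```python
import pandas as pd

# Simplified results file as in the example above; `result` is NaN for failed runs.
results = pd.read_csv("results.csv")

# Scores of the constant predictor per (task, fold), used as the imputation value.
constant = (
    results[results["framework"] == "constantpredictor"]
    .set_index(["task", "fold"])["result"]
)

# Replace each missing result with the constant predictor's score on the same (task, fold).
def impute(row):
    if pd.isna(row["result"]):
        return constant.loc[(row["task"], row["fold"])]
    return row["result"]

results["result"] = results.apply(impute, axis=1)

# Only after imputation do we average over folds to get one score per (framework, task).
mean_scores = results.groupby(["framework", "task"])["result"].mean()
```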

To get scores for our constant predictor, we use scikit-learn's DummyClassifier and DummyRegressor. These simply predict the mean response of the training data (regression) or its empirical class probabilities (classification). We run them on each (task, fold) pair. We can then find the score of the constant predictor on fold 1 of kick:

openml.org/t/359991,kick,constantpredictor,1h8c_gp3,1,binary,0.5,auc,local,0.24.2,,2.0.5,2022-02-09T20:04:52,0.2,0.02,0.0001,1.0,1435973,,0.876969,0.5,0.37292,0.5,,,,

The constant predictor has an AUC of 0.5 on fold 1 of kick (as expected), so we impute the result of the NaiveAutoML experiment on fold 1 of kick as if it had obtained an AUC of 0.5, and use that when calculating the mean score for NaiveAutoML on kick. With that mean score in hand, we can compare NaiveAutoML to the other AutoML frameworks on kick and proceed with the analysis as usual. Hope that clears things up!
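If you want to reproduce such a constant-predictor score yourself, a sketch along these lines should do. The synthetic data and the single train/test split are stand-ins for an actual (task, fold) of the benchmark, not its own setup:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for one (task, fold) split; the benchmark uses the OpenML task's own folds.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# strategy="prior" always predicts the empirical class distribution of the training data.
clf = DummyClassifier(strategy="prior").fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Constant probabilities carry no ranking information, so the AUC is 0.5 by construction.
print(roc_auc_score(y_test, proba))  # 0.5
```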