When Adjutorium selects the optimal imputer, StackingEnsemble and AggregatingEnsemble fail because of the missing-data check in the upstream implementation.
Example to reproduce
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import sys
import random
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')
from adjutorium.studies.classifiers import ClassifierStudy
from adjutorium.utils.serialization import load_model_from_file, save_model_to_file
from adjutorium.utils.tester import evaluate_estimator
import adjutorium.logger as log
X, Y = load_breast_cancer(return_X_y=True, as_frame=True)
# Simulate missingness
total_len = len(X)
for col in ["mean texture", "mean compactness"]:
    indices = random.sample(range(0, total_len), 10)
    X.loc[indices, col] = np.nan
dataset = X.copy()
dataset["target"] = Y
workspace = Path("workspace")
workspace.mkdir(parents=True, exist_ok=True)
study_name = "classification_example_imputation"
study = ClassifierStudy(
    study_name=study_name,
    dataset=dataset,
    target="target",
    num_iter=1,
    num_study_iter=1,
    timeout=1,
    imputers=["mean", "ice", "median"],
    classifiers=["logistic_regression", "lda"],
    feature_scaling=[],  # feature preprocessing is disabled
    score_threshold=0.4,
    workspace=workspace,
)
log.add(sys.stderr, level="INFO")
study.run()
Result
The following messages appear in the log:
...
[2022-05-25T16:53:43.553802+0100][45426][INFO] StackingEnsemble failed Input contains NaN, infinity or a value too large for dtype('float64').
[2022-05-25T16:53:43.579949+0100][45426][INFO] AggregatingEnsemble failed Input contains NaN, infinity or a value too large for dtype('float64').
...
Note
This is caused by the input validation in the upstream module combo:
...
def fit(self, X, y):
    """Fit classifier.
    Parameters
    ----------
    X : numpy array of shape (n_samples, n_features)
        The input samples.
    y : numpy array of shape (n_samples,), optional (default=None)
        The ground truth of the input samples (labels).
    """
    # Validate inputs X and y
    X, y = check_X_y(X, y)
    X = check_array(X)
    self._set_n_classes(y)
...
The StackingEnsemble and AggregatingEnsemble crash at this validation step even though the imputer is included in the pipeline.
The input data should be imputed before being passed to these ensembles. Alternatively, this behavior could be overridden with a customized implementation.
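A minimal sketch of the first option, imputing before the data reaches the validated fit(). It uses only scikit-learn, not Adjutorium or combo. The wrapper name ImputeThenFit is hypothetical, and LogisticRegression only stands in for an ensemble whose fit() calls check_X_y. It is meant to illustrate the idea, not to describe how the fix should be implemented in Adjutorium.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.utils import check_X_y


class ImputeThenFit:
    """Hypothetical wrapper: fill missing values, then delegate to `estimator`."""

    def __init__(self, estimator, strategy="mean"):
        self.estimator = estimator
        self.imputer_ = SimpleImputer(strategy=strategy)

    def fit(self, X, y):
        # Impute first so the NaN check applied downstream no longer fails.
        X_filled = self.imputer_.fit_transform(X)
        # Same kind of validation as in the upstream fit(); it passes now.
        X_filled, y = check_X_y(X_filled, y)
        self.estimator.fit(X_filled, y)
        return self

    def predict(self, X):
        # Reuse the statistics learned during fit to fill test-time gaps.
        return self.estimator.predict(self.imputer_.transform(X))


# Toy data with missing values (illustrative only).
X = np.array([
    [1.0, 2.0], [np.nan, 1.5], [1.2, np.nan],
    [8.0, 9.0], [np.nan, 8.5], [7.5, np.nan],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Without imputation, the validation rejects the input with a ValueError.
try:
    check_X_y(X, y)
    raw_rejected = False
except ValueError:
    raw_rejected = True

# With the wrapper, fitting and predicting succeed on the same data.
model = ImputeThenFit(LogisticRegression()).fit(X, y)
preds = model.predict(X)
```

The wrapper keeps the fitted imputer so that predict() applies the same fill values as fit(). Overriding the validation itself, the second option above, would instead require changing or subclassing the upstream ensemble.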