vanderschaarlab / autoprognosis

A system for automating the design of predictive modeling pipelines tailored for clinical prognosis.
https://www.autoprognosis.vanderschaar-lab.com/
Apache License 2.0

StackingEnsemble and AggregatingEnsemble crash during fitting due to missing data #17

Closed yvchao closed 1 year ago

yvchao commented 2 years ago

Describe the bug

Even when Adjutorium selects an optimal imputer, StackingEnsemble and AggregatingEnsemble fail during fitting because the upstream implementation rejects input that contains missing values.

Example to reproduce

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import sys
import random
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from adjutorium.studies.classifiers import ClassifierStudy
from adjutorium.utils.serialization import load_model_from_file, save_model_to_file
from adjutorium.utils.tester import evaluate_estimator
import adjutorium.logger as log

X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

# Simulate missingness
total_len = len(X)

for col in ["mean texture", "mean compactness"]:
    indices = random.sample(range(0, total_len), 10)
    X.loc[indices, col] = np.nan

dataset = X.copy()
dataset["target"] = Y

workspace = Path("workspace")
workspace.mkdir(parents=True, exist_ok=True)

study_name = "classification_example_imputation"

study = ClassifierStudy(
    study_name=study_name,
    dataset=dataset,
    target="target",
    num_iter=1,
    num_study_iter=1,
    timeout=1,
    imputers=["mean", "ice", "median"],
    classifiers=["logistic_regression", "lda"],
    feature_scaling=[],  # feature preprocessing is disabled
    score_threshold=0.4,
    workspace=workspace,
)

log.add(sys.stderr, level="INFO")

study.run()

Result

The following messages appear in the log.

...
[2022-05-25T16:53:43.553802+0100][45426][INFO] StackingEnsemble failed Input contains NaN, infinity or a value too large for dtype('float64').
[2022-05-25T16:53:43.579949+0100][45426][INFO] AggregatingEnsemble failed Input contains NaN, infinity or a value too large for dtype('float64').
...

Note

This is caused by the input validation in the fit method of the upstream module combo:

    ...
    def fit(self, X, y):
        """Fit classifier.
        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The input samples.
        y : numpy array of shape (n_samples,), optional (default=None)
            The ground truth of the input samples (labels).
        """

        # Validate inputs X and y
        X, y = check_X_y(X, y)
        X = check_array(X)
        self._set_n_classes(y)
    ...

StackingEnsemble and AggregatingEnsemble fail at the check_X_y call even though an imputer is included in the pipeline. The input data should be imputed before being passed to these ensembles. Alternatively, this behavior could be overridden with a custom implementation.
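The failure mode can be reproduced without Adjutorium or combo. The sketch below is an illustration only, not the project's fix. It assumes that the `check_X_y` in the excerpt above is scikit-learn's `sklearn.utils.check_X_y` called with default arguments, and it uses `SimpleImputer` as a stand-in for whichever imputer the study selects. The helper `passes_validation` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.utils import check_X_y

X, y = load_breast_cancer(return_X_y=True)
X = X.copy()

# Simulate missingness in one column (index 1 is "mean texture").
rng = np.random.default_rng(0)
rows = rng.choice(len(X), size=10, replace=False)
X[rows, 1] = np.nan


def passes_validation(X, y):
    """Hypothetical helper: True if default scikit-learn validation accepts X, y."""
    try:
        check_X_y(X, y)  # rejects NaN and infinite values by default
        return True
    except ValueError:
        return False


# Raw data with NaN is rejected, which mirrors the logged ensemble failure.
raw_ok = passes_validation(X, y)

# Imputing first (assumption: mean imputation) lets the same check pass.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
imputed_ok = passes_validation(X_imputed, y)

print(raw_ok, imputed_ok)
```

If the assumption holds, this suggests that applying the selected imputer to the data before it reaches the ensemble's `fit` would be enough to satisfy the validation, without changing the upstream check itself.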