Hi @edgBR, and thanks for spotting this! Could you provide a complete, minimal, and reproducible example? Typically with a dataframe of 3-4 hand-made lines and a minimal number of lines of code (I think fewer than 10 would suffice).
Hi @gmartinonQM, unfortunately I cannot share the data, but I have tried with another dataset, attached here: kaggle_dataset.csv
It comes from this Kaggle kernel:
https://www.kaggle.com/riantowibisono/loan-classification-machine-learning/notebook
After doing:
import numpy as np
import pandas as pd
df = pd.read_csv('kaggle_dataset.csv')
df_1 = df.dropna(axis=1, thresh=int(0.70*len(df)))
df_1.head()
df_clean = df_1[[
    'loan_status_fullyPaid', 'term', 'int_rate',
    'installment', 'grade', 'annual_inc',
    'verification_status', 'dti'  # These features are just an initial guess; you can try any other combination
]].copy()
df_clean.head()
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
df_clean['term'] = label.fit_transform(df_clean['term'])
df_clean['grade'] = label.fit_transform(df_clean['grade'])
df_clean['verification_status'] = label.fit_transform(df_clean['verification_status'])
x = df_clean.drop(['loan_status_fullyPaid'], axis=1)
y = df_clean['loan_status_fullyPaid']
import numpy as np
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor, make_column_selector as selector
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
categorical_transformer = OneHotEncoder(handle_unknown="ignore", categories='auto')
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("variance_selector", VarianceThreshold(threshold=0.03))
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_only", numeric_transformer, [2, 4, 6]),
        ("get_dummies", categorical_transformer, [0, 3, 5])
    ],
    remainder='passthrough'
)
from sklearn.model_selection import train_test_split
xtr, xts, ytr, yts = train_test_split(
    x,
    y,
    test_size=0.2
)
from sklearn.metrics import roc_auc_score, plot_roc_curve, roc_curve, confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingRegressor, HistGradientBoostingClassifier
pipeline_hist_boost_clf = Pipeline([('preprocessor', preprocessor),
                                    ('estimator', HistGradientBoostingClassifier())])
pipeline_hist_boost_clf.fit(xtr, ytr)
from mapie.classification import MapieClassifier
mapie_classifier = MapieClassifier(pipeline_hist_boost_clf)
mapie_classifier.fit(xtr, ytr)
This time with a classification model following the example of:
https://mapie.readthedocs.io/en/latest/examples_classification/plot_sadinle2019_example.html
I get a similar error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-27-b25355019e0d> in <module>
2
3 mapie_classifier = MapieClassifier(pipeline_hist_boost_clf)
----> 4 mapie_classifier.fit(xtr, ytr)
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/mapie/classification.py in fit(self, X, y, sample_weight)
511 X, y, force_all_finite=False, dtype=["float64", "int", "object"]
512 )
--> 513 assert type_of_target(y) == "multiclass"
514 self.n_features_in_ = check_n_features_in(X, cv, estimator)
515 sample_weight, X, y = check_null_weight(sample_weight, X, y)
AssertionError:
It seems that this force_all_finite is the main issue. I will be happy to contribute to the debugging; if needed, I could offer myself for a Teams call.
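For reference, type_of_target here is scikit-learn's helper from sklearn.utils.multiclass; a minimal sketch of what it returns for a binary target like mine:
from sklearn.utils.multiclass import type_of_target
print(type_of_target([0, 1, 1, 0]))  # 'binary' -> fails the assert in MapieClassifier.fit
print(type_of_target([0, 1, 2, 1]))  # 'multiclass' -> would pass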
BR E
Thanks again @edgBR! Actually, this is not a minimal example. There are two efficient ways of solving the issue:
Which option would you prefer?
For option 1, could you create a toy dataset (3-4 lines, 2-3 columns) and a minimalistic scikit-learn pipeline reproducing the bug? This could help create a unit test in the future, to ensure non-regression after the bug fix.
Hi @gmartinonQM,
Unfortunately my knowledge of scikit-learn pipelines is not that great yet (I was using R and recipes in the past, and my scikit-learn usage was limited to the classical way, i.e. no pipeline objects).
Therefore I will go with option 1. I will attach the example dataset in a couple of hours and update the original bug issue.
BR E
Hi @gmartinonQM
Minimal example below:
import pandas as pd
test_df = pd.DataFrame({'loan_status_fullyPaid': [1, 1, 1, 0],
                        'term': ['0', '1', '0', '1'],
                        'int_rate': [19.20, 19.99, 6.49, 30.94],
                        'installment': [739.74, 233.10, 597.57, 426.49],
                        'grade': ['3', '4', '0', '6'],
                        'annual_inc': [45000, 66000, 125000, 36000],
                        'verification_status': ['2', '2', '0', '2'],
                        'dti': [10.16, 10.95, 6.57, 18.19]})
x = test_df.drop(['loan_status_fullyPaid'], axis=1)
y = test_df['loan_status_fullyPaid']
import numpy as np
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor, make_column_selector as selector
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
categorical_transformer = OneHotEncoder(handle_unknown="ignore", categories='auto')
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("variance_selector", VarianceThreshold(threshold=0.03))
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_only", numeric_transformer, [2, 4, 6]),
        ("get_dummies", categorical_transformer, [0, 3, 5])
    ],
    remainder='passthrough'
)
from sklearn.model_selection import train_test_split
xtr, xts, ytr, yts = train_test_split(
    x,
    y,
    test_size=0.1
)
from sklearn.ensemble import HistGradientBoostingClassifier
pipeline_hist_boost_clf = Pipeline([('preprocessor', preprocessor),
                                    ('estimator', HistGradientBoostingClassifier())])
pipeline_hist_boost_clf.fit(xtr, ytr)
from mapie.classification import MapieClassifier
mapie_classifier = MapieClassifier(pipeline_hist_boost_clf)
mapie_classifier.fit(xtr, ytr)
Hi @gmartinonQM
Did you manage to get a hint of why this could be happening?
Hi @edgBR,
here is a minimal working example that abstracts away all your use-case particularities:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from mapie.classification import MapieClassifier
np.random.seed(2)
n = 20
x = pd.DataFrame(
    {
        "x_cat": np.random.choice(["A", "B", "C"], size=n),
        "x_num": np.random.randn(n)
    }
)
y = pd.Series(np.random.choice([0, 1, 2], size=n))
categorical_transformer = OneHotEncoder(handle_unknown="ignore", categories="auto")
preprocessor = ColumnTransformer(
    transformers=[
        ("get_dummies", categorical_transformer, [0])
    ],
    remainder="passthrough"
)
estimator = GaussianNB()
model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("estimator", estimator)
    ]
)
mapie_classifier = MapieClassifier(model)
mapie_classifier.fit(x, y)
This code executes correctly, so I cannot reproduce your bug.
At this point, a few comments may be useful for you:
- In your example, your target is binary. MAPIE is not suited for binary classification, only for multi-class classification. Only in this setting does the notion of "prediction set" make sense. For a binary notion of uncertainty, refer to binary calibration (a small sketch follows this list): https://scikit-learn.org/stable/modules/calibration.html
- Note that this is why I used a natively multiclass estimator, GaussianNB, instead of a gradient boosting classifier only suited for binary classification.
- In your second code example, the error you get is different from the original one you mentioned in the issue. This is just an assert in the MAPIE code checking that we are indeed in the multiclass setting. In the logs, you can read: AssertionError: assert type_of_target(y) == "multiclass"
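As a side note, a minimal sketch of the binary-calibration route linked above, assuming scikit-learn's CalibratedClassifierCV and random toy data:
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import HistGradientBoostingClassifier
np.random.seed(0)
X = np.random.randn(100, 3)
y = np.random.choice([0, 1], size=100)
# Cross-validated calibration: wraps the classifier so predict_proba is calibrated
calibrated = CalibratedClassifierCV(HistGradientBoostingClassifier(), cv=3)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]  # calibrated P(y = 1)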
If you think we should iterate further and change the code base all the same, please start from the minimal working example I just provided to present your diagnosis.
Happy to help, and feel free to ask other questions.
Dear @gmartinonQM
The information regarding the classifier is clear and understood, but I am still not able to make it work for the regression use case. I have created another example that shows a similar error to my original issue (being unable to encode columns):
import numpy as np
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from mapie.regression import MapieRegressor
df = fetch_openml(data_id=41214, as_frame=True).frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df_train, df_test = train_test_split(df, test_size=0.33, random_state=0)
log_scale_transformer = make_pipeline(
    FunctionTransformer(np.log, validate=False), StandardScaler()
)
model_preprocessor = ColumnTransformer(
    [
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("binned_numeric", KBinsDiscretizer(n_bins=10), ["VehAge", "DrivAge"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
        (
            "categorical",
            OrdinalEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
    ],
    remainder="drop",
)
poisson_gbrt = Pipeline(
    [
        ("preprocessor", model_preprocessor),
        (
            "regressor",
            HistGradientBoostingRegressor(loss="poisson", max_leaf_nodes=128),
        ),
    ]
)
mapie = MapieRegressor(poisson_gbrt)
mapie.fit(df_train, df_train["Frequency"])
Error as follows:
ValueError Traceback (most recent call last)
<ipython-input-34-5fa6526911b2> in <module>
----> 1 mapie.fit(
2 df_train, df_train["Frequency"]
3 )
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/mapie/regression.py in fit(self, X, y, sample_weight)
457 cv = self._check_cv(self.cv)
458 estimator = self._check_estimator(self.estimator)
--> 459 X, y = check_X_y(
460 X, y, force_all_finite=False, dtype=["float64", "int", "object"]
461 )
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
962 raise ValueError("y cannot be None")
963
--> 964 X = check_array(
965 X,
966 accept_sparse=accept_sparse,
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
744 array = array.astype(dtype, casting="unsafe", copy=False)
745 else:
--> 746 array = np.asarray(array, order=order, dtype=dtype)
747 except ComplexWarning as complex_warning:
748 raise ValueError(
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order, like)
100 return _asarray_with_like(a, dtype=dtype, order=order, like=like)
101
--> 102 return array(a, dtype, copy=False, order=order)
103
104
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/generic.py in __array__(self, dtype)
1991
1992 def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
-> 1993 return np.asarray(self._values, dtype=dtype)
1994
1995 def __array_wrap__(
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order, like)
100 return _asarray_with_like(a, dtype=dtype, order=order, like=like)
101
--> 102 return array(a, dtype, copy=False, order=order)
103
104
ValueError: could not convert string to float: 'A'
@gmartinonQM
Here you can see that if I encode the values first mapie works just fine:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from mapie.regression import MapieRegressor
df = fetch_openml(data_id=41214, as_frame=True).frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df_final = pd.concat([df.drop(columns=["VehBrand", "VehGas", "Region", "Area"]),
                      pd.get_dummies(df[["VehBrand", "VehGas", "Region", "Area"]], sparse=True)], axis=1)
df_train, df_test = train_test_split(df_final, test_size=0.33, random_state=0)
hist_reg = Pipeline(
    [
        ("regressor", HistGradientBoostingRegressor(loss='poisson')),
    ]
)
mapie = MapieRegressor(hist_reg)
mapie.fit(df_train, df_train["Frequency"])
Hi, were you able to solve this?
I'm having a similar problem when making use of ColumnTransformer and Pipeline. I'm not really an expert in pipelines, but I have the following setup of transformation methods:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from catboost import CatBoostRegressor
# vars_cat, vars_num, train_df and target are defined earlier in my notebook
numeric_preprocessor = Pipeline(
    steps=[
        ('imputation_mean', SimpleImputer(strategy='mean')),
        ('scaler', RobustScaler())
    ]
)
categorical_preprocessor = Pipeline(
    steps=[
        ('imputation_mode', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(handle_unknown='ignore'))
    ]
)
preprocessor = ColumnTransformer([
    ('cat_preprocessor', categorical_preprocessor, vars_cat),
    ('num_preprocessor', numeric_preprocessor, vars_num)])
catboost_pipe = make_pipeline(preprocessor, CatBoostRegressor(random_state=123, verbose=0))
catboost_pipe.fit(train_df[vars_cat + vars_num], train_df[target])
This part of the code works fine. However, when I try to use MapieRegressor as shown in the quick start tutorial:
mapie = MapieRegressor(catboost_pipe)
mapie.fit(train_df[vars_cat + vars_num], train_df[target])
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
424 try:
--> 425 all_columns = X.columns
426 except AttributeError:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
C:\Users\FRANCI~1.PAR\AppData\Local\Temp/ipykernel_25908/3938012699.py in <module>
1 mapie = MapieRegressor(catboost_pipe)
----> 2 mapie.fit(train_df[vars_cat + vars_num], train_df[target])
~\Anaconda3\envs\ktp_explore\lib\site-packages\mapie\regression.py in fit(self, X, y, sample_weight)
489 self.n_samples_val_ = [X.shape[0]]
490 else:
--> 491 self.single_estimator_ = fit_estimator(
492 clone(estimator), X, y, sample_weight
493 )
~\Anaconda3\envs\ktp_explore\lib\site-packages\mapie\utils.py in fit_estimator(estimator, X, y, sample_weight)
112 estimator.fit(X, y, sample_weight=sample_weight)
113 else:
--> 114 estimator.fit(X, y)
115 return estimator
116
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
328 """
329 fit_params_steps = self._check_fit_params(**fit_params)
--> 330 Xt = self._fit(X, y, **fit_params_steps)
331 with _print_elapsed_time('Pipeline',
332 self._log_message(len(self.steps) - 1)):
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
290 cloned_transformer = clone(transformer)
291 # Fit or load from cache the current transformer
--> 292 X, fitted_transformer = fit_transform_one_cached(
293 cloned_transformer, X, y, None,
294 message_clsname='Pipeline',
~\Anaconda3\envs\ktp_explore\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
347
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
350
351 def call_and_shelve(self, *args, **kwargs):
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
738 with _print_elapsed_time(message_clsname, message):
739 if hasattr(transformer, 'fit_transform'):
--> 740 res = transformer.fit_transform(X, y, **fit_params)
741 else:
742 res = transformer.fit(X, y, **fit_params).transform(X)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
527 self._validate_transformers()
528 self._validate_column_callables(X)
--> 529 self._validate_remainder(X)
530
531 result = self._fit_transform(X, y, _fit_transform_one)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\compose\_column_transformer.py in _validate_remainder(self, X)
325 cols = []
326 for columns in self._columns:
--> 327 cols.extend(_get_column_indices(X, columns))
328
329 remaining_idx = sorted(set(range(self._n_features)) - set(cols))
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
425 all_columns = X.columns
426 except AttributeError:
--> 427 raise ValueError("Specifying the columns using strings is only "
428 "supported for pandas DataFrames")
429 if isinstance(key, str):
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Is this an error in my code? I'm not sure, because catboost_pipe.fit(train_df[vars_cat + vars_num], train_df[target]) worked as expected.
Thanks for the help!
Hi @gmartinonQM, would you be able to show/give an example similar to your minimal working example above, but with more than one categorical feature? I see that in the ColumnTransformer section you refer to the categorical column as [0]. Is it possible to refer to the columns by their names instead? For example: ['column_1', 'column_2', ...].
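Outside of MAPIE, name-based selection does work for me when the input is a pandas DataFrame; a minimal sketch with made-up column names:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({"x_cat": ["A", "B", "A"], "x_num": [1.0, 2.0, 3.0]})
ct = ColumnTransformer(
    [("ohe", OneHotEncoder(), ["x_cat"])],  # a column name instead of [0]
    remainder="passthrough"
)
ct.fit_transform(df)  # works because df is a DataFrame with named columns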
I have made this small dataframe/example with some sample values: two categorical features, one numerical feature, and a target column, alongside two different transformations used within ColumnTransformer, with CatBoostRegressor as the predictor.
# Sample df
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from catboost import CatBoostRegressor
from mapie.regression import MapieRegressor
sample_df = pd.DataFrame(
    {
        "x_cat_1": ['type_1', 'type_2', 'type_3', 'type_1', 'type_2', np.nan],
        "x_cat_2": ['size_1', 'size_1', 'size_2', np.nan, 'size_1', 'size_2'],
        "x_num_1": [0, 1, 1, 4, np.nan, 5],
        'target': [5, 7, 3, 9, 10, 8]
    }
)
vars_cat = ['x_cat_1', 'x_cat_2']
vars_num = ['x_num_1']
target = ['target']
# Transformations
numeric_preprocessor = Pipeline(
    steps=[
        ('imputation_mean', SimpleImputer(strategy='mean')),
        ('scaler', RobustScaler())
    ]
)
categorical_preprocessor = Pipeline(
    steps=[
        ('imputation_mode', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(handle_unknown='ignore'))
    ]
)
# Define a column transformer that applies the previous transformers to the specified columns
preprocessor = ColumnTransformer([
    ('cat_preprocessor', categorical_preprocessor, vars_cat),
    ('num_preprocessor', numeric_preprocessor, vars_num)])
# Pipeline
catboost_pipe = make_pipeline(preprocessor, CatBoostRegressor(random_state=123, verbose=0))
# MAPIE Regressor
mapie = MapieRegressor(catboost_pipe)
mapie.fit(sample_df[vars_cat + vars_num], sample_df[target])
Error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
424 try:
--> 425 all_columns = X.columns
426 except AttributeError:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
C:\Users\FRANCI~1.PAR\AppData\Local\Temp/ipykernel_25908/3044912112.py in <module>
1 mapie = MapieRegressor(catboost_pipe)
----> 2 mapie.fit(sample_df[vars_cat + vars_num], sample_df[target])
~\Anaconda3\envs\ktp_explore\lib\site-packages\mapie\regression.py in fit(self, X, y, sample_weight)
489 self.n_samples_val_ = [X.shape[0]]
490 else:
--> 491 self.single_estimator_ = fit_estimator(
492 clone(estimator), X, y, sample_weight
493 )
~\Anaconda3\envs\ktp_explore\lib\site-packages\mapie\utils.py in fit_estimator(estimator, X, y, sample_weight)
112 estimator.fit(X, y, sample_weight=sample_weight)
113 else:
--> 114 estimator.fit(X, y)
115 return estimator
116
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
328 """
329 fit_params_steps = self._check_fit_params(**fit_params)
--> 330 Xt = self._fit(X, y, **fit_params_steps)
331 with _print_elapsed_time('Pipeline',
332 self._log_message(len(self.steps) - 1)):
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
290 cloned_transformer = clone(transformer)
291 # Fit or load from cache the current transformer
--> 292 X, fitted_transformer = fit_transform_one_cached(
293 cloned_transformer, X, y, None,
294 message_clsname='Pipeline',
~\Anaconda3\envs\ktp_explore\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
347
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
350
351 def call_and_shelve(self, *args, **kwargs):
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
738 with _print_elapsed_time(message_clsname, message):
739 if hasattr(transformer, 'fit_transform'):
--> 740 res = transformer.fit_transform(X, y, **fit_params)
741 else:
742 res = transformer.fit(X, y, **fit_params).transform(X)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
527 self._validate_transformers()
528 self._validate_column_callables(X)
--> 529 self._validate_remainder(X)
530
531 result = self._fit_transform(X, y, _fit_transform_one)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\compose\_column_transformer.py in _validate_remainder(self, X)
325 cols = []
326 for columns in self._columns:
--> 327 cols.extend(_get_column_indices(X, columns))
328
329 remaining_idx = sorted(set(range(self._n_features)) - set(cols))
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
425 all_columns = X.columns
426 except AttributeError:
--> 427 raise ValueError("Specifying the columns using strings is only "
428 "supported for pandas DataFrames")
429 if isinstance(key, str):
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
@edgBR (quoting your earlier comment above, where encoding the values first made MAPIE work fine):
Were you able to solve this by not transforming/encoding explicitly before using MAPIE?
Hi @fjpa121197
Edit: no, I have noticed that if I refer to the columns by name in any preprocessing step, I get the same error as you.
Hi @edgBR, thanks for letting me know. I'm waiting for @gmartinonQM; maybe we are referring to the columns in a wrong way.
Thanks anyway!
Hi @edgBR @fjpa121197, indeed, all your problems have the same cause: MAPIE, at some point, requires that the input data is (or is convertible to) a numpy array. When using pipelines based on column names, a bug comes out because the names disappear during the conversion. I have begun a pull request about this: https://github.com/scikit-learn-contrib/MAPIE/pull/136
This fixes the issue for regression (not for classification yet), but breaks scikit-learn estimator compatibility. Some additional work is needed to make all unit tests pass.
Feel free to suggest changes to this pull request. In the meantime, I will try to converge to a solution in the coming weeks.
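A minimal sketch of the root cause described above: once a DataFrame is converted to a numpy array, string column keys can no longer be resolved.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({"cat": ["A", "B"], "num": [1.0, 2.0]})
ct = ColumnTransformer([("ohe", OneHotEncoder(), ["cat"])], remainder="passthrough")
ct.fit_transform(df)             # fine: "cat" resolves against df.columns
ct.fit_transform(df.to_numpy())  # ValueError: specifying the columns using strings
                                 # is only supported for pandas DataFrames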
Hi @fjpa121197
A dirty trick to bypass this:
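A rough sketch of the idea (reusing preprocessor, sample_df, vars_cat and vars_num from the example above): fit the preprocessing step outside of MAPIE, transform the data into a plain numeric array, and give MAPIE only the bare estimator.
import scipy.sparse as sp
from catboost import CatBoostRegressor
from mapie.regression import MapieRegressor
x_raw = sample_df[vars_cat + vars_num]
y = sample_df['target']
# Fit the preprocessing step separately, on the training data only
x_enc = preprocessor.fit_transform(x_raw)
if sp.issparse(x_enc):
    x_enc = x_enc.toarray()  # MAPIE expects a dense numeric array
# MAPIE now only sees numbers, so no string-based column selection is involved
mapie = MapieRegressor(CatBoostRegressor(random_state=123, verbose=0))
mapie.fit(x_enc, y)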
This seems like a point of friction, and I am not sure if pipeline.transform() transforms the data with the right parameters (e.g. with a simple mean imputer, I am not sure whether it transforms with the mean of the training set or refits again).
Anyhow, high hopes for the fix @gmartinonQM.
Hi @edgBR
I did this, and I was able to use MAPIE without a problem. When calling pipeline.transform() on the test set, it should not refit the transformers again.
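A quick way to convince oneself, as a minimal sketch: transform() reuses the statistics learned at fit time; only fit() and fit_transform() update them.
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
imputer.fit([[1.0], [3.0]])            # learned mean is 2.0
print(imputer.transform([[np.nan]]))   # [[2.]] -> training mean, not refitted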
Thanks for helping with this @gmartinonQM, looking forward to the fix!
Good news @fjpa121197 @edgBR, I have managed to resolve all side effects and unit tests. The linked PR will be merged soon, and the bug will disappear in the next MAPIE release.
Thanks @gmartinonQM!
Describe the bug
Dear colleagues, I am creating a system to classify customers into two classes and then apply a regression model to one of the classes.
Some of my features are strings that I obviously need to encode, in this case with one-hot encoding.
To Reproduce
My code is as follows:
Expected behavior
After this, I expect to be able to run:
y_pred, y_pis = mapie_estimator.predict(data_test)
Screenshots
The value in the screenshot is part of one of the categorical columns being encoded by the preprocessor.
When training the model without MAPIE, everything works correctly:
Desktop (please complete the following information):
Scikit learn dependencies: