scikit-learn-contrib / MAPIE

A scikit-learn-compatible module to estimate prediction intervals and control risks based on conformal predictions.
https://mapie.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

MondrianCP can't handle Pandas dataframe #526

Open lennartvandeguchte opened 3 weeks ago

lennartvandeguchte commented 3 weeks ago

Describe the bug

When using the new MondrianCP class I'm unable to fit my estimator with a Pandas dataframe, while using the standard MapieRegressor this works fine. Since I'm using a sklearn pipeline that contains some column transformers that use the pandas column name, I can't transform my data into a numpy array first because then sklearn gives me an error when fitting the estimator.

To Reproduce

The code below reproduces the problem.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor, ColumnTransformer
from sklearn.preprocessing import  RobustScaler, OneHotEncoder
from mapie.regression import MapieRegressor
from mapie.mondrian import MondrianCP
from lightgbm import LGBMRegressor
import pandas as pd
from sklearn.model_selection import train_test_split

# Create some dummy data
data = pd.DataFrame(np.random.rand(100, 5), columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
data['categorical_feature'] = np.random.choice(['A', 'B', 'C'], size=100)
y = pd.Series(np.random.rand(100))

# Create bins for the partition
data['BIN'] = pd.cut(y, bins=3, labels=[1, 2, 3])

# Split the data into a train and calibration set
data_train, data_calib, y_train, y_calib = train_test_split(data, y, test_size=0.2, random_state=42)

model = LGBMRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_child_samples=10,
    num_leaves=31,
    random_state=42
)

ct = ColumnTransformer([
    ("site", OneHotEncoder(), ['categorical_feature']),
    ("features", RobustScaler(), ['feature1', 'feature2', 'feature3', 'feature4', 'feature5']),
    ])
estimators = [('transformers',ct), ('model',  model)]
pre_pipe = Pipeline(estimators)
pipe = TransformedTargetRegressor(regressor=pre_pipe, transformer=RobustScaler())
pipe.fit(data_train, y_train)

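# Choose "mondrian", "mondrian_numpy", or any other value for plain MapieRegressor
strategy = "mondrian"
if strategy == "mondrian":
    # MondrianCP fitted directly on the DataFrame: this is the reported failure
    mapie_regressor = MondrianCP(MapieRegressor(pipe, cv='prefit'))
    mapie_regressor.fit(data_calib, y_calib, partition=data_calib['BIN'])
elif strategy == "mondrian_numpy":
    # Converting to NumPy first drops the column names the ColumnTransformer relies on
    mapie_regressor = MondrianCP(MapieRegressor(pipe, cv='prefit'))
    mapie_regressor.fit(data_calib.to_numpy(), y_calib, partition=data_calib['BIN'])
else:
    # Standard MapieRegressor accepts the DataFrame without problems
    mapie_regressor = MapieRegressor(estimator=pipe, cv='prefit')
    mapie_regressor = mapie_regressor.fit(data_calib, y_calib)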

By changing the strategy to mondrian_numpy you can also reproduce the sklearn error I receive.

Expected behavior

Be able to use a Pandas dataframe as input data for the MondrianCP class.

lennartvandeguchte commented 3 weeks ago

I managed to resolve the sklearn issue when using the 'mondrian_numpy' strategy in the example above by using indices in the ColumnTransformer instead of column names:

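# Column lists matching the reproduction example above
numeric_features = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5']
categorical_features = ['categorical_feature']

# Positional indices remain valid after the DataFrame is converted to a NumPy array
numerical_indices = [data.columns.get_loc(col) for col in numeric_features]
categorical_indices = [data.columns.get_loc(col) for col in categorical_features]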

ct = ColumnTransformer([
    ("site", OneHotEncoder(), categorical_indices),
    ("features", RobustScaler(), numerical_indices),
    ])

I don't know if the package maintainers still want the MondrianCP class to handle Pandas dataframes? Otherwise this issue can be closed.
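
For anyone who wants to check this workaround in isolation, here is a minimal sketch that needs only pandas and scikit-learn (no MAPIE or LightGBM). The toy data, the variable names, and the `sparse_threshold=0` setting are assumptions chosen for illustration. It only shows that an index-based ColumnTransformer produces the same output on a DataFrame and on its NumPy conversion.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data (illustrative only): two numeric columns and one categorical column
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((30, 2)), columns=["feature1", "feature2"])
df["categorical_feature"] = np.tile(["A", "B", "C"], 10)

numeric_features = ["feature1", "feature2"]
categorical_features = ["categorical_feature"]

# Translate column names into positions once, while the DataFrame is still available
numerical_indices = [df.columns.get_loc(col) for col in numeric_features]
categorical_indices = [df.columns.get_loc(col) for col in categorical_features]

ct = ColumnTransformer(
    [
        ("site", OneHotEncoder(), categorical_indices),
        ("features", RobustScaler(), numerical_indices),
    ],
    sparse_threshold=0,  # always return a dense array so the outputs are easy to compare
)

# Fit one fresh copy on the DataFrame and another on the plain (object-dtype) array
out_from_frame = clone(ct).fit_transform(df)
out_from_array = clone(ct).fit_transform(df.to_numpy())

same_output = np.allclose(out_from_frame, out_from_array)
print(same_output)
```

The positions are computed from the full DataFrame, so they stay correct only as long as the array passed later has the same column order.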

Valentin-Laurent commented 3 weeks ago

Hi @lennartvandeguchte, thank you for reporting this. Good to know you found a workaround.

We need further internal discussion to decide what to do about this. We'll let you know.

Best,

Valentin-Laurent commented 3 weeks ago

Following our discussion: support for pandas DataFrames is something we'd like to have, but it is not a quick win. In a prefit setting it is easy to address. In a split or cross setting, however, we call .fit on the provided estimator (which can be a pipeline), so we need to avoid casting X, y to NDArray; otherwise we lose pd.DataFrame functionality that the pipeline may require.

We're adding this to our backlog.
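
To illustrate the casting concern, the sketch below uses plain scikit-learn (no MAPIE); the data and the helper names `make_ct` and `fits` are made up for illustration. It mimics the cast with `np.asarray` and shows that a name-based ColumnTransformer fits on the DataFrame but raises once the column names are gone.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Tiny illustrative dataset with one numeric and one categorical column
df = pd.DataFrame({
    "feature1": [0.1, 0.5, 0.9, 0.3],
    "categorical_feature": ["A", "B", "A", "C"],
})


def make_ct():
    """Build a ColumnTransformer that selects columns by name (hypothetical helper)."""
    return ColumnTransformer([
        ("site", OneHotEncoder(), ["categorical_feature"]),
        ("features", RobustScaler(), ["feature1"]),
    ])


def fits(X):
    """Return True if a fresh name-based ColumnTransformer can be fitted on X."""
    try:
        make_ct().fit(X)
        return True
    except ValueError:
        # scikit-learn rejects string column selectors on inputs without column names
        return False


fits_on_frame = fits(df)              # column names are available
fits_on_array = fits(np.asarray(df))  # the cast drops the column names
print(fits_on_frame, fits_on_array)
```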