Open lennartvandeguchte opened 3 weeks ago
I managed to resolve the sklearn issue when using the 'mondrian_numpy' strategy in the example above by using indices in the ColumnTransformer instead of column names:
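from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# data, numeric_features and categorical_features come from the reproduction example.
numerical_indices = [data.columns.get_loc(col) for col in numeric_features]
categorical_indices = [data.columns.get_loc(col) for col in categorical_features]
ct = ColumnTransformer([
    ("site", OneHotEncoder(), categorical_indices),
    ("features", RobustScaler(), numerical_indices),
])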
I don't know whether the package maintainers still want the MondrianCP class to handle Pandas dataframes. If not, this issue can be closed.
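For readers who want to try the index-based workaround in isolation, here is a minimal, self-contained sketch. The toy data and its column names are hypothetical and are not taken from the original reproduction code; `sparse_threshold=0.0` is added only so the output is always a dense array that is easy to compare. It illustrates that a ColumnTransformer configured with integer positions accepts both a DataFrame and the NumPy array obtained from it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical toy data; the column names are illustrative only.
data = pd.DataFrame({
    "site": ["a", "b", "a", "c"],
    "age": [31.0, 45.0, 52.0, 38.0],
    "score": [0.2, 0.9, 0.4, 0.7],
})
categorical_features = ["site"]
numeric_features = ["age", "score"]

# Translate column names into integer positions, as in the workaround above.
categorical_indices = [data.columns.get_loc(col) for col in categorical_features]
numerical_indices = [data.columns.get_loc(col) for col in numeric_features]

ct = ColumnTransformer(
    [
        ("site", OneHotEncoder(), categorical_indices),
        ("features", RobustScaler(), numerical_indices),
    ],
    sparse_threshold=0.0,  # assumption for this sketch: always return a dense array
)

# Integer positions work on a plain NumPy array (object dtype here, since
# the columns have mixed types) as well as on the original DataFrame.
X_from_array = ct.fit_transform(data.to_numpy())
X_from_frame = ct.fit_transform(data)
```

The trade-off is that positions silently point at the wrong columns if the column order changes, so the indices should be recomputed from the same DataFrame that is later converted to an array.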
Hi @lennartvandeguchte, thank you for reporting this. Good to know you found a workaround.
We need further internal discussion to decide what to do about this. We'll let you know.
Best,
Following our discussion: support for Pandas dataframes is something we'd like to have, but it is not a quick win. In a prefit setting it is easy to address. In a split or cross setting, however, we call .fit on the provided estimator (which can be a pipeline), so we need to avoid casting X, y to NDArray; otherwise we lose pd.DataFrame functionality that the pipeline may require.
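The effect of such a cast can be reproduced with scikit-learn alone. The sketch below is an illustration under assumptions, not MAPIE's internal code: the toy data, column names and the LinearRegression estimator are hypothetical. It fits a pipeline whose ColumnTransformer selects columns by name, first on a DataFrame and then on the same data cast to a NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical toy data; the column names are illustrative only.
X = pd.DataFrame({
    "site": ["a", "b", "a", "c"],
    "age": [31.0, 45.0, 52.0, 38.0],
})
y = np.array([1.0, 2.0, 1.5, 3.0])

# The transformer refers to columns by name, so it needs a DataFrame.
pipeline = make_pipeline(
    ColumnTransformer([
        ("site", OneHotEncoder(), ["site"]),
        ("features", RobustScaler(), ["age"]),
    ]),
    LinearRegression(),
)

# Fitting on the DataFrame works.
pipeline.fit(X, y)
preds = pipeline.predict(X)

# Casting to an array drops the column names, so name-based selection fails.
try:
    pipeline.fit(np.asarray(X), y)
    array_fit_failed = False
except ValueError as exc:
    array_fit_failed = True
    print(f"Fitting on the array failed: {exc}")
```

This is why any conversion to an array before the estimator's .fit call breaks pipelines of this kind, while the same conversion is harmless when the estimator is already fitted.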
We're adding this to our backlog.
Describe the bug
When using the new MondrianCP class, I'm unable to fit my estimator with a Pandas dataframe, whereas this works fine with the standard MapieRegressor. My sklearn pipeline contains column transformers that select columns by their pandas column names, so I can't convert my data to a numpy array first: sklearn then raises an error when fitting the estimator.
To Reproduce
Below is the code to reproduce my problem.
By changing the strategy to mondrian_numpy you can also reproduce the sklearn error I receive.
Expected behavior
Be able to use a Pandas dataframe as input data for the MondrianCP class.