Hi @edgBR, and thanks for spotting this! Could you provide a complete, minimal, and reproducible example? Typically with a dataframe of 3-4 hand-made lines and a minimal number of lines of code (I think fewer than 10 would suffice).
Hi @gmartinonQM, unfortunately I cannot share the data, but I have tried with another dataset, attached here: kaggle_dataset.csv
It comes from this Kaggle kernel:
https://www.kaggle.com/riantowibisono/loan-classification-machine-learning/notebook
After doing:
import numpy as np
import pandas as pd
df = pd.read_csv('kaggle_dataset.csv')
df_1 = df.dropna(axis=1, thresh=int(0.70*len(df)))
df_1.head()
df_clean = df_1[[
    'loan_status_fullyPaid', 'term', 'int_rate',
    'installment', 'grade', 'annual_inc',
    'verification_status', 'dti'  # These features are just an initial guess; you can try any other combination
]].copy()
df_clean.head()
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
df_clean['term'] = label.fit_transform(df_clean['term'])
df_clean['grade'] = label.fit_transform(df_clean['grade'])
df_clean['verification_status'] = label.fit_transform(df_clean['verification_status'])
x = df_clean.drop(['loan_status_fullyPaid'], axis=1)
y = df_clean['loan_status_fullyPaid']
import numpy as np
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor, make_column_selector as selector
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
categorical_transformer = OneHotEncoder(handle_unknown="ignore", categories='auto')
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("variance_selector", VarianceThreshold(threshold=0.03))
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_only", numeric_transformer, [2, 4, 6]),
        ("get_dummies", categorical_transformer, [0, 3, 5])
    ],
    remainder='passthrough'
)
from sklearn.model_selection import train_test_split
xtr, xts, ytr, yts = train_test_split(
    x,
    y,
    test_size=0.2
)
from sklearn.metrics import roc_auc_score, plot_roc_curve, roc_curve, confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingRegressor, HistGradientBoostingClassifier
pipeline_hist_boost_clf = Pipeline([('preprocessor', preprocessor),
                                    ('estimator', HistGradientBoostingClassifier())])
pipeline_hist_boost_clf.fit(xtr, ytr)
from mapie.classification import MapieClassifier
mapie_classifier = MapieClassifier(pipeline_hist_boost_clf)
mapie_classifier.fit(xtr, ytr)
This time with a classification model following the example of:
https://mapie.readthedocs.io/en/latest/examples_classification/plot_sadinle2019_example.html
I get a similar error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-27-b25355019e0d> in <module>
2
3 mapie_classifier = MapieClassifier(pipeline_hist_boost_clf)
----> 4 mapie_classifier.fit(xtr, ytr)
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/mapie/classification.py in fit(self, X, y, sample_weight)
511 X, y, force_all_finite=False, dtype=["float64", "int", "object"]
512 )
--> 513 assert type_of_target(y) == "multiclass"
514 self.n_features_in_ = check_n_features_in(X, cv, estimator)
515 sample_weight, X, y = check_null_weight(sample_weight, X, y)
AssertionError:
It seems that this force_all_finite is the main issue. I will be happy to contribute to the debugging; if needed, I could offer myself for a Teams call.
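For reference, type_of_target here is scikit-learn's helper from sklearn.utils.multiclass; a minimal sketch of what it returns for a binary target like mine:
from sklearn.utils.multiclass import type_of_target
print(type_of_target([0, 1, 1, 0]))  # 'binary' -> fails the assert in MapieClassifier.fit
print(type_of_target([0, 1, 2, 1]))  # 'multiclass' -> would pass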
BR E
Thanks again @edgBR! Actually, this is not a minimal example. There are two efficient ways of solving the issue:
Which option would you prefer?
For option 1, could you create a toy dataset (3-4 lines, 2-3 columns) and a minimalistic scikit-learn pipeline reproducing the bug? This could help create a unit test in the future, to ensure non-regression after the bug fix.
Hi @gmartinonQM,
Unfortunately my knowledge of scikit-learn pipelines is not that great yet (I was using R and recipes in the past, and my scikit-learn usage was limited to the classical way, i.e. no pipeline objects).
Therefore I will go with option 1. I will attach the example dataset in a couple of hours and update the original bug issue.
BR E
Hi @gmartinonQM
Minimal example below:
import pandas as pd
test_df = pd.DataFrame({'loan_status_fullyPaid': [1, 1, 1, 0],
                        'term': ['0', '1', '0', '1'],
                        'int_rate': [19.20, 19.99, 6.49, 30.94],
                        'installment': [739.74, 233.10, 597.57, 426.49],
                        'grade': ['3', '4', '0', '6'],
                        'annual_inc': [45000, 66000, 125000, 36000],
                        'verification_status': ['2', '2', '0', '2'],
                        'dti': [10.16, 10.95, 6.57, 18.19]})
x = test_df.drop(['loan_status_fullyPaid'], axis=1)
y = test_df['loan_status_fullyPaid']
import numpy as np
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor, make_column_selector as selector
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
categorical_transformer = OneHotEncoder(handle_unknown="ignore", categories='auto')
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("variance_selector", VarianceThreshold(threshold=0.03))
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_only", numeric_transformer, [2, 4, 6]),
        ("get_dummies", categorical_transformer, [0, 3, 5])
    ],
    remainder='passthrough'
)
from sklearn.model_selection import train_test_split
xtr, xts, ytr, yts = train_test_split(
    x,
    y,
    test_size=0.1
)
from sklearn.ensemble import HistGradientBoostingClassifier
pipeline_hist_boost_clf = Pipeline([('preprocessor', preprocessor),
                                    ('estimator', HistGradientBoostingClassifier())])
pipeline_hist_boost_clf.fit(xtr, ytr)
from mapie.classification import MapieClassifier
mapie_classifier = MapieClassifier(pipeline_hist_boost_clf)
mapie_classifier.fit(xtr, ytr)
Hi @gmartinonQM
Did you manage to get a hint of why this could be happening?
Hi @edgBR,
here is a minimal working example that abstracts away all your use-case particularities:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from mapie.classification import MapieClassifier
np.random.seed(2)
n = 20
x = pd.DataFrame(
    {
        "x_cat": np.random.choice(["A", "B", "C"], size=n),
        "x_num": np.random.randn(n)
    }
)
y = pd.Series(np.random.choice([0, 1, 2], size=n))
categorical_transformer = OneHotEncoder(handle_unknown="ignore", categories="auto")
preprocessor = ColumnTransformer(
    transformers=[
        ("get_dummies", categorical_transformer, [0])
    ],
    remainder="passthrough"
)
estimator = GaussianNB()
model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("estimator", estimator)
    ]
)
mapie_classifier = MapieClassifier(model)
mapie_classifier.fit(x, y)
This code executes correctly, so I cannot reproduce your bug.
At this point, a few comments may be useful for you:
- In your example, your target is binary. MAPIE is not suited for binary classification, only for multi-class classification. Only in this setting does the notion of "prediction set" make sense. For a binary notion of uncertainty, refer to binary calibration (a small sketch follows this list): https://scikit-learn.org/stable/modules/calibration.html
- Note that this is why I used a natively multiclass estimator, GaussianNB, instead of a gradient boosting classifier only suited for binary classification.
- In your second code example, the error you get is different from the original one you mentioned in the issue. This is just an assert in the MAPIE code checking that we are indeed in the multiclass setting. In the logs, you can read: AssertionError: assert type_of_target(y) == "multiclass"
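As a side note, a minimal sketch of the binary-calibration route linked above, assuming scikit-learn's CalibratedClassifierCV and random toy data:
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import HistGradientBoostingClassifier
np.random.seed(0)
X = np.random.randn(100, 3)
y = np.random.choice([0, 1], size=100)
# Cross-validated calibration: wraps the classifier so predict_proba is calibrated
calibrated = CalibratedClassifierCV(HistGradientBoostingClassifier(), cv=3)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]  # calibrated P(y = 1)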
If you think we should iterate further and change the code base all the same, please start from the minimal working example I just provided to present your diagnosis.
Happy to help, and feel free to ask other questions.
Dear @gmartinonQM
The information regarding the classifier is clear and understood, but I am still not able to make it work for the regression use case. I have created another example that shows a similar error to my original issue (being unable to encode columns):
import numpy as np
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from mapie.regression import MapieRegressor
df = fetch_openml(data_id=41214, as_frame=True).frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df_train, df_test = train_test_split(df, test_size=0.33, random_state=0)
log_scale_transformer = make_pipeline(
    FunctionTransformer(np.log, validate=False), StandardScaler()
)
model_preprocessor = ColumnTransformer(
    [
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("binned_numeric", KBinsDiscretizer(n_bins=10), ["VehAge", "DrivAge"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
        (
            "categorical",
            OrdinalEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
    ],
    remainder="drop",
)
poisson_gbrt = Pipeline(
    [
        ("preprocessor", model_preprocessor),
        (
            "regressor",
            HistGradientBoostingRegressor(loss="poisson", max_leaf_nodes=128),
        ),
    ]
)
mapie = MapieRegressor(poisson_gbrt)
mapie.fit(df_train, df_train["Frequency"])
Error as follows:
ValueError Traceback (most recent call last)
<ipython-input-34-5fa6526911b2> in <module>
----> 1 mapie.fit(
2 df_train, df_train["Frequency"]
3 )
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/mapie/regression.py in fit(self, X, y, sample_weight)
457 cv = self._check_cv(self.cv)
458 estimator = self._check_estimator(self.estimator)
--> 459 X, y = check_X_y(
460 X, y, force_all_finite=False, dtype=["float64", "int", "object"]
461 )
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
962 raise ValueError("y cannot be None")
963
--> 964 X = check_array(
965 X,
966 accept_sparse=accept_sparse,
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
744 array = array.astype(dtype, casting="unsafe", copy=False)
745 else:
--> 746 array = np.asarray(array, order=order, dtype=dtype)
747 except ComplexWarning as complex_warning:
748 raise ValueError(
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order, like)
100 return _asarray_with_like(a, dtype=dtype, order=order, like=like)
101
--> 102 return array(a, dtype, copy=False, order=order)
103
104
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/generic.py in __array__(self, dtype)
1991
1992 def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
-> 1993 return np.asarray(self._values, dtype=dtype)
1994
1995 def __array_wrap__(
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order, like)
100 return _asarray_with_like(a, dtype=dtype, order=order, like=like)
101
--> 102 return array(a, dtype, copy=False, order=order)
103
104
ValueError: could not convert string to float: 'A'
@gmartinonQM
Here you can see that if I encode the values first mapie works just fine:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from mapie.regression import MapieRegressor
df = fetch_openml(data_id=41214, as_frame=True).frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df_final = pd.concat([df.drop(columns=["VehBrand", "VehGas", "Region", "Area"]),
                      pd.get_dummies(df[["VehBrand", "VehGas", "Region", "Area"]], sparse=True)], axis=1)
df_train, df_test = train_test_split(df_final, test_size=0.33, random_state=0)
hist_reg = Pipeline(
    [
        ("regressor", HistGradientBoostingRegressor(loss='poisson')),
    ]
)
mapie = MapieRegressor(hist_reg)
mapie.fit(df_train, df_train["Frequency"])
Hi, were you able to solve this?
I'm having a similar problem when making use of ColumnTransformer and Pipeline. I'm not really an expert in pipelines, but I have the following setup of transformation methods:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from catboost import CatBoostRegressor
# vars_cat, vars_num, train_df and target are defined earlier in my notebook
numeric_preprocessor = Pipeline(
    steps=[
        ('imputation_mean', SimpleImputer(strategy='mean')),
        ('scaler', RobustScaler())
    ]
)
categorical_preprocessor = Pipeline(
    steps=[
        ('imputation_mode', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(handle_unknown='ignore'))
    ]
)
preprocessor = ColumnTransformer([
    ('cat_preprocessor', categorical_preprocessor, vars_cat),
    ('num_preprocessor', numeric_preprocessor, vars_num)])
catboost_pipe = make_pipeline(preprocessor, CatBoostRegressor(random_state=123, verbose=0))
catboost_pipe.fit(train_df[vars_cat + vars_num], train_df[target])
This part of the code works fine. However, when I try to use MapieRegressor as shown in the quick start tutorial:
mapie = MapieRegressor(catboost_pipe)
mapie.fit(train_df[vars_cat + vars_num], train_df[target])
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
424 try:
--> 425 all_columns = X.columns
426 except AttributeError:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
C:\Users\FRANCI~1.PAR\AppData\Local\Temp/ipykernel_25908/3938012699.py in <module>
1 mapie = MapieRegressor(catboost_pipe)
----> 2 mapie.fit(train_df[vars_cat + vars_num], train_df[target])
~\Anaconda3\envs\ktp_explore\lib\site-packages\mapie\regression.py in fit(self, X, y, sample_weight)
489 self.n_samples_val_ = [X.shape[0]]
490 else:
--> 491 self.single_estimator_ = fit_estimator(
492 clone(estimator), X, y, sample_weight
493 )
~\Anaconda3\envs\ktp_explore\lib\site-packages\mapie\utils.py in fit_estimator(estimator, X, y, sample_weight)
112 estimator.fit(X, y, sample_weight=sample_weight)
113 else:
--> 114 estimator.fit(X, y)
115 return estimator
116
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
328 """
329 fit_params_steps = self._check_fit_params(**fit_params)
--> 330 Xt = self._fit(X, y, **fit_params_steps)
331 with _print_elapsed_time('Pipeline',
332 self._log_message(len(self.steps) - 1)):
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
290 cloned_transformer = clone(transformer)
291 # Fit or load from cache the current transformer
--> 292 X, fitted_transformer = fit_transform_one_cached(
293 cloned_transformer, X, y, None,
294 message_clsname='Pipeline',
~\Anaconda3\envs\ktp_explore\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
347
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
350
351 def call_and_shelve(self, *args, **kwargs):
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
738 with _print_elapsed_time(message_clsname, message):
739 if hasattr(transformer, 'fit_transform'):
--> 740 res = transformer.fit_transform(X, y, **fit_params)
741 else:
742 res = transformer.fit(X, y, **fit_params).transform(X)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
527 self._validate_transformers()
528 self._validate_column_callables(X)
--> 529 self._validate_remainder(X)
530
531 result = self._fit_transform(X, y, _fit_transform_one)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\compose\_column_transformer.py in _validate_remainder(self, X)
325 cols = []
326 for columns in self._columns:
--> 327 cols.extend(_get_column_indices(X, columns))
328
329 remaining_idx = sorted(set(range(self._n_features)) - set(cols))
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
425 all_columns = X.columns
426 except AttributeError:
--> 427 raise ValueError("Specifying the columns using strings is only "
428 "supported for pandas DataFrames")
429 if isinstance(key, str):
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Is this an error in my code? I'm not sure, because catboost_pipe.fit(train_df[vars_cat + vars_num], train_df[target]) worked as expected.
Thanks for the help!
Hi @gmartinonQM, would you be able to show/give an example similar to your minimal working example above, but with more than one categorical feature? I see that in the ColumnTransformer section you refer to the categorical column as [0]. Is it possible to refer to the columns by their names instead? For example: ['column_1', 'column_2', ...].
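Outside of MAPIE, name-based selection does work for me when the input is a pandas DataFrame; a minimal sketch with made-up column names:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({"x_cat": ["A", "B", "A"], "x_num": [1.0, 2.0, 3.0]})
ct = ColumnTransformer(
    [("ohe", OneHotEncoder(), ["x_cat"])],  # a column name instead of [0]
    remainder="passthrough"
)
ct.fit_transform(df)  # works because df is a DataFrame with named columns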
I have made this small dataframe/example with some sample values: two categorical features, one numerical feature, and a target column, alongside two different transformations used within ColumnTransformer, with CatBoostRegressor as the predictor.
# Sample df
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from catboost import CatBoostRegressor
from mapie.regression import MapieRegressor
sample_df = pd.DataFrame(
    {
        "x_cat_1": ['type_1', 'type_2', 'type_3', 'type_1', 'type_2', np.nan],
        "x_cat_2": ['size_1', 'size_1', 'size_2', np.nan, 'size_1', 'size_2'],
        "x_num_1": [0, 1, 1, 4, np.nan, 5],
        'target': [5, 7, 3, 9, 10, 8]
    }
)
vars_cat = ['x_cat_1', 'x_cat_2']
vars_num = ['x_num_1']
target = ['target']
# Transformations
numeric_preprocessor = Pipeline(
    steps=[
        ('imputation_mean', SimpleImputer(strategy='mean')),
        ('scaler', RobustScaler())
    ]
)
categorical_preprocessor = Pipeline(
    steps=[
        ('imputation_mode', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(handle_unknown='ignore'))
    ]
)
# Define a column transformer that applies the previous transformers to the specified columns
preprocessor = ColumnTransformer([
    ('cat_preprocessor', categorical_preprocessor, vars_cat),
    ('num_preprocessor', numeric_preprocessor, vars_num)])
# Pipeline
catboost_pipe = make_pipeline(preprocessor, CatBoostRegressor(random_state=123, verbose=0))
# MAPIE Regressor
mapie = MapieRegressor(catboost_pipe)
mapie.fit(sample_df[vars_cat + vars_num], sample_df[target])
Error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
424 try:
--> 425 all_columns = X.columns
426 except AttributeError:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
C:\Users\FRANCI~1.PAR\AppData\Local\Temp/ipykernel_25908/3044912112.py in <module>
1 mapie = MapieRegressor(catboost_pipe)
----> 2 mapie.fit(sample_df[vars_cat + vars_num], sample_df[target])
~\Anaconda3\envs\ktp_explore\lib\site-packages\mapie\regression.py in fit(self, X, y, sample_weight)
489 self.n_samples_val_ = [X.shape[0]]
490 else:
--> 491 self.single_estimator_ = fit_estimator(
492 clone(estimator), X, y, sample_weight
493 )
~\Anaconda3\envs\ktp_explore\lib\site-packages\mapie\utils.py in fit_estimator(estimator, X, y, sample_weight)
112 estimator.fit(X, y, sample_weight=sample_weight)
113 else:
--> 114 estimator.fit(X, y)
115 return estimator
116
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
328 """
329 fit_params_steps = self._check_fit_params(**fit_params)
--> 330 Xt = self._fit(X, y, **fit_params_steps)
331 with _print_elapsed_time('Pipeline',
332 self._log_message(len(self.steps) - 1)):
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
290 cloned_transformer = clone(transformer)
291 # Fit or load from cache the current transformer
--> 292 X, fitted_transformer = fit_transform_one_cached(
293 cloned_transformer, X, y, None,
294 message_clsname='Pipeline',
~\Anaconda3\envs\ktp_explore\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
347
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
350
351 def call_and_shelve(self, *args, **kwargs):
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
738 with _print_elapsed_time(message_clsname, message):
739 if hasattr(transformer, 'fit_transform'):
--> 740 res = transformer.fit_transform(X, y, **fit_params)
741 else:
742 res = transformer.fit(X, y, **fit_params).transform(X)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
527 self._validate_transformers()
528 self._validate_column_callables(X)
--> 529 self._validate_remainder(X)
530
531 result = self._fit_transform(X, y, _fit_transform_one)
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\compose\_column_transformer.py in _validate_remainder(self, X)
325 cols = []
326 for columns in self._columns:
--> 327 cols.extend(_get_column_indices(X, columns))
328
329 remaining_idx = sorted(set(range(self._n_features)) - set(cols))
~\Anaconda3\envs\ktp_explore\lib\site-packages\sklearn\utils\__init__.py in _get_column_indices(X, key)
425 all_columns = X.columns
426 except AttributeError:
--> 427 raise ValueError("Specifying the columns using strings is only "
428 "supported for pandas DataFrames")
429 if isinstance(key, str):
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
@edgBR (quoting your earlier comment above, where encoding the values first made MAPIE work fine):
Were you able to solve this by not transforming/encoding explicitly before using MAPIE?
Hi @fjpa121197
Edit: no, I have noticed that if I refer to the columns by name in any preprocessing step, I get the same error as you.
Hi @edgBR, thanks for letting me know. I'm waiting for @gmartinonQM; maybe we are referring to the columns in a wrong way.
Thanks anyway!
Hi @edgBR @fjpa121197, indeed, all your problems have the same cause: MAPIE, at some point, requires that the input data is (or is convertible to) a numpy array. When using pipelines based on column names, a bug comes out because the names disappear during the conversion. I have begun a pull request about this: https://github.com/scikit-learn-contrib/MAPIE/pull/136
This fixes the issue for regression (not for classification yet), but breaks scikit-learn estimator compatibility. Some additional work is needed to make all unit tests pass.
Feel free to suggest changes to this pull request. In the meantime, I will try to converge to a solution in the coming weeks.
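A minimal sketch of the root cause described above: once a DataFrame is converted to a numpy array, string column keys can no longer be resolved.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({"cat": ["A", "B"], "num": [1.0, 2.0]})
ct = ColumnTransformer([("ohe", OneHotEncoder(), ["cat"])], remainder="passthrough")
ct.fit_transform(df)             # fine: "cat" resolves against df.columns
ct.fit_transform(df.to_numpy())  # ValueError: specifying the columns using strings
                                 # is only supported for pandas DataFrames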
Hi @fjpa121197
A dirty trick to bypass this:
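A rough sketch of the idea (reusing preprocessor, sample_df, vars_cat and vars_num from the example above): fit the preprocessing step outside of MAPIE, transform the data into a plain numeric array, and give MAPIE only the bare estimator.
import scipy.sparse as sp
from catboost import CatBoostRegressor
from mapie.regression import MapieRegressor
x_raw = sample_df[vars_cat + vars_num]
y = sample_df['target']
# Fit the preprocessing step separately, on the training data only
x_enc = preprocessor.fit_transform(x_raw)
if sp.issparse(x_enc):
    x_enc = x_enc.toarray()  # MAPIE expects a dense numeric array
# MAPIE now only sees numbers, so no string-based column selection is involved
mapie = MapieRegressor(CatBoostRegressor(random_state=123, verbose=0))
mapie.fit(x_enc, y)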
This seems like a point of friction, and I am not sure if pipeline.transform() transforms the data with the right parameters (e.g. with a simple mean imputer, I am not sure whether it transforms with the mean of the training set or refits again).
Anyhow, high hopes for the fix @gmartinonQM.
Hi @edgBR
I did this, and I was able to use MAPIE without a problem. When calling pipeline.transform() on the test set, it should not refit the transformers again.
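A quick way to convince oneself, as a minimal sketch: transform() reuses the statistics learned at fit time; only fit() and fit_transform() update them.
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
imputer.fit([[1.0], [3.0]])            # learned mean is 2.0
print(imputer.transform([[np.nan]]))   # [[2.]] -> training mean, not refitted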
Thanks for helping with this @gmartinonQM, looking forward to the fix!
Good news @fjpa121197 @edgBR, I have managed to resolve all side effects and unit tests. The linked PR will be merged soon, and the bug will disappear in the next MAPIE release.
Thanks @gmartinonQM!
Describe the bug
Dear colleagues, I am creating a system to classify customers into two classes and then apply a regression model to one of the classes.
Some of my features are strings that I obviously need to encode, in this case with one-hot encoding.
To Reproduce
My code is as follows:
Expected behavior
After this, I expect to be able to run:
y_pred, y_pis = mapie_estimator.predict(data_test)
Screenshots
The value in the screenshot is part of one of the categorical columns being encoded by the preprocessor.
When training the model without MAPIE, everything works correctly:
Desktop (please complete the following information):
Scikit learn dependencies: