Annanapan commented 6 months ago

When converting the pipeline to ONNX, I get the following error:

Unable to find a shape calculator for type '<class 'xgboost.sklearn.XGBClassifier'>'. It usually means the pipeline being converted contains a transformer or a predictor with no corresponding converter implemented in sklearn-onnx. If the converted is implemented in another library, you need to register the converted so that it can be used by sklearn-onnx (function update_registered_converter). If the model is not yet covered by sklearn-onnx, you may raise an issue to https://github.com/onnx/sklearn-onnx/issues to get the converter implemented or even contribute to the project. If the model is a custom model, a new converter must be implemented. Examples can be found in the gallery.

The pipeline contains a preprocessor and an XGBoost classifier. I created it as follows:

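import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import (
    FloatTensorType, Int64TensorType, StringTensorType)

num_features = X.select_dtypes(include=['int64', 'float64']).columns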
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')), 
    ('scaler', StandardScaler())])

cat_features = X.select_dtypes(include=['object', 'category']).columns
cat_transformer = Pipeline(steps=[
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    ('imputer', SimpleImputer(strategy='constant', fill_value=-1)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ])

model_xgbt = xgb.XGBClassifier(
    booster='gbtree',
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=5,
    scale_pos_weight=4,
    random_state=42
)

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', model_xgbt)])
pipeline.fit(X_train, y_train)

initial_type = [
    ('age', FloatTensorType([None, 1])),
    ('zipCode', StringTensorType([None, 1])),
    ('es', StringTensorType([None, 1])),
    ('ec', StringTensorType([None, 1])),
    ('oc', StringTensorType([None, 1])),
    ('income', StringTensorType([None, 1])),
    ('nw', StringTensorType([None, 1])),
    ('ie', StringTensorType([None, 1])),
    ('irt', StringTensorType([None, 1])),
    ('ig', StringTensorType([None, 1])),
    ('r', StringTensorType([None, 1])),
    ('tr', StringTensorType([None, 1])), 
    ('st', Int64TensorType([None, 1]))
]

onnx_model = convert_sklearn(pipeline, initial_types=initial_type)

with open("pipeline_model_xgbt.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

The error occurs during conversion. As far as I can tell, all the steps in the preprocessor are supported by sklearn-onnx. Could it be that ONNX cannot handle complex transformers?

Versions: skl2onnx 1.16.0, sklearn 1.4.0, Python 3.11.7
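The initial_type list in this comment is written by hand, while the numeric and categorical features are selected by dtype, so the two can drift apart. A minimal sketch of deriving the declarations from the DataFrame dtypes instead is shown here. The helper names (tensor_kind, schema_from_dtypes, to_initial_types) are hypothetical, and the rule "int64 stays int64, floats become float32 tensors, everything else is a string" is an assumption that may need adjusting for a given dataset.

```python
def tensor_kind(dtype_name):
    """Map a pandas dtype name to the kind of tensor to declare (assumed rule)."""
    if dtype_name == "int64":
        return "int64"
    if dtype_name in ("float32", "float64"):
        return "float"
    # object, category and anything else are declared as strings
    return "string"


def schema_from_dtypes(dtypes):
    """Build (column, kind) pairs from a mapping of column name to dtype name.

    With pandas, the mapping could be built as:
        {col: str(dtype) for col, dtype in X.dtypes.items()}
    """
    return [(col, tensor_kind(name)) for col, name in dtypes.items()]


def to_initial_types(schema):
    """Turn (column, kind) pairs into skl2onnx initial types.

    skl2onnx is imported lazily so the helpers above work without it.
    """
    from skl2onnx.common.data_types import (
        FloatTensorType, Int64TensorType, StringTensorType)

    classes = {
        "float": FloatTensorType,
        "int64": Int64TensorType,
        "string": StringTensorType,
    }
    # One column per input, with a free batch dimension
    return [(col, classes[kind]([None, 1])) for col, kind in schema]
```

The result of to_initial_types(schema_from_dtypes(...)) would then be passed as initial_types to convert_sklearn in place of the hand-written list.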

Annanapan commented 6 months ago

I added:

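from xgboost import XGBClassifier
from skl2onnx import update_registered_converter
from skl2onnx.common.shape_calculator import calculate_linear_classifier_output_shapes
from onnxmltools.convert.xgboost.operator_converters.XGBoost import convert_xgboost

update_registered_converter(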
    XGBClassifier,
    "XGBoostClassifier",
    calculate_linear_classifier_output_shapes,
    convert_xgboost,
    options={"nocl": [True, False], "zipmap": [True, False, "columns"]}
)

and the error disappears, but the ONNX predictions differ from the predictions of the original pipeline.
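To tell a real conversion problem from small numerical noise, it can help to measure how far apart the two sets of predictions are. The converted model computes in float32 by default, so tiny differences in probabilities are to be expected. The sketch below is only an illustration: max_abs_diff and label_mismatches are hypothetical helpers, and they assume the probabilities are available as 2-D array-likes (nested lists or NumPy arrays), for example when the model was converted with zipmap disabled.

```python
def max_abs_diff(expected, actual):
    """Largest absolute element-wise difference between two 2-D array-likes."""
    if len(expected) != len(actual):
        raise ValueError("row counts differ")
    worst = 0.0
    for row_e, row_a in zip(expected, actual):
        if len(row_e) != len(row_a):
            raise ValueError("column counts differ")
        for e, a in zip(row_e, row_a):
            worst = max(worst, abs(float(e) - float(a)))
    return worst


def label_mismatches(expected, actual):
    """Indices of the rows whose predicted labels disagree."""
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


# Hypothetical usage, with onnx_labels and onnx_proba taken from the ONNX run:
#   max_abs_diff(pipeline.predict_proba(X_test), onnx_proba)
#   label_mismatches(pipeline.predict(X_test), onnx_labels)
```

A very small maximum difference with no label mismatches would point to float32 rounding. Large differences or many mismatched labels would rather suggest a problem with the inputs or with how the converter was registered.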

Annanapan commented 6 months ago

Is this the right way to prepare the input for prediction using onnx?

# Prepare X_test_inputs for the ONNX model
import numpy as np

X_test_inputs = {c: X_test[c].values for c in X_test.columns}

for c in num_features:
    v = X_test[c].dtype
    if v == "float64":
        X_test_inputs[c] = X_test_inputs[c].astype(np.float32)
for k in X_test_inputs:
    X_test_inputs[k] = X_test_inputs[k].reshape((X_test_inputs[k].shape[0], 1))
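One way to sanity-check an input dictionary like this is to compare the dtype of each array with the type the ONNX model declares for that input. The following is a minimal sketch under assumptions: find_mismatches and ort_type_for are hypothetical helpers, and the dtype table only covers the types that appear in this thread.

```python
# NumPy dtype names mapped to the type strings ONNX Runtime reports for inputs
NUMPY_TO_ORT = {
    "float32": "tensor(float)",
    "float64": "tensor(double)",
    "int64": "tensor(int64)",
    "object": "tensor(string)",
}


def ort_type_for(dtype_name):
    """Translate a NumPy dtype name into an ONNX Runtime type string."""
    if dtype_name.startswith("str"):  # fixed-width unicode arrays, e.g. 'str96'
        return "tensor(string)"
    return NUMPY_TO_ORT.get(dtype_name, f"unsupported({dtype_name})")


def find_mismatches(expected, actual_dtypes):
    """List inputs that are missing or whose dtype differs from the model's.

    expected:      input name -> ONNX Runtime type string
    actual_dtypes: input name -> NumPy dtype name of the array to be fed
    """
    problems = []
    for name, want in expected.items():
        if name not in actual_dtypes:
            problems.append(f"{name}: missing from inputs")
            continue
        got = ort_type_for(actual_dtypes[name])
        if got != want:
            problems.append(f"{name}: model expects {want}, got {got}")
    return problems


# Hypothetical usage with onnxruntime (not executed here):
#   import onnxruntime as ort
#   sess = ort.InferenceSession("pipeline_model_xgbt.onnx",
#                               providers=["CPUExecutionProvider"])
#   expected = {i.name: i.type for i in sess.get_inputs()}
#   actual = {k: v.dtype.name for k, v in X_test_inputs.items()}
#   print(find_mismatches(expected, actual))
```

An empty list would mean every declared input is present with a matching dtype. It says nothing about whether the values themselves are correct.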

xadupre commented 5 months ago

You should follow this tutorial to register a XGB model: https://onnx.ai/sklearn-onnx/auto_tutorial/plot_gexternal_xgboost.html.

onnx / sklearn-onnx

MissingShapeCalculator: Unable to find a shape calculator for type '<class 'xgboost.sklearn.XGBClassifier'>'. #1076