onnx / onnxmltools

ONNXMLTools enables conversion of models to ONNX
https://onnx.ai
Apache License 2.0

convert_xgboost gives wrong values if using early_stopping #517

Open rgreen1995 opened 3 years ago

rgreen1995 commented 3 years ago

Hi, I'm trying to convert an XGBClassifier to ONNX. I noticed that if I set a large n_estimators and use the early_stopping_rounds argument, then convert the model to ONNX, load it and run it, I get incorrect probabilities. I've attached an example below.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from onnxmltools.convert.common import data_types
from onnxmltools.convert import convert_xgboost
import onnxruntime as onnx_rt

X, y = make_classification(n_samples=1000, n_classes=6, n_clusters_per_class=1, n_informative = 10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_eval, X_test, y_eval, y_test = train_test_split(X_test, y_test)

early_stopping_model = XGBClassifier(n_estimators=1000)

early_stopping_model.fit(
    X_train,
    y_train,
    eval_set=[(X_eval, y_eval)],
    eval_metric="mlogloss",
    early_stopping_rounds=20,
    verbose=False,
)

initial_type = [('float_input', data_types.FloatTensorType([None, X_train.shape[1]]))]
model_onnx = convert_xgboost(early_stopping_model, initial_types=initial_type)
with open("test.onnx", "wb") as f:
    f.write(model_onnx.SerializeToString())

sess = onnx_rt.InferenceSession("test.onnx")
input_name = sess.get_inputs()[0].name
probs = sess.get_outputs()[1].name  # name of the second output, which holds the class probabilities
pred_onx_early = sess.run(None, {input_name: X_train[0:1].tolist()})

print(f'Onnx: {pred_onx_early[1][0]}')
print(f'Original: {early_stopping_model.predict_proba(X_train[0:1]).tolist()[0]}')

This returns

Onnx: [0.00427639 0.5        0.5        0.5        0.5        0.5       ] # doesn't even add up to 1!!
Original: [0.0010141434613615274, 0.0030910135246813297, 0.993025541305542, 0.00016845663776621222, 0.001386264804750681, 0.0013146025594323874]
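To make the mismatch explicit, here is a minimal sketch of a sanity check on the two rows printed above. It uses only the standard library; the helper name `compare_probabilities` and the tolerance `atol=1e-5` are my own choices for illustration and are not part of onnxmltools or XGBoost.

```python
import math


def compare_probabilities(onnx_probs, reference_probs, atol=1e-5):
    """Compare two probability vectors for a single sample.

    Returns (sums_to_one, max_abs_diff, matches), where sums_to_one refers
    to onnx_probs and matches means every entry agrees within atol.
    """
    if len(onnx_probs) != len(reference_probs):
        raise ValueError("probability vectors must have the same length")
    # A valid probability vector should sum to 1 (within tolerance).
    sums_to_one = math.isclose(sum(onnx_probs), 1.0, abs_tol=atol)
    # Largest per-class disagreement between the two models.
    max_abs_diff = max(abs(a - b) for a, b in zip(onnx_probs, reference_probs))
    return sums_to_one, max_abs_diff, max_abs_diff <= atol


# Values copied from the output above.
onnx_row = [0.00427639, 0.5, 0.5, 0.5, 0.5, 0.5]
xgb_row = [
    0.0010141434613615274,
    0.0030910135246813297,
    0.993025541305542,
    0.00016845663776621222,
    0.001386264804750681,
    0.0013146025594323874,
]

sums_to_one, max_abs_diff, matches = compare_probabilities(onnx_row, xgb_row)
print(f"ONNX row sums to 1: {sums_to_one}")
print(f"Largest per-class difference: {max_abs_diff:.4f}")
print(f"Rows match within tolerance: {matches}")
```

The same helper could be run over every row of the test set to see whether the disagreement affects all samples or only some.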

I found that either removing early_stopping_rounds or making n_estimators smaller (small enough, I think, that the stopping condition never triggers) fixes the problem, and the ONNX model probabilities are then the same as the original model's.

Is this a bug, or am I missing something?

xiaohk commented 2 years ago

I am facing the same issue. For me, the ONNX model's performance is way worse than the original XGBoost model even if I am not using early stopping.

Did you find any solution?