jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Difference in XGBoost predictions #166

Closed ShalvaBagaturia closed 3 years ago

ShalvaBagaturia commented 3 years ago

Hello.

I am facing the following issue: when I build my model in Python, export it to a PMML file, and then load that PMML file to make predictions, I get different results. Here is an illustration:

import pandas as pd
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import make_pmml_pipeline
import sklearn2pmml
import xgboost as xgb
print(f'xgboost version={xgb.__version__}')
print(f'sklearn2pmml version={sklearn2pmml.__version__}')
print(f'pandas version={pd.__version__}')

temp_model = xgb.XGBRegressor(base_score=0.05, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1.0,
             eval_metric=['poisson-nloglik'], gamma=0.75, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.09, max_delta_step=0.699999988, max_depth=5,
             min_child_weight=8.0, missing=1, monotone_constraints='()',
             n_estimators=343, n_jobs=-1, num_parallel_tree=1,
             objective='count:poisson', random_state=0, reg_alpha=0,
             reg_lambda=0.255, scale_pos_weight=None, subsample=0.8,
             tree_method='approx', validate_parameters=1, verbosity=None)

X = pd.read_csv('x.csv')
y = pd.read_csv('y.csv')
y.name = 'target'

pipe = make_pmml_pipeline(temp_model, active_fields=X.columns.tolist(), target_fields=y.name)

pipe.fit(X, y)

sklearn2pmml.sklearn2pmml(pipe, pmml="./PMML/model.pmml", with_repr=True, debug=False)

xgboost version=1.4.2
sklearn2pmml version=0.74.4
pandas version=1.2.4

from pypmml import Model
model_pmml = Model.fromFile("./PMML/model.pmml")
predict_from_python = pd.DataFrame(pipe.predict(X), columns=['predict_python'])
predict_from_pmml = model_pmml.predict(X)
pd.concat([predict_from_python,predict_from_pmml], axis = 1)

I get different values in predict_from_python and predict_from_pmml.

Why might this happen?

ShalvaBagaturia commented 3 years ago

Here are the files: x.csv, y.csv

vruusmann commented 3 years ago

Can reproduce locally.

Very interesting!

vruusmann commented 3 years ago

Gotcha - it's a missing value issue after all!

In your other issue (https://github.com/jpmml/sklearn2pmml/issues/303) you declare: "No missing data, no sparse / dense problem"

Yet, in your XGBRegressor parameterization you have the following assignment: missing = 1. This assignment means "if a cell of the training data matrix contains the value 1, treat that cell as a missing value instead". In other words, even though you didn't think so, your XGBRegressor was actually fitted with a sparse data matrix (sparse == "contains missing values").
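To make the effect of missing = 1 concrete, here is a minimal pure-Python sketch of the masking it implies. The helpers `mask_sentinel` and `count_missing` and the toy rows are hypothetical and are not part of XGBoost; they only mimic the idea that every cell equal to the declared marker is treated as absent.

```python
import math

def mask_sentinel(rows, sentinel):
    """Return a copy of rows in which every cell equal to sentinel becomes NaN."""
    return [[math.nan if cell == sentinel else cell for cell in row] for row in rows]

def count_missing(rows):
    """Count the NaN cells in a list of rows."""
    return sum(
        1
        for row in rows
        for cell in row
        if isinstance(cell, float) and math.isnan(cell)
    )

# Toy data (made up for illustration): 1 is a legitimate value here,
# while -999 is the marker that was really meant to signal "missing".
rows = [
    [1, 0, 3.5],
    [0, 1, -999],
    [2, 1, 1],
]

masked = mask_sentinel(rows, sentinel=1)
print(count_missing(masked))  # every cell that held 1 now counts as missing
```

With a marker of 1, four legitimate cells are masked while the -999 cell is kept as an ordinary number, which is the opposite of what the data intends.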

If you delete this missing = 1 assignment, then the PMML side makes correct predictions.
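One way to confirm that the two sides agree after such a change is to compare the prediction vectors within a small tolerance rather than checking for exact equality. The sketch below is an assumption-laden illustration: `find_mismatches` is a hypothetical helper, the sample numbers are made up rather than taken from this issue, and the relative tolerance of 1e-5 is a guess suited to float32-precision outputs.

```python
import math

def find_mismatches(expected, actual, rel_tol=1e-5, abs_tol=1e-9):
    """Return (index, expected, actual) for every pair that differs beyond the tolerance."""
    if len(expected) != len(actual):
        raise ValueError("prediction vectors must have the same length")
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
    ]

# Made-up numbers standing in for pipe.predict(X) and the PMML-side predictions.
python_predictions = [0.1200001, 0.5, 2.25]
pmml_predictions = [0.12, 0.5, 2.75]

mismatches = find_mismatches(python_predictions, pmml_predictions)
print(mismatches)
```

An empty list would indicate that the Python and PMML predictions agree up to rounding noise; any remaining entries point at the rows worth inspecting.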

BTW, I'd suggest switching from PyPMML to JPMML-Evaluator-Python.

ShalvaBagaturia commented 3 years ago

Appreciate!

vruusmann commented 3 years ago

Looking at your data matrix (the x.csv file), I get the impression that missing values are encoded as -999.

If so, then you should be using the following configuration instead: XGBRegressor(missing = -999).
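Before settling on a marker, it may help to count how often each candidate value occurs per column. The following is a rough sketch using only the standard library; `count_marker_per_column` is a hypothetical helper, and the column names and sample CSV text are invented for illustration (they are not the contents of x.csv).

```python
import csv
import io
from collections import Counter

def count_marker_per_column(csv_text, marker):
    """Count, per column, the cells whose numeric value equals marker."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            try:
                number = float(value)
            except (TypeError, ValueError):
                continue  # skip non-numeric or absent cells
            if number == marker:
                counts[column] += 1
    return dict(counts)

# Invented sample: -999 marks missing cells, 1 is an ordinary flag value.
sample = "age,income,flag\n34,-999,1\n-999,52000,0\n41,-999,1\n"

counts_999 = count_marker_per_column(sample, -999)
counts_1 = count_marker_per_column(sample, 1)
print(counts_999)
print(counts_1)
```

If -999 shows up across several feature columns while 1 appears only where it is a meaningful value, that supports declaring -999, not 1, as the missing marker.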

You may subscribe to the newly opened issue (#167) to be notified when support for the missing attribute gets implemented. I believe it should happen sometime this week already.

ShalvaBagaturia commented 3 years ago

Thanks a lot for the upcoming update

vruusmann commented 2 years ago

See https://openscoring.io/blog/2022/04/12/onehot_encoding_sklearn_xgboost_pipeline/