jpmml / jpmml-xgboost

Java library and command-line application for converting XGBoost models to PMML
GNU Affero General Public License v3.0

PyPMML is making incorrect predictions with XGBoost PMML models #76

Closed neham764 closed 5 days ago

neham764 commented 5 days ago

xgb params:

params = {
    'colsample_bytree': 0.73,
    'learning_rate': 0.09,
    'subsample': 0.8,
    'min_child_weight': 50.0,
    'max_depth': 4,    # must be an integer
    'max_leaves': 15   # must be an integer
}

model_test = xgb.XGBClassifier(objective='binary:logistic',
                               eval_metric='auc',
                               grow_policy='lossguide',
                               tree_method='hist',
                               max_depth=4,
                               max_leaves=15,
                               random_state=101,
                               #n_estimators=2000,
                               #early_stopping_rounds=100,
                               colsample_bytree=0.73,
                               learning_rate=0.09,
                               subsample=0.8,
                               min_child_weight=50,
                               n_estimators=112,
                               **monotone  # dict holding the monotone_constraints (defined elsewhere)
                               )
eval_set = [(X_test1,y_test1)]

### PMML file generation
# Fit the model
pipeline = PMMLPipeline([
    ('final_model', model_test)])

pipeline.fit(X_train,y_train)

from sklearn2pmml import sklearn2pmml
from pypmml import Model

## Save the model object to PMML
sklearn2pmml(pipeline, "dsp_model_pmml.pmml", with_repr = True)

pmml = Model.load('dsp_model_pmml.pmml')

## Score with the PMML object
train2['pmml_scores_4_4'] = list(pmml.predict(train2[features])['probability(1)'])

train2["orig_score"] = model_test.predict_proba(train2[features])[:, 1]

I am seeing a mismatch between the scores generated by XGBoost and by the PMML model. Summary statistics of the per-row difference:

count    2.558100e+04
mean    -1.196365e-04
std      1.183371e-03
min     -3.681402e-02
25%     -3.511670e-08
50%     -5.165062e-09
75%      2.265779e-08
max      9.465751e-03
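The summary above looks like the output of pandas `Series.describe()` on the per-row score difference. A minimal, self-contained sketch of that comparison (with synthetic scores, since the original data isn't available):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two score columns from the thread; the real
# 'pmml_scores_4_4' and 'orig_score' come from PyPMML and XGBoost respectively.
rng = np.random.default_rng(101)
orig_score = pd.Series(rng.uniform(0.0, 1.0, size=1000))
pmml_score = orig_score + pd.Series(rng.normal(0.0, 1e-3, size=1000))

# Per-row difference between the two scorers; a correct converter + evaluator
# pair should leave this at (near-)zero, within float32 rounding noise.
diff = pmml_score - orig_score
print(diff.describe())
```

Differences on the order of 1e-8 (the quartiles above) are expected float32 rounding; the fat tails out to ~1e-2 (min/max) are what indicates a real evaluation problem.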

The PMML file was generated from exactly the same XGBoost configuration:

        <Extension name="repr">PMMLPipeline(steps=[('final_model', XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.73, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='auc', feature_types=None,
              gamma=None, grow_policy='lossguide', importance_type=None,
              interaction_constraints=None, learning_rate=0.09, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=4, max_leaves=15,
              min_child_weight=50, missing=nan,
              monotone_constraints='(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,-1)',
              multi_strategy=None, n_estimators=112, n_jobs=None,
              num_parallel_tree=None, random_state=101, ...))])</Extension>
vruusmann commented 5 days ago

> pmml = Model.load('dsp_model_pmml.pmml')
>
> Is this a known issue?

Yes, it is a very well-known issue that the PyPMML package makes incorrect predictions.

Please switch to the JPMML-Evaluator-Python package, and all predictions will be correct all the time.

neham764 commented 4 days ago

Thanks for your prompt reply. Is there a fix expected regarding this soon?

vruusmann commented 3 days ago

> Is there a fix expected regarding this soon?

I'm not affiliated with the PyPMML package in any way.

I'm responsible for the XGBoost-to-PMML converter, and that part of the workflow is working correctly.

neham764 commented 3 days ago

Thanks, the JPMML evaluator worked.