jpmml / jpmml-xgboost

Java library and command-line application for converting XGBoost models to PMML
GNU Affero General Public License v3.0

Predictions not matching with PMML and XGBoost Model #75

Closed Shashwat3011 closed 1 month ago

Shashwat3011 commented 1 month ago

I am creating my XGBoost Model using code:

from xgboost import XGBClassifier

xgb = XGBClassifier(objective = 'binary:logistic',
                    booster = 'gbtree',
                    learning_rate = 0.1,
                    subsample = 0.8,
                    random_state = 42,
                    colsample_bytree = 0.8,
                    gamma = 0.5,
                    max_depth = 6,
                    min_child_weight = 4,
                    n_estimators = 130,
                    missing = -9999.01)

xgb.fit(x_train.copy(), y_train, sample_weight = sample_weights)

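Then I define and fit the same model inside a PMMLPipeline:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
("classifier", XGBClassifier(objective = 'binary:logistic',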
                    booster = 'gbtree',
                    learning_rate = 0.1,
                    subsample = 0.8,
                    random_state = 42,
                    colsample_bytree = 0.8,
                    gamma = 0.5,
                    max_depth = 6,
                    min_child_weight = 4,
                    n_estimators = 130,
                    missing = -9999.01))
])

pipeline.fit(x_train.copy(), y_train, classifier__sample_weight = sample_weights)

Then I export the PMML file:

sklearn2pmml(pipeline, "pmmlfile.pmml")

But when I am trying to check predictions on same data using xgb and pmml file, I am getting different predictions.

from sklearn_pmml_model.ensemble import PMMLGradientBoostingClassifier
xgb_PMML = PMMLGradientBoostingClassifier(pmml = 'pmmlfile.pmml')

xgb_pred_train_pmml_file = xgb_PMML.predict_proba(x_train.copy())[:,1]
xgb_pred_train_pmml_file = trunc(xgb_pred_train_pmml_file, decs=12)

xgb_pred_train_pmml_file

array([0.00743602, 0.00187997, 0.04121913, ..., 0.00482244, 0.0439931 ,
       0.09430512])

xgb_pred_train_model_object = xgb.predict_proba(x_train.copy())[:,1]
xgb_pred_train_model_object = trunc(xgb_pred_train_model_object, decs=12)

xgb_pred_train_model_object

array([0.00410176, 0.00098528, 0.00278918, ..., 0.00216036, 0.00171637,
       0.00270639])

vruusmann commented 1 month ago

In the general case, you can pass xgboost.sklearn.XGBClassifier objects directly to the sklearn2pmml.sklearn2pmml() utility function. There is no need to create a (PMML)Pipeline wrapper around them:

xgb = XGBClassifier(...)
xgb.fit(X, y, sample_weight = weights)

sklearn2pmml(xgb, "XGBClassifier.pmml")

You bring in the PMMLPipeline wrapper when there is something PMML-specific going on. For example, when you want to configure the PMML markup in a specific way, embed a model verification dataset, etc.

In the current case, where the objective is to import the generated PMML document back into a Scikit-Learn environment, you should disable all PMML optimizations such as decision tree compaction and flattening:

pipeline = PMMLPipeline([
  ("classifier", XGBClassifier(...))
])
pipeline.fit(X, y)
# THIS!
pipeline.configure(compact = False)

sklearn2pmml(pipeline, "XGBClassifier-default.pmml")

Do the test: compare the "XGBClassifier.pmml" and "XGBClassifier-default.pmml" files with each other and see how they differ (e.g. the former is half the size of the latter).
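
The file comparison suggested above can be sketched with the Python standard library alone. This is only an illustration: the helper name summarize_pmml is hypothetical, and counting Node elements is an assumed rough proxy for tree structure, not something either package reports.

```python
import os
import xml.etree.ElementTree as ET


def summarize_pmml(path):
    """Return the file size in bytes and the number of tree Node elements."""
    root = ET.parse(path).getroot()
    # Tags are namespace-qualified, e.g. "{http://www.dmg.org/PMML-4_4}Node",
    # so compare only the local part of each tag name.
    nodes = sum(1 for elem in root.iter() if elem.tag.rsplit("}", 1)[-1] == "Node")
    return {"bytes": os.path.getsize(path), "nodes": nodes}


if __name__ == "__main__":
    # File names taken from the comment above; skipped if they do not exist.
    for name in ("XGBClassifier.pmml", "XGBClassifier-default.pmml"):
        if os.path.exists(name):
            print(name, summarize_pmml(name))
```

Comparing the two summaries side by side shows whether the compacted and non-compacted exports differ in size and in node count.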

> But when I am trying to check predictions on same data using xgb and pmml file, I am getting different predictions.

There are two places where the error might occur in your workflow:

  1. The PMML producer (here: the SkLearn2PMML package) is generating PMML markup incorrectly.
  2. The PMML consumer (here: the SkLearn-PMML-Model package) is parsing/interpreting PMML markup incorrectly.

Right now, you appear to be thinking that the first component (ie. the SkLearn2PMML package) is at fault. However, my experience tells me that this is not so, and you should be looking at the second component (ie. the SkLearn-PMML-Model package) instead.
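
The producer-versus-consumer reasoning above can be written down as a small check. This is a sketch under stated assumptions: the function names and the 1e-6 tolerance are illustrative and not part of any of the packages discussed, and the three inputs are plain lists of predicted probabilities for the same rows.

```python
def max_abs_diff(reference, candidate):
    """Largest absolute difference between two equally long prediction lists."""
    if len(reference) != len(candidate):
        raise ValueError("prediction lists must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def locate_fault(native, trusted_consumer, suspect_consumer, atol=1e-6):
    """Guess which side of the PMML round trip introduces a mismatch.

    native: predictions from the original model object
    trusted_consumer: predictions from a reference PMML evaluator
    suspect_consumer: predictions from the PMML importer under test
    """
    if max_abs_diff(native, trusted_consumer) > atol:
        # Even the reference evaluator disagrees with the model,
        # so the generated PMML (or the reference evaluator) is suspect.
        return "producer or reference evaluator"
    if max_abs_diff(native, suspect_consumer) > atol:
        # The PMML scores correctly elsewhere, so the importer is suspect.
        return "consumer"
    return "none"


if __name__ == "__main__":
    native = [0.00410176, 0.00098528, 0.00278918]
    suspect = [0.00743602, 0.00187997, 0.04121913]
    print(locate_fault(native, native, suspect))
```

The sample values are the first three predictions posted at the top of this thread. The trusted consumer is set equal to the native predictions purely for illustration.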

You can verify my intuition very easily, by scoring the PMML document using the JPMML-Evaluator-Python package. Don't use any alternative Python PMML evaluators for this important debugging job, because they are all inferior in comparison to JPMML-Evaluator(-Python).

vruusmann commented 1 month ago

> from sklearn_pmml_model.ensemble import PMMLGradientBoostingClassifier

TL;DR: The source of the misprediction resides inside the SkLearn-PMML-Model package.

The PMMLGradientBoostingClassifier class is meant to be interoperable with Scikit-Learn's GradientBoostingClassifier class. It may or may not be interoperable with other GBDT models such as XGBoost or LightGBM.

The import from PMML is much harder when the PMML markup has been optimized. Hence you should follow my suggestion from above, and disable compaction.

Please re-raise your issue with the SkLearn-PMML-Model project. The JPMML part of the workflow is correct.

Shashwat3011 commented 1 month ago

Hey, thanks for answering.

I rechecked. First, I exported the xgb classifier to PMML directly:

sklearn2pmml(xgb, "pmml_file.pmml")

Got Error:

TypeError                                 Traceback (most recent call last)
Input In [206], in <cell line: 1>()
----> 1 sklearn2pmml(xgb, "./Final QA Artifacts/pmml_file.pmml")

File ~/SageMaker/conda-awxsight/miniconda/envs/awxsight/lib/python3.10/site-packages/sklearn2pmml/__init__.py:249, in sklearn2pmml(pipeline, pmml, with_repr, java_home, java_opts, user_classpath, debug)
    247     print("{0}: {1}".format(java_version[0], java_version[1]))
    248 if not isinstance(pipeline, PMMLPipeline):
--> 249     raise TypeError("The pipeline object is not an instance of {0}. Use the 'sklearn2pmml.make_pmml_pipeline(obj)' utility function to translate a regular Scikit-Learn pipeline or estimator to a PMML pipeline".format(PMMLPipeline.__name__))
    250 if with_repr:
    251     pipeline.repr_ = repr(pipeline)

TypeError: The pipeline object is not an instance of PMMLPipeline. Use the 'sklearn2pmml.make_pmml_pipeline(obj)' utility function to translate a regular Scikit-Learn pipeline or estimator to a PMML pipeline

And I can see that in the code you suggested above,

pipeline = PMMLPipeline([
  ("classifier", XGBClassifier(...))
])
pipeline.fit(X, y)
# THIS!
pipeline.configure(compact = False)

sklearn2pmml(pipeline, "XGBClassifier-default.pmml")

You are not using sample_weights parameter in Pipeline. Anyway, I checked with the sample weights parameter as well, and I am still getting different predictions.

vruusmann commented 1 month ago

> Got Error:

You are using some outdated SkLearn2PMML package version. Please upgrade to the latest!

I'd say that it's elementary to update/upgrade to the latest before reporting any issues to a software project. Otherwise, you're just wasting everybody's time.
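
One way to confirm which versions are actually installed before reporting is sketched below, using only the standard library. The helper name installed_version is hypothetical, and the distribution names in the list are assumed to match the names the packages are published under on PyPI.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed distribution names for the packages mentioned in this thread.
    for name in ("sklearn2pmml", "xgboost", "scikit-learn",
                 "sklearn-pmml-model", "jpmml_evaluator"):
        print(name, installed_version(name))
```

An outdated package can then be upgraded with `pip install --upgrade sklearn2pmml`, and the printed versions can be pasted into the issue report.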

> You are not using sample_weights parameter in Pipeline,

I was just giving you an idea how to disable PMML optimizations using the PMMLPipeline.configure() method. I thought you'd be able to complete the rest of the exercise independently.

Shashwat3011 commented 1 month ago

Sorry about that. I upgraded the library as well, but I can still see that the predictions are different. Just to add:

xgb = XGBClassifier(objective = 'binary:logistic',
                    booster = 'gbtree',
                    learning_rate = 0.1,
                    subsample = 0.8,
                    random_state = 42,
                    colsample_bytree = 0.8,
                    gamma = 0.5,
                    max_depth = 6,
                    min_child_weight = 4,
                    n_estimators = 130,
                    missing = -9999.01)

xgb.fit(x_train.copy(), y_train, sample_weight = sample_weights)

pipeline = PMMLPipeline([
("classifier", XGBClassifier (  objective = 'binary:logistic',
                    booster = 'gbtree',
                    learning_rate = 0.1,
                    subsample = 0.8,
                    random_state = 42,
                    colsample_bytree = 0.8,
                    gamma = 0.5,
                    max_depth = 6,
                    min_child_weight = 4,
                    n_estimators = 130,
                    missing = -9999.01))
])

pipeline.fit(x_train.copy(), y_train, classifier__sample_weight = sample_weights)

Predictions from xgb and the pipeline are matching, but PMML is not getting properly exported using sklearn2PMML.

Sorry for bugging you again.

vruusmann commented 1 month ago

> .. but PMML is not getting properly exported using sklearn2PMML

What does it mean? Can you be more specific?

I asked you to verify whether the XGBoost/Scikit-Learn predictions match the JPMML-Evaluator-Python predictions. Do they match or not?

Shashwat3011 commented 1 month ago

Yes I checked the prediction from the Model and JPMML evaluator. They are matching

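import numpy as np
from jpmml_evaluator import make_evaluator

# The original snippet did not show how the evaluator was created.
# Assumption: it loads the PMML file exported earlier, using the
# make_evaluator(path) form of recent JPMML-Evaluator-Python versions.
evaluator = make_evaluator("pmmlfile.pmml").verify()

input_data = x_train.to_dict(orient='records')

pmml_predictions = [evaluator.evaluate(record)['probability(1)'] for record in input_data]
pmml_predictions = np.array(pmml_predictions)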

import xgboost as xgb
xgb_model = xgb.XGBClassifier()
xgb_model.load_model('Model Object.json')

xgb_predictions = xgb_model.predict_proba(x_train)[:, 1]

xgb_predictions = np.array(xgb_predictions)
comparison = np.allclose(pmml_predictions, xgb_predictions, atol=1e-6)
print("Predictions match:", comparison)
Predictions match: True
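
For reference, the element-wise rule that np.allclose applies is |a - b| <= atol + rtol * |b|, with rtol defaulting to 1e-5. The pure-Python sketch below mirrors that rule for flat lists; the function name is illustrative, and unlike NumPy it does no broadcasting and simply requires equal lengths.

```python
def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Pure-Python version of the np.allclose rule for two flat sequences."""
    if len(a) != len(b):
        return False
    # Each pair must satisfy |x - y| <= atol + rtol * |y|.
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


if __name__ == "__main__":
    model = [0.00410176, 0.00098528]
    mismatched = [0.00743602, 0.00187997]
    print(allclose(model, model, atol=1e-6))
    print(allclose(mismatched, model, atol=1e-6))
```

With probabilities of this magnitude, the differences reported at the top of the thread (around 0.003) are far larger than the combined tolerance, so a passing check is a meaningful result.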

vruusmann commented 1 month ago

> Yes I checked the prediction from the Model and JPMML evaluator. They are matching

There you go - it means that the PMML generation part is correct, and you must search for the error in the PMML-to-SkLearn reverse translation part.

I can see that you've already opened a new issue: https://github.com/iamDecode/sklearn-pmml-model/issues/58

Shashwat3011 commented 1 month ago

Yes, thank you for the help, Sir. I really appreciate your quick responses. Thanks again!