jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Effectively debugging XGBoost pipelines (mis-matching predictions between Python and (J)PMML) #394

Closed jhaneyrf closed 11 months ago

jhaneyrf commented 1 year ago

I'm going to keep this brief because I'm not sure what the best things for you to look at are, but I promise to respond quickly to your requests.

I've built an XGBoost model and created a PMMLPipeline object as well as a pmml file:

PMML_PIPELINE = PMMLPipeline(steps=[
    ('preprocessor', PREPROCESSOR), ('classifier', MODEL)])
sklearn2pmml(PMML_PIPELINE, "pmml_file.pmml")

I then used jpmml_evaluator to load the model from pmml_file.pmml

PMML_MODEL = jpmml_evaluator.make_evaluator("pmml_file.pmml", locatable=True)

The outputs of the following two commands only match about 85% of the time:

PMML_PIPELINE.predict_proba(test_data)
PMML_MODEL.predict(test_data)

I can't figure out why the other 15% don't match.

Do you have any suggestions how I should go about debugging this?

vruusmann commented 1 year ago

Do you have any suggestions how I should go about debugging this?

Simple - you should vary one component at a time, and see if things improve or not.

For starters, replace the XGBoost estimator with some other estimator, such as a LightGBM estimator or a native Scikit-Learn estimator (eg. GradientBoostingClassifier, GradientBoostingRegressor).

My educated guess is that there's something wrong with your XGBoost estimator's configuration (already on the Python side of the equation). So, if you use some other - less error-prone - estimator type such as LightGBM, the results should be correct. If the results are still off, then we can look into the pre-processing part of the pipeline.
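
For example, a minimal swap could look like this (just a sketch - train_data and train_labels stand in for whatever training dataset you used for the original pipeline):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Same pre-processing, different final estimator
alt_pipeline = PMMLPipeline(steps=[
    ('preprocessor', PREPROCESSOR), ('classifier', GradientBoostingClassifier())])
alt_pipeline.fit(train_data, train_labels)
sklearn2pmml(alt_pipeline, "alt_pmml_file.pmml")
# Now repeat the same Python-vs-(J)PMML comparison against alt_pmml_file.pmml;
# if the mismatch disappears, the problem is specific to the XGBoost step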

By any chance, are you using sklearn.preprocessing.OneHotEncoder with XGBoost estimator?

Also, does your test_data contain categorical features? How about missing feature values?

vruusmann commented 1 year ago

Just to be clear - I believe that the converter has been producing a correct PMML document, and that the JPMML-Evaluator library is interpreting its markup correctly.

The trouble is somewhere in the Python code. Your pipeline is not doing what you think it's doing.

vruusmann commented 1 year ago

By any chance, are you using sklearn.preprocessing.OneHotEncoder with XGBoost estimator?

Interesting idea - the SkLearn2PMML/JPMML-SkLearn converter should raise an error if it encounters a OneHotEncoder plus XGBClassifier/XGBRegressor combination. These two together perform non-sensical computations, which cannot be reproduced in sane environments.

jhaneyrf commented 1 year ago

I have two categorical inputs and the rest are numeric.

I do not use OneHotEncoder for my categorical inputs. I converted the categorical variables to numeric with dictionaries and LookupTransformer. In retrospect, the custom algorithm I used is similar to sklearn's TargetEncoder.

So PREPROCESSOR is a ColumnTransformer that passes through all the numerics and applies LookupTransformers to the categoricals.
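
Roughly like this (illustrative only - the real column names and mapping dictionaries are different):

from sklearn.compose import ColumnTransformer
from sklearn2pmml.preprocessing import LookupTransformer

CAT1_DICT = {"Level 0": 0.0, "Level 1": 1.0, "Level 2": 2.0, "Level 3": 3.0}
CAT2_DICT = {"A": 0.0, "B": 1.0}

PREPROCESSOR = ColumnTransformer(transformers=[
    ('cat1', LookupTransformer(CAT1_DICT, default_value=3.0), 'cat_feature_1'),
    ('cat2', LookupTransformer(CAT2_DICT, default_value=0.0), 'cat_feature_2')
], remainder='passthrough')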

PREPROCESSOR.transform(test_data) produces accurate output.

My XGBoost model does indeed have missing values. All I can say is that the presence or absence of missing values in a particular row does not appear to cleanly separate matching scores from mismatched scores.

I'll try your suggestion of fitting a LightGBM model instead, but I probably won't be able to get to that until Monday. Thanks for your help!

vruusmann commented 1 year ago

My XGBoost model does indeed have missing values.

Also, how are the missing values represented in test_data? Are they None for non-numerics, and numpy.NaN for numerics?
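
For instance, on the Python side a single test row with missing values would ideally look like this (a sketch, with made-up column names):

import numpy
import pandas

row = pandas.DataFrame([{
  "num_feature_1" : numpy.nan, # missing numeric value
  "cat_feature_1" : None # missing categorical/string value
}])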

By any chance, does the test dataset contain boolean or boolean-like string features? The PMML converter is generating true and false constants (starts with lowercase letter), but Python may be operating with True and False (starts with uppercase letter).

If the number of features is not too high, then I'd also suggest opening the PMML document in text editor and simply taking a look at the contents of /PMML/DataDictionary/DataField elements. Are all feature descriptions (in the form of DataField elements) exactly as they are meant to be?
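
If opening it in a text editor is inconvenient, then a few lines of Python will list them as well (a sketch; assumes a PMML 4.4 document, hence the namespace URI):

import xml.etree.ElementTree as ET

ns = {"pmml" : "http://www.dmg.org/PMML-4_4"}
root = ET.parse("pmml_file.pmml").getroot()
for data_field in root.findall("./pmml:DataDictionary/pmml:DataField", ns):
    # name, optype and dataType attributes
    print(data_field.attrib)
    # declared valid/invalid/missing values, if any
    for value in data_field.findall("pmml:Value", ns):
        print("\t", value.attrib)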

denmase commented 12 months ago

Hi,

I ran into the same case before, and it was also specific to XGBoost. In my case it was due to the precision of the numeric test data. I believe the evaluator uses single precision. I'd suggest truncating the numerics to just a few digits of precision and testing both using the same dataset.

vruusmann commented 12 months ago

I ran into the same case before, and it was also specific to XGBoost.

XGBoost is trickier than other algorithms, because it relies on its own missing/categorical data representations that are incompatible with Scikit-Learn defaults.

Hence, my first advice for debugging an XGBoost pipeline - "replace the final estimator step of a pipeline (XGBoost) with some other gradient boosting algorithm (GradientBoostingClassifier, LightGBM) and see if the error is resolved or not". This advice should be quick to act upon, and it will give a definite answer as to whether the current Python pipeline is compatible with XGBoost's assumptions or not.

While it's technically possible to find Python-vs-(J)PMML incompatibility in 2023, it would be an extremely improbable event. There's almost always an easier explanation available.

I believe the evaluator uses single precision

The XGBoost library/algorithm is operating with float32 values internally. If you work with its C(++) interface, then you'll be getting float32 data type results from all its API endpoints (eg. ClassifierMixin.predict_proba(X), RegressorMixin.predict(X)).

The Scikit-Learn wrapper for XGBoost is silently upcasting those float32 data type results to float64 data type results. IIRC, there is no way to suppress this upcast. Therefore, XGBClassifier and XGBRegressor appear to be giving float64-level predictions, but actually the majority of this "precision" (say, onwards from the sixth decimal place) is simply made up (by the float32 to float64 upcast operation).

I'd suggest truncating the numerics to just a few digits of precision and testing both using the same dataset.

The (J)PMML library stack is returning XGBoost predictions as the original float32 values. It does not add fictional precision.

When embedding verification datasets using the PMMLPipeline.verify(X) method, then one should take this Python-vs-(J)PMML incompatibility into consideration, and explicitly lower verification thresholds to float32 levels.

For example:

from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBRegressor

pipeline = PMMLPipeline([
  ("estimator", XGBRegressor())
])
pipeline.fit(X, y)
# THIS: reduce equivalence criteria from 1e-13 (float64 level) to 1e-6 .. 1e-7 (float32 level)
pipeline.verify(X, precision = 1e-6, zeroThreshold = 1e-6)
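
The same threshold applies when comparing predictions by hand. For a binary classifier, the check could look roughly like this (a sketch; the exact name of the probability column in the evaluator results depends on your target field and its categories):

import numpy

py_proba = PMML_PIPELINE.predict_proba(test_data)[:, 1]
pmml_results = PMML_MODEL.evaluateAll(test_data)
pmml_proba = pmml_results["probability(1)"].to_numpy()
# Agreement at float32 level is the realistic expectation for XGBoost models
print(numpy.allclose(py_proba, pmml_proba, atol = 1e-6))
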
denmase commented 12 months ago

The (J)PMML library stack is returning XGBoost predictions as the original float32 values. It does not add fictional precision.

Then it was probably the wrapper (I used a "wrapped" (J)PMML evaluator since it was part of a third-party solution; it was visually displayed as single precision in the data mapping dialog, although I made sure that the PMML was actually double). But this never happened with LightGBM or any other algorithm. The modeller insisted on using XGBoost, so I tested this scenario over and over; with some luck, I tried reducing the precision for both the development and test data, and only then did I get identical results.

P.S.: Sorry for jumping to an (assumed) conclusion in my earlier comment. It was not an apples-to-apples situation.

jhaneyrf commented 12 months ago

I have a couple things to report.

First, when I said that the observations don't match, technically what I meant is that abs(score2 - score1) > 0.000001. I don't know if that's relevant to the discussion about numerical precision, but I wanted to bring it up.

Second, I haven't had a chance to run an alternative model, but I have an older iteration of the XGBoost model to play with. In that model, it is easier to identify why the mismatch occurs. The lookup transformation I apply to one of the categorical variables is something like this:

CATEGORICAL_DICT = { "Level 0": 0.0, "Level 1": 1.0, "Level 2": 2.0, "Level 3": 3.0}

[...] LookupTransformer(CATEGORICAL_DICT, default_value=3.0)

On this version of the model, everything matches except where this categorical variable is missing. When I manually recode the missing values to be "Level 3", I can get the scores to match. Setting the missing values as numpy.nan or Python's None object both cause the same mismatch in scores.

We had wanted to have the missing values handled inside the PMML file to avoid some bugs that arose when we pushed our last model into production, but I will be satisfied if I can get a version of the actual model that works with such a minor workaround.

When I read the PMML file, the LookupTransformer appears to be implemented correctly in the generated markup, and running PREPROCESSOR.transform() produces the correct output as well, but in the combined pipeline the scores don't match.

So your hypothesis about XGBoost's missing value handling sounds like a very credible candidate for the problem. Outside model validators recommended that we not do missing value imputation in our model builds since XGBoost doesn't need it, but it looks like it's impossible to get an accurate PMML file without it.

I will still try to build a quick LightGBM model to confirm this.

jhaneyrf commented 12 months ago

I've identified what the discrepancy is between the pipeline scores and the PMML scores. The PMML scores and the pipeline scores only differ when one of my categorical variables is missing. The LookupTransformer objects appear to correctly impute the missing values in the data when used on their own and as part of a pipeline, but in the PMML file the XGBoost model appears to be receiving missing values instead of imputed values. Perhaps this is what you're referring to when you've talked about XGBoost's special approach to handling missing values.

Now that I understand the issue, I can design my production workflow with the PMML file that gets created. (I just have to make sure that the missing values are imputed before the PMML scoring starts.) But I'm happy to provide additional information if it would be helpful for your future development work for this package.
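
In practice the workaround is just a fillna on the affected column before handing the data to the evaluator, along these lines (the column name is made up here):

# Recode missing values to the level that the lookup would have mapped them to anyway
scoring_data = test_data.copy()
scoring_data['cat_feature_1'] = scoring_data['cat_feature_1'].fillna('Level 3')
pmml_scores = PMML_MODEL.predict(scoring_data)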

Thanks for your quick responses that helped me get to the bottom of this!

P.S. I don't feel the need to pursue this, but I did try running a LightGBM model. I got the following error when I tried to create the PMML file:

Exception in thread "main" java.lang.IllegalArgumentException: The transformer object (Python class lightgbm.basic.Booster) is not a supported Transformer
        at org.jpmml.python.CastFunction.apply(CastFunction.java:47)
        at sklearn.pipeline.Pipeline$1.apply(Pipeline.java:106)
        at sklearn.pipeline.Pipeline$1.apply(Pipeline.java:97)
        at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:631)
        at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:52)
        at sklearn.Composite.encodeFeatures(Composite.java:138)
        at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:154)
        at com.sklearn2pmml.Main.run(Main.java:82)
        at com.sklearn2pmml.Main.main(Main.java:65)
Caused by: java.lang.ClassCastException: Cannot cast lightgbm.sklearn.Booster to sklearn.Transformer
        at java.lang.Class.cast(Unknown Source)
        at org.jpmml.python.CastFunction.apply(CastFunction.java:45)
        ... 8 more

I had LightGBM version 3.3.4 installed, and sklearn2pmml version 0.97.0.

vruusmann commented 11 months ago

when I said that the observations don't match, technically what I meant is that abs(score2 - score1) > 0.000001.

The value of 0.000001 (aka 1e-6) is already very close to the inherent precision of the float32 data type (around the 0 value). For XGBoost models, you should set your "bad prediction threshold" one or two orders of magnitude higher, around 1e-5 or so. When you see predictions that miss by more than that, then there's real cause for concern.

Also, a technical note about reproducing XGBoost predictions on PMML.

Most predictions involve two computational stages:

  1. Summing the scores of member decision trees.
  2. Applying some sort of transformation to the above sum. For example, binary classifiers apply the inverse logit function there.

The first computational stage is perfectly reproducible between XGBoost and PMML. For example, all regression-type objectives should yield identical predictions (ie. within one ULP of each other).

The second stage may have some "systematic error" in it. For example, XGBoost uses the exp() function, which does not produce bit-identical results to Java's Math.exp() function (even on CPU platforms, not to mention GPU platforms). To make things worse, XGBoost tends to perform some bits of the computation using float32 values and some others with float64 values, which again arbitrarily adds/loses numeric precision.

The second stage is not compatible even between different XGBoost versions.
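
For a binary classifier, the two stages can be pulled apart like this (a sketch; X and y stand in for a binary-target training dataset):

import numpy

from scipy.special import expit
from xgboost import XGBClassifier

clf = XGBClassifier(objective = "binary:logistic")
clf.fit(X, y)

# Stage one: the summed scores of the member decision trees (reproducible exactly)
margin = clf.predict(X, output_margin = True)
# Stage two: the inverse logit transformation; the exp() call inside expit() is
# where small platform-dependent differences creep in
proba = expit(margin)
print(numpy.max(numpy.abs(proba - clf.predict_proba(X)[:, 1])))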

vruusmann commented 11 months ago

The PMML scores and the pipeline scores only differ when one of my categorical variables is missing.

You had your LookupTransformer object defined like this:

from sklearn2pmml.preprocessing import LookupTransformer

CATEGORICAL_DICT = {
  "Level 0": 0.0,
  "Level 1": 1.0,
  "Level 2": 2.0,
  "Level 3": 3.0
}
transformer = LookupTransformer(CATEGORICAL_DICT, default_value=3.0)

The important thing to note is that the LookupTransformer.default_value attribute does not deal with missing values at all! It only handles invalid values, aka previously unseen category levels!

See the description of the MapValues element: https://dmg.org/pmml/v4-4-1/Transformations.html#xsdElement_MapValues

If you'd like to use LookupTransformer for handling missing values as well, then we would need to define a new LookupTransformer.map_missing_to attribute/mechanism for that.

Right now, I would assume that when you apply LookupTransformer.transform(X) to a Python None value (the default for missing object data type features), then you should also be getting None back as the transformation result. If you get 3.0, then it's an error.
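
A quick way to check the behaviour on your end (a sketch; a one-dimensional object column is used here for simplicity, reusing the transformer defined above):

import pandas

x = pandas.Series(["Level 1", "Level 99", None], dtype = object)
# Expected: 1.0 for the valid value, 3.0 for the invalid value ("Level 99"),
# and None/NaN (not 3.0!) for the missing value
print(transformer.transform(x))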

vruusmann commented 11 months ago

Closing this issue as "mostly figured out".

@jhaneyrf Please post your findings regarding the LookupTransformer into this new dedicated issue: https://github.com/jpmml/sklearn2pmml/issues/395

Perhaps a small Python script which shows how this transformer is behaving on your end wrt missing and invalid values. These two value spaces should transform differently.

vruusmann commented 10 months ago

The original issue was mostly caused by LookupTransformer replacing missing values with LookupTransformer.default_value values, whereas it should have been returning missing values unchanged.

This issue has been fixed in SkLearn2PMML version 0.99.1.