jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Bad scoping of target field(s) in stacking estimators #192

Closed git20190108 closed 7 months ago

git20190108 commented 7 months ago

Hi @vruusmann,

I have found a problem with how the scope value is passed. The fragment <OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="probability" value="1" isFinalResult="false"/> in my PMML file is invalid; after I rewrite it, it works. Only the LGBM predict_proba always returns 0, while the other models are OK. It seems the package cannot resolve the value of predict_proba(0, 1).

before:

<OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="probability" value="1" isFinalResult="false"/>

after:

<OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="transformedValue">
    <FieldRef field="probability(1)"/>
</OutputField>

file detail:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
    <Header>
        <Application name="SkLearn2PMML package" version="0.100.0"/>
        <MiningBuildTask>
        <Extension name="repr">PMMLPipeline(steps=[('Stacking_model', StackingClassifier(estimators=[('lgbm',
                                LGBMClassifier(***)),
                               ('rf',
                                RandomForestClassifier***)),
                               ('MLP',
                                MLPClassifier(***)),
                               ('GNB', GaussianNB(***))],
                   final_estimator=LogisticRegression(***)))])</Extension>
        </MiningBuildTask>

    <RegressionTable intercept="-2.507414415" targetCategory="1">
        <NumericPredictor name="predict_proba(0, 1)" coefficient="3"/>
        <NumericPredictor name="predict_proba(1, 1)" coefficient="1"/>
        <NumericPredictor name="predict_proba(2, 1)" coefficient="2"/>
        <NumericPredictor name="predict_proba(3, 1)" coefficient="5"/>
    </RegressionTable>
</PMML>

Packages: jpmml-evaluator-python 0.10.1, Java 1.8.0_211, Python 3.9.17

script:

from jpmml_evaluator import make_evaluator

evaluator = make_evaluator('***.pmml', reporting = True, backend = 'py4j').verify()
evaluator.evaluate(input1)

detail.txt

git20190108 commented 7 months ago

(screenshot attached)

vruusmann commented 7 months ago

Fixed the formatting for you. According to GitHub Markdown conventions, you should surround code blocks with three backtick symbols, and inline code fragments with a single backtick symbol.

vruusmann commented 7 months ago

If there is a problem, and you solve it by manually editing the PMML document, then this typically indicates a converter-side bug, not an evaluator-side bug.

Therefore, I'm moving this issue over to the JPMML-SkLearn project, because this is the component that is actually responsible for generating OutputField element names and making sure that they are properly scoped.

git20190108 commented 7 months ago

Bad scoping of LGBMClassifier probability output fields within StackingClassifier?

Yes. The first LGBMClassifier always returns a probability of 0, and the x-report does not work either.

vruusmann commented 7 months ago

The first LGBMClassifier always return 0 probability

If you move the LGBMClassifier to the second position, does it work then?

Anyway, I will be creating a small test script to reproduce this issue on my own computer. Perhaps it affects all third-party classifiers, such as H2O, LightGBM and XGBoost.

One thing that intrigues me is that the converter was unable to detect this output field scoping issue. This PMML document should have failed already in the conversion phase.

git20190108 commented 7 months ago

The first LGBMClassifier always return 0 probability

If you move the LGBMClassifier to the second position, does it work then?

No, it still does not work. The position is not the reason.

vruusmann commented 7 months ago

Here's my test script - train a stacking classifier for a binary classification problem using SkLearn, LightGBM and XGBoost classifiers, then convert it to a PMML document, and then load and evaluate this PMML document using the JPMML-Evaluator-Python package:

from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

X, y = load_iris(return_X_y = True, as_frame = True)
# Convert to binary classification problem
y = (y == 1)

classifier = StackingClassifier(
    estimators = [
        ("sklearn", LogisticRegression()),
        ("lightgbm", LGBMClassifier(n_estimators = 3)),
        ("xgboost", XGBClassifier(n_estimators = 3))
    ],
    final_estimator = LogisticRegression()
)
classifier.fit(X, y)

from sklearn2pmml import sklearn2pmml

sklearn2pmml(classifier, "StackingClassifier.pmml")

from jpmml_evaluator import make_evaluator

evaluator = make_evaluator("StackingClassifier.pmml", reporting = True, backend = 'py4j') \
    .verify() 

X_pmml = DataFrame(X.values, columns = X.columns.values.tolist())

yt = evaluator.evaluateAll(X_pmml)
print(yt)

Works absolutely flawlessly. The LightGBM classifier can be moved to any position within the stacking classifier, and everything keeps working just like before.

vruusmann commented 7 months ago

@git20190108 The burden of proof is now on you - please take my test script, and "break it" so that it would start giving the same error that you were seeing in your own script before.

git20190108 commented 7 months ago

@git20190108 The burden of proof is now on you - please take my test script, and "break it" so that it would start giving the same error that you were seeing in your own script before.

Is this normal? (screenshot attached) test.pmml.txt

git20190108 commented 7 months ago

Adding the PyPMML result for comparison: (screenshot attached)

vruusmann commented 7 months ago

Is this normal?

I think I got your question now: "when I try to export the intermediate results of the stacking ensemble classifier, then why do they show up as 0 values in the JPMML-Evaluator-Python results?"

Very interesting indeed. Am exploring.

git20190108 commented 7 months ago

Is this normal?

I think I got your question now: "when I try to export the intermediate results of the stacking ensemble classifier, then why do they show up as 0 values in the JPMML-Evaluator-Python results?"

Yes. Due to the wrong intermediate results, the final result is also wrong. Apparently, PyPMML gets the correct result with the same file.

vruusmann commented 7 months ago

Due to the wrong intermediate results, the final result is also wrong.

Seems like a data transfer error somewhere in the Python wrapper.

Because when I evaluate the same PMML document with JPMML-Evaluator command-line application, I get predictions that match SkLearn native predictions, plus all the intermediate LightGBM, XGBoost etc. values.

git20190108 commented 7 months ago

Due to the wrong intermediate results, the final result is also wrong.

Seems like a data transfer error somewhere in the Python wrapper.

Because when I evaluate the same PMML document with JPMML-Evaluator command-line application, I get predictions that match SkLearn native predictions, plus all the intermediate LightGBM, XGBoost etc. values.

Yes, only this part is wrong; it seems this part cannot get the correct value:

<Output>
    <OutputField name="predict_proba(1, true)" optype="continuous" dataType="double" feature="probability" value="true" isFinalResult="false"/>
</Output>

vruusmann commented 7 months ago

This issue is about two things.

First, the JPMML-SkLearn converter library is generating incorrect PMML documents for both StackingClassifier and StackingRegressor estimator types. The problem is that the name of the target field is being passed by the top-level stacking estimator to its member estimators. Instead, it should "anonymize" the schema, so that member estimators get to see an "anonymized" target field (ie. the name is null).

The fix is straightforward: simply replace schema with schema.toSegmentSchema() on this line: https://github.com/jpmml/jpmml-sklearn/blob/1.7.47/pmml-sklearn/src/main/java/sklearn/ensemble/stacking/StackingUtil.java#L56

Existing PMML documents can be fixed by simply deleting the <MiningField name="y" usageType="target"/> fragment from member model schemas. This declaration is only permitted in the top-level model element (ie. /PMML/MiningModel).

Second, the JPMML-Evaluator-Python gets confused when it is requested to re-define the target field over and over again (first by the member models "sklearn", "lightgbm" and "xgboost", and then finally at the top level). Right now, it simply retains and returns the first (partial) definition.

According to the PMML specification, it should be an error to re-define the value of some field when moving from one model chain element to another.

Therefore, the correct behaviour for any PMML engine would be to fail with an error here. The JPMML-Evaluator Java library is not doing it, which needs fixing. Its Python wrapper is currently even worse, because it returns a partial result.
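The missing check could look roughly like this. A hedged sketch (names and structure are my own, not the actual JPMML-Evaluator code) that scans a model chain and fails when more than one sibling model declares the same target field:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.dmg.org/PMML-4_4}"

def check_target_redefinition(pmml_text):
    """Raise ValueError if two sibling models inside a model chain both
    declare the same field as their target (a hypothetical validity check)."""
    root = ET.fromstring(pmml_text)
    for segmentation in root.iter(NS + "Segmentation"):
        if segmentation.get("multipleModelMethod") != "modelChain":
            continue
        seen = set()
        for segment in segmentation.findall(NS + "Segment"):
            for field in segment.iter(NS + "MiningField"):
                if field.get("usageType") == "target":
                    name = field.get("name")
                    if name in seen:
                        raise ValueError(
                            "Target field %r is re-defined within the model chain" % name)
                    seen.add(name)
```

A real engine would perform this as part of model compilation, before any evaluation takes place.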

vruusmann commented 7 months ago

TLDR: There are fixes needed in two locations:

  1. The JPMML-SkLearn library should "anonymize" the schema before passing it from the parent/top-level model to child/member models.
  2. The JPMML-Evaluator library should error out when it is presented with a model chain, where sibling models attempt to re-define the value of a target field (IIRC, right now it only checks for the re-definition of output fields).

The fact that PyPMML "works" is no argument, because PyPMML does not perform any PMML document sanity/validity checks on its own. It's too stupid for that.

vruusmann commented 7 months ago

Existing PMML documents can be fixed by simply deleting the <MiningField name="y" usageType="target"/> fragment from member model schemas

The above test script produces a "StackingClassifier.pmml" file. When this file is opened in a text editor, and the offending MiningField elements are deleted manually (I see five of them), then JPMML-Evaluator-Python already makes correct predictions (including the export of intermediate probabilities).

git20190108 commented 7 months ago

Existing PMML documents can be fixed by simply deleting the <MiningField name="y" usageType="target"/> fragment from member model schemas

The above test script produces a "StackingClassifier.pmml" file. When this file is opened in a text editor, and the offending MiningField elements are deleted manually (I see five of them), then JPMML-Evaluator-Python already makes correct predictions (including the export of intermediate probabilities).

I will use your method to fix the previous script. Thank you for your patient explanation, and I look forward to your fixing these issues.

vruusmann commented 7 months ago

I will use your method to fix the previous script

Using my "StackingClassifier.pmml" file as an example:

You should keep:

You should delete:

This keep/delete transformation can probably be automated using an XSLT stylesheet. But I'm too lazy to work on it now.

I will fix the conversion part of this issue in the next SkLearn2PMML package release. Probably sometime next week.

@git20190108 You shall receive a GitHub notification when this issue gets closed. After that, update your SkLearn2PMML package version, and everything should work fine.

Also, thanks for spotting and reporting this issue to me! Much appreciated.