Closed git20190108 closed 7 months ago
Fixed the formatting for you. According to GitHub Markdown conventions, you should surround code blocks with three backtick symbols, and inline code fragments with a single backtick symbol.
If there is a problem, and you solve it by manually editing the PMML document, then this typically indicates a converter-side bug, not an evaluator-side bug.
Therefore, I'm moving this issue over to the JPMML-SkLearn project, because this is the component that is actually responsible for generating `OutputField` element names and making sure that they are properly scoped.
Bad scoping of LGBMClassifier probability output fields within StackingClassifier? -- Yes. The first LGBMClassifier always returns 0 probability, and the x-report does not work.
The first LGBMClassifier always returns 0 probability

If you move the `LGBMClassifier` to the second position, does it work then?
Anyway, I will be creating a small test script to reproduce this issue on my own computer. Perhaps it affects all third-party classifiers, such as H2O, LightGBM and XGBoost.
One thing that intrigues me is that the converter is unable to detect the output field scoping issue. This PMML document should fail already in the conversion phase.
The first LGBMClassifier always returns 0 probability

If you move the `LGBMClassifier` to the second position, does it work then?
-- No, it still does not work. The position is not the reason.
Here's my test script - train a stacking classifier for a binary classification problem using SkLearn, LightGBM and XGBoost classifiers, then convert it to a PMML document, and then load and evaluate this PMML document using the JPMML-Evaluator-Python package:
```python
from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

X, y = load_iris(return_X_y = True, as_frame = True)

# Convert to binary classification problem
y = (y == 1)

classifier = StackingClassifier(
    estimators = [
        ("sklearn", LogisticRegression()),
        ("lightgbm", LGBMClassifier(n_estimators = 3)),
        ("xgboost", XGBClassifier(n_estimators = 3))
    ],
    final_estimator = LogisticRegression()
)
classifier.fit(X, y)

from sklearn2pmml import sklearn2pmml

sklearn2pmml(classifier, "StackingClassifier.pmml")

from jpmml_evaluator import make_evaluator

evaluator = make_evaluator("StackingClassifier.pmml", reporting = True, backend = 'py4j') \
    .verify()

X_pmml = DataFrame(X.values, columns = X.columns.values.tolist())

yt = evaluator.evaluateAll(X_pmml)
print(yt)
```
Works absolutely flawlessly. The LightGBM classifier can be moved to any position within the stacking classifier, and everything keeps working just like before.
@git20190108 The burden of proof is now on you - please take my test script, and "break it" so that it would start giving the same error that you were seeing in your own script before.
Is this normal? test.pmml.txt
Added the PyPMML result.
Is this normal?

Think I got your question now - "when I try to export the intermediate results of the stacking ensemble classifier, then why do they show up as 0 values in the JPMML-Evaluator-Python results?"
Very interesting indeed. Am exploring.
Is this normal?

Think I got your question now - "when I try to export the intermediate results of the stacking ensemble classifier, then why do they show up as 0 values in the JPMML-Evaluator-Python results?"
Yes. Due to the wrong intermediate results, the final result is also wrong. Apparently, PyPMML gets the correct result with the same file.
Due to the wrong intermediate results, the final result is also wrong.
Seems like a data transfer error somewhere in the Python wrapper.
Because when I evaluate the same PMML document with JPMML-Evaluator command-line application, I get predictions that match SkLearn native predictions, plus all the intermediate LightGBM, XGBoost etc. values.
Due to the wrong intermediate results, the final result is also wrong.
Seems like a data transfer error somewhere in the Python wrapper.
Because when I evaluate the same PMML document with JPMML-Evaluator command-line application, I get predictions that match SkLearn native predictions, plus all the intermediate LightGBM, XGBoost etc. values.
Yes, only this part is wrong; it seems this part can't get the correct value.
```xml
<Output>
    <OutputField name="predict_proba(1, true)" optype="continuous" dataType="double" feature="probability" value="true" isFinalResult="false"/>
</Output>
```
This issue is about two things.
First, the JPMML-SkLearn converter library is generating incorrect PMML documents for both `StackingClassifier` and `StackingRegressor` estimator types. The problem is that the name of the target field is being passed by the top-level stacking estimator to its member estimators. Instead, it should be "anonymizing" the schema, so that member estimators get to see an "anonymized" target field (ie. the name is `null`).
The fix is straightforward - simply replace `schema` with `schema.toSegmentSchema()` on this line:
https://github.com/jpmml/jpmml-sklearn/blob/1.7.47/pmml-sklearn/src/main/java/sklearn/ensemble/stacking/StackingUtil.java#L56
Existing PMML documents can be fixed by simply deleting the `<MiningField name="y" usageType="target"/>` fragment from member model schemas. This declaration is only permitted with the top-level model element (ie. `/PMML/MiningModel`).
Second, the JPMML-Evaluator-Python gets confused when it is requested to re-define the target field over and over again (first with the member models "sklearn", "lightgbm" and "xgboost"; and then finally at the top level). Right now, it simply retains and returns the first (partial) definition.
According to the PMML specification, it should be an error to re-define the value of some field when moving from one model chain element to another.
Therefore, the correct behaviour for any PMML engine would be to fail with an error here. The JPMML-Evaluator Java library is not doing it, which needs fixing. Its Python wrapper is currently even worse, because it returns a partial result.
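The re-definition rule can be illustrated with a minimal sketch in plain Python (this is illustrative pseudocode, not the actual JPMML-Evaluator API): a conforming engine collects result fields segment by segment, and fails the moment a later model chain element tries to re-define a field that an earlier element has already produced.

```python
# Minimal sketch of the PMML model chain re-definition rule described above.
# Hypothetical helper, not part of any JPMML library.
def collect_chain_results(segments):
    """segments: list of (segment_id, {field_name: value}) pairs,
    in model chain evaluation order."""
    results = {}
    for segment_id, fields in segments:
        for name, value in fields.items():
            if name in results:
                # A conforming PMML engine should fail here, instead of
                # silently keeping the first (partial) definition
                raise ValueError("Field {!r} is re-defined in segment {!r}".format(name, segment_id))
            results[name] = value
    return results
```

With the faulty "StackingClassifier.pmml" layout, the target field "y" is produced by every member segment in turn, so a check like this would fail at the second segment already.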
TLDR: There are fixes needed in two locations - the JPMML-SkLearn converter, and the JPMML-Evaluator library (plus its Python wrapper).
The fact that PyPMML "works" is no argument, because PyPMML does not perform any PMML document sanity/validity checks on its own. It's too stupid for that.
Existing PMML documents can be fixed by simply deleting the `<MiningField name="y" usageType="target"/>` fragment from member model schemas

The above test script produces a "StackingClassifier.pmml" file. When this file is opened in a text editor, and the offending `MiningField` elements are deleted manually (I see five of them), then the JPMML-Evaluator-Python makes correct predictions (including the export of intermediate probabilities) already now.
I will use your method to fix the previous script. Thank you for your patient explanation, and I look forward to your fixing these issues.
I will use your method to fix the previous script
Using my "StackingClassifier.pmml" file as an example:
You should keep:
/PMML/MiningModel/MiningSchema/MiningField@name="y"
ie. the very first occurrence/PMML/MiningModel/Segmentation/Segment@id="4"/RegressionModel/MiningSchema/MiningField@name="y"
ie. the very last occurrenceYou should delete:
/PMML/MiningModel/Segmentation/Segment@id="1"
/PMML/MiningModel/Segmentation/Segment@id="2"
/PMML/MiningModel/Segmentation/Segment@id="3"
This keep/delete transformation can probably be automated using an XSLT stylesheet. But I'm too lazy to work on it now.
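For illustration, here is what such an automated keep/delete transformation might look like in plain Python instead of XSLT, using the standard-library `xml.etree.ElementTree` module. This is a hedged sketch, not a tested tool: the function name and file names are hypothetical, the target field name `"y"` is taken from the example above, and the PMML namespace is an assumption that may differ between converter versions (check the `xmlns` attribute of your own file).

```python
import xml.etree.ElementTree as ET

# Assumed namespace; recent SkLearn2PMML versions emit PMML 4.4 documents
PMML_NS = "http://www.dmg.org/PMML-4_4"
ET.register_namespace("", PMML_NS)

def keep_first_and_last_target_field(infile, outfile, target = "y"):
    tree = ET.parse(infile)
    root = tree.getroot()
    # ElementTree has no parent pointers, so build a child-to-parent map
    parents = {child: parent for parent in root.iter() for child in parent}
    tag = "{{{}}}MiningField".format(PMML_NS)
    # Collect all MiningField elements for the target, in document order
    fields = [elem for elem in root.iter(tag) if elem.get("name") == target]
    # Keep the very first and the very last occurrence, delete the rest
    for field in fields[1:-1]:
        parents[field].remove(field)
    tree.write(outfile, xml_declaration = True, encoding = "UTF-8")
```

For example, `keep_first_and_last_target_field("StackingClassifier.pmml", "StackingClassifier-fixed.pmml")` would leave only the top-level and final-segment target declarations in place.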
I will fix the conversion part of this issue in the next SkLearn2PMML package release. Probably sometime next week.
@git20190108 You shall receive a GitHub notification when this issue gets closed. After that, update your SkLearn2PMML package version, and everything should work fine.
Also, thanks for spotting and reporting this issue to me! Much appreciated.
Hi @vruusmann,

I found a problem with passing the scope value. The XML content of

`<OutputField name="predict_proba(0, 1)" optype="continuous" dataType="double" feature="probability" value="1" isFinalResult="false"/>`

in my PMML file is invalid; after I rewrite the value, it works. Only the LightGBM predict_proba always returns 0, the other models are ok. It seems the package can't recognize the value of predict_proba(0, 1).

before:

after:

file detail:
packages:

- jpmml-evaluator-python: 0.10.1
- Java: "1.8.0_211"
- Python: 3.9.17
script:
detail.txt