jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

PMML version downgrade is blocked by `Version#XPMML`-annotated vendor extension markup #433

Status: Closed (pchitimi closed this issue 2 weeks ago)

pchitimi commented 2 months ago

Hello! I am seeing the following issue when attempting to export a model Pipeline to PMML 4.3. I am uncertain if the model requires at least 4.4 or if there are other issues at play here.

Exception in thread "main" java.lang.UnsupportedOperationException
    at org.dmg.pmml.Version$1.getVersion(Version.java:23)
    at com.sklearn2pmml.Main.run(Main.java:107)
    at com.sklearn2pmml.Main.main(Main.java:84)

Using the debug flag, the output I observe is as follows:

python: 3.10.14
sklearn2pmml: 0.110.0
sklearn: 1.3.2
pandas: 2.2.2
numpy: 1.26.4
dill: 0.3.8
joblib: 1.4.2
openjdk: 21.0.4
Executing command:
java -cp /opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/sklearn2pmml-1.0-SNAPSHOT.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/gson-2.10.1.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/guava-33.0.0-jre.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/h2o-genmodel-3.46.0.4.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/h2o-logger-3.46.0.4.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/h2o-tree-api-0.3.17.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/istack-commons-runtime-4.0.1.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/jackson-annotations-2.13.3.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/jakarta.activation-2.0.1.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/jakarta.xml.bind-api-3.0.1.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/jaxb-core-3.0.2.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/jaxb-runtime-3.0.2.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/jcommander-1.72.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pickle-1.5.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-converter-1.5.6.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-h2o-1.2.12.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-lightgbm-1.5.3.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-model-1.6.4.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-model-metro-1.6.4.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-python-1.2.2.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-sklearn-1.8.4.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-sklearn
-extension-1.8.4.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-sklearn-h2o-1.8.4.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-sklearn-lightgbm-1.8.4.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-sklearn-statsmodels-1.8.4.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-sklearn-xgboost-1.8.4.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-statsmodels-1.1.0.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/pmml-xgboost-1.8.5.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/serpent-1.40.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/slf4j-api-1.7.36.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.36.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/ubjson-0.1.8.jar:/opt/anaconda3/lib/python3.10/site-packages/sklearn2pmml/resources/ubjson-gson-0.1.8.jar com.sklearn2pmml.Main --pkl-input /var/folders/sh/qjnmlk9d3271rj192qp33swc0000gr/T/estimator-ebmpdaxc.pkl.z --pmml-output pda_xgb_raw.pmml --pmml-schema 4.3
Standard output is empty
Standard error:
Exception in thread "main" java.lang.UnsupportedOperationException
    at org.dmg.pmml.Version$1.getVersion(Version.java:23)
    at com.sklearn2pmml.Main.run(Main.java:107)
    at com.sklearn2pmml.Main.main(Main.java:84)

Thank you for your assistance!

vruusmann commented 2 months ago

> I am seeing the following issue when attempting to export a model Pipeline to PMML 4.3.
>
> Exception in thread "main" java.lang.UnsupportedOperationException at org.dmg.pmml.Version$1.getVersion(Version.java:23)

This exception is thrown by the Version#XPMML special enum constant: https://github.com/jpmml/jpmml-model/blob/1.6.5/pmml-model/src/main/java/org/dmg/pmml/Version.java#L16-L25

It means that your model may be PMML 4.3 compatible, but it "contains" some vendor extensions - XML markup (typically, some XML attribute) which is not part of the PMML specification.

Anyway, the good news is that if you are using JPMML converters (on the Python ML side), and JPMML evaluators (on the Java application side), then this vendor extension is likely to be recognized/supported in both PMML 4.3 and 4.4 modes.

The SkLearn2PMML package should contain special logic for dealing with vendor extensions. The Version#XPMML enum constant is not a standalone PMML version per se. It's more like a "mask" on top of some valid PMML version such as PMML 4.3 or 4.4 (to be interpreted as "PMML 4.3 with some JPMML-specific attributes").

vruusmann commented 2 months ago

Now, thinking about this issue, I can think of the following improvements:

vruusmann commented 2 months ago

@pchitimi What you can try right now to clarify the situation: export your model using the default (i.e. latest) PMML schema version, and open it in a text editor; then search for XML elements and attributes whose names start with "x-" (letter "X" followed by a hyphen). How many/which can you find?

If it's only one or two pieces of markup, we can verify them together, and you can then proceed to perform the version downgrade manually - by editing the XML namespace declaration.
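The search described above can also be scripted instead of done by hand in a text editor. The following is a minimal sketch using only the Python standard library; the helper names (`local_name`, `count_vendor_extensions`) and the inline sample document are hypothetical and are not part of SkLearn2PMML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def local_name(qualified_name):
    """Strip the '{namespace}' prefix that ElementTree puts in front of names."""
    return qualified_name.rsplit("}", 1)[-1]

def count_vendor_extensions(pmml_text):
    """Count elements and attributes whose local name starts with 'x-'."""
    counts = Counter()
    for element in ET.fromstring(pmml_text).iter():
        tag = local_name(element.tag)
        if tag.startswith("x-"):
            counts["<" + tag + ">"] += 1
        for attribute in element.attrib:
            name = local_name(attribute)
            if name.startswith("x-"):
                counts[tag + "@" + name] += 1
    return counts

# Tiny hand-written stand-in for an exported PMML document (not real output)
sample = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <MiningModel functionName="regression" x-mathContext="float">
    <TreeModel functionName="regression" x-mathContext="float"/>
  </MiningModel>
</PMML>"""

for name, count in sorted(count_vendor_extensions(sample).items()):
    print(name, count)
```

For a real export, read the file contents into a string and pass that to `count_vendor_extensions` in place of `sample`.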

vruusmann commented 2 months ago

Thinking about this issue, I can think of the following improvements:

Also, perhaps the version downgrade functionality should be available as a separate SkLearn2PMML utility function.

This functionality potentially has many controlling options. Adding them as extra parameters to the main sklearn2pmml.sklearn2pmml utility function would complicate it too much.

pchitimi commented 2 months ago

Thank you very much for the detailed response including the potential improvement paths @vruusmann!

As per your guidance, I was able to identify 4 unique (97 total) XML element/attributes whose name starts with "x-":

<MiningModel functionName="regression" x-mathContext="float">
<MiningModel functionName="classification" algorithmName="XGBoost (GBTree)" x-mathContext="float">
<RegressionModel functionName="classification" normalizationMethod="logit" x-mathContext="float">
<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction" x-mathContext="float">

vruusmann commented 2 months ago

> I was able to identify 4 unique XML element/attributes whose name starts with "x-"

They are all <Model>@x-mathContext attributes, which instruct the JPMML evaluator to carry out all model-internal computations using 32-bit floating point data type/math operations (the default would be 64-bit).

Fundamentally, this particular attribute can be omitted without breaking the underlying model (the predicted results will come out with extra precision, which qualifies as "noise"). It is a very ancient vendor extension, which should be recognized by all JPMML-Evaluator 1.4.X and newer versions.
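To illustrate the "extra precision" point, here is a small sketch that adds up the same numbers with 32-bit and with 64-bit rounding. It is not how any JPMML evaluator is implemented; the helper names and the sample values (standing in for tree leaf scores) are hypothetical.

```python
import struct

def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]

def sum_float32(values):
    """Accumulate values, rounding every intermediate result to 32 bits."""
    total = 0.0
    for value in values:
        total = to_float32(total + to_float32(value))
    return total

# Hypothetical leaf scores of a small boosted ensemble
leaf_scores = [0.1, 0.2, 0.3]

print("32-bit sum:", sum_float32(leaf_scores))
print("64-bit sum:", sum(leaf_scores))
```

The two sums agree in the leading digits and differ only far behind the decimal point, which is the kind of difference described above as "noise".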

Anyway, my expectation is that the SkLearn2PMML package should never fail because of the <Model>@x-mathContext attribute.

The trouble is that this attribute is always present for XGBoost models.
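If a non-JPMML consumer ever rejects this attribute, it could be stripped from the exported text, since it can be omitted without breaking the model. A minimal sketch, assuming the attribute is always written as `x-mathContext="..."` with double quotes; `strip_math_context` is a hypothetical helper, not a SkLearn2PMML function.

```python
import re

# Matches the attribute together with the whitespace that precedes it
X_MATH_CONTEXT = re.compile(r'\s+x-mathContext="[^"]*"')

def strip_math_context(pmml_text):
    """Return (cleaned_text, number_of_attributes_removed)."""
    return X_MATH_CONTEXT.subn("", pmml_text)

# Two of the element start tags reported earlier in this thread
markup = "\n".join([
    '<MiningModel functionName="regression" x-mathContext="float">',
    '<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction" x-mathContext="float">',
])

cleaned, removed = strip_math_context(markup)
print(removed, "attribute(s) removed")
print(cleaned)
```

With JPMML evaluators on the Java side this step should not be needed, because the attribute is recognized there.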

pchitimi commented 2 months ago

Gotcha, just to make sure I understand:

Is my understanding correct or did I miss anything?

vruusmann commented 2 months ago

> Is my understanding correct or did I miss anything?

Yes, these two changes should achieve the "PMML schema version downgrade" from 4.4 to 4.3 for XGBoost models.
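The manual edit discussed in this thread (the XML namespace declaration, plus the PMML@version attribute) could be sketched as a plain text substitution. This assumes the standard PMML namespace URIs of the form `http://www.dmg.org/PMML-4_4`; the function name and the sample header are hypothetical.

```python
import re

def downgrade_pmml_header(pmml_text, source="4.4", target="4.3"):
    """Rewrite the PMML namespace URI and the root version attribute."""
    source_ns = "http://www.dmg.org/PMML-" + source.replace(".", "_")
    target_ns = "http://www.dmg.org/PMML-" + target.replace(".", "_")
    text = pmml_text.replace(source_ns, target_ns)
    # Only touch the version attribute of the root <PMML> element
    pattern = r'(<PMML\b[^>]*\bversion=")' + re.escape(source) + '"'
    return re.sub(pattern, r"\g<1>" + target + '"', text, count=1)

# Hand-written stand-in for the first lines of an exported document
header = (
    '<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">\n'
    '  <MiningModel functionName="regression" x-mathContext="float"/>\n'
    '</PMML>'
)

print(downgrade_pmml_header(header))
```

This only relabels the document. It does not check whether the remaining markup is actually valid under the older schema, so the result should still be reviewed.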

For comparison, you may train a toy LightGBM model (structurally very similar to XGBoost models), and do the following:

Then diff these two files (e.g. using the command-line diff tool) - you will see exactly what was changed, line by line. It should be the XML namespace URL and the PMML@version attribute value (the latter being non-critical).

LightGBM models don't need the <Model>@x-mathContext attribute, so the conversion should succeed every time.
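The line-by-line comparison can also be done from Python with the standard `difflib` module instead of the command-line diff tool. A minimal sketch; the file labels and the two inline documents are hypothetical stand-ins for the two exported files.

```python
import difflib

def changed_lines(before_text, after_text):
    """Return only the added/removed lines of a unified diff."""
    diff = difflib.unified_diff(
        before_text.splitlines(),
        after_text.splitlines(),
        fromfile="model-4.4.pmml",  # hypothetical label
        tofile="model-4.3.pmml",    # hypothetical label
        lineterm="",
        n=0,  # no surrounding context lines
    )
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

# Hand-written stand-ins for the two exported files
latest = '<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">\n  <Header/>\n</PMML>'
downgraded = '<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">\n  <Header/>\n</PMML>'

for line in changed_lines(latest, downgraded):
    print(line)
```

To compare real files, read each one into a string first and pass the two strings to `changed_lines`.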

vruusmann commented 1 week ago

I've just released SkLearn2PMML 0.111.1 to PyPI, which permits the PMML schema version downgrade to proceed even if there are incompatibilities around.

The full list of incompatibilities is printed to the console; the person performing the conversion can review and correct them manually if necessary.

For example, the console log when converting/downgrading an example XGBAudit model to PMML 4.3:

SEVERE: The PMML object has 2 incompatibilities with the requested PMML schema version:
WARNING: Attribute with value Segmentation@missingPredictionTreatment=returnMissing is not supported (2 cases)

The above log shows that there are two Segmentation@missingPredictionTreatment attributes around, which are not part of PMML 4.3 (they were introduced in PMML 4.4). However, this qualifies as an "ignorable" incompatibility, because the presence/absence of this attribute does not change the actual prediction logic - it's a modifier that instructs the predictor to safely/cleanly terminate the prediction process if the first stage of the XGBoost model yielded a missing result (due to one or more missing input values).