jpmml / jpmml-evaluator

Java Evaluator API for PMML
GNU Affero General Public License v3.0
895 stars 255 forks source link

Default value handling for constant evaluation #217

Closed w23html closed 3 years ago

w23html commented 3 years ago

I used to have a model (built with 1.4.x version of JPMML library) with the following decision tree function:

<DerivedField  name="ReplaceRegexVar"  optype="categorical" dataType="string">
      <Apply function="replace" >
              <FieldRef field="inputA"/>
              <Constant>Z+</Constant>
              <Constant></Constant>
      </Apply>
</DerivedField>

Currently when I do version upgrade for the PMML library from 1.4.x to 1.5.x, I found that the function behavior was changed:

As I understanding this seems to be a breakable change for the old model compatibility? Can someone help to confirm which should be the correct behavior?

vruusmann commented 3 years ago

As I understanding this seems to be a breakable change for the old model compatibility?

The PMML schema version 4.4 extended the Constant element with the optional missing attribute: http://mantis.dmg.org/view.php?id=216 http://dmg.org/pmml/v4-4-1/Transformations.html#xsdElement_Constant

The interpretation of <Constant></Constant> now varies depending on the absence/presence of dataType and missing attributes.

Can someone help to confirm which should be the correct behavior?

The 1.4.X version is generating PMML schema version 4.3 documents, where <Constant></Constant> means an empty string.

The 1.5.X version is generating PMML schema version 4.4 documents, where <Constant></Constant> means a missing string. However, I believe that there's an extra dataType attribute available now like this <Constant dataType="string"></Constant>, which would change the interpretation back to an empty string.

vruusmann commented 3 years ago

The org.jpmml.model.filters.ImportFilter SAX filter class should be extended with the following schema version upgrade logic: "If schema version == 4.3 and element matches 'empty Constant element', then add Constant@dataType='string'".

If you have legacy PMML documents then you'd need to perform this change manually.

w23html commented 3 years ago

The org.jpmml.model.filters.ImportFilter SAX filter class should be extended with the following schema version upgrade logic: "If schema version == 4.3 and element matches 'empty Constant element', then add Constant@dataType='string'".

Thanks for the reply. For the logic can I expect that this will be added to the next release version of the library? I feel that it's not good to maintain a private branch for the code which could be extremely hard for the later version upgrade.

vruusmann commented 3 years ago

I feel that it's not good to maintain a private branch for the code ...

I was suggesting that you should modify your existing PMML schema version 4.3 documents (not my library or your application code).

Right now you have:

<Constant></Constant>

Change it to the following, and your rogue missing value problem will be solved:

<Constant dataType="string"></Constant>

It's quite surprising that your PMML documents don't have the Constant@dataType attribute available. Must be a very old file, because if I remember correctly, all JPMML family conversion libraries have been defining it for the past 3+ years (mostly for evaluation performance reasons, because inferring constant's data type from its string representation is rather expensive).

vruusmann commented 3 years ago

FWIW, here's the relevant JPMML-Evaluator code snippet: https://github.com/jpmml/jpmml-evaluator/blob/1.5.13/pmml-evaluator/src/main/java/org/jpmml/evaluator/ExpressionUtil.java#L202-L239

The comment on line 217 states clearly that you can evade the "" -> missing conversion by adding Constant@dataType="string".

w23html commented 3 years ago

<Constant dataType="string"></Constant>

Adding dataType for empty string constant does works to prevent interpreting as missing value during evaluation time. Thanks for suggestion.

It's quite surprising that your PMML documents don't have the Constant@dataType attribute available. Must be a very old file, because if I remember correctly, all JPMML family conversion libraries have been defining it for the past 3+ years (mostly for evaluation performance reasons, because inferring constant's data type from its string representation is rather expensive).

The model is actually one of our unit test model, and yes it's introduced in 2017 (3 years ago). However the version of the PMML file is 4.3 (first line: <PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">). For now I can't tell whether this document is directly generated from the library or manually created on purpose for edge-case test. Can I assume that all library generated PMML documents (e.g., jpmml-xgboost) are always having dataType attribute assigned to the Constant?

vruusmann commented 3 years ago

Restored this issue with its original location and title.

My previous analysis/explanation was incorrect - this is a real bug in the JPMML-Evaluator library (not a missing feature in the JPMML-Model library).