jpmml / jpmml-evaluator

Java Evaluator API for PMML
GNU Affero General Public License v3.0

Configuration option to disable/relax sanity checks #89

Closed vruusmann closed 6 years ago

vruusmann commented 6 years ago

The JPMML-Evaluator library keeps introducing more sanity checks. As a result, newer versions of JPMML-Evaluator may refuse to score PMML documents (by throwing an InvalidFeatureException) that were gladly accepted/tolerated by older versions.

For example, JPMML-Evaluator version 1.3.7 (and newer) requires that, for classification-type tree models, the values of the ScoreDistribution@probability attribute sum exactly to 1.0 for each Node element.

The following Node element is considered invalid, because its probabilities sum to 0.999999999999999, not 1.0:

<Node score="good">
    <SimplePredicate field="Age" operator="lessOrEqual" value="31.5"/>
    <ScoreDistribution value="bad" probability="0.254716981132075"/>
    <ScoreDistribution value="good" probability="0.745283018867924"/>
</Node>

The "offending" sanity check: https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/tree/TreeModelEvaluator.java#L361-L364

vruusmann commented 6 years ago

Two possible solutions:

  1. Modify the JPMML-Evaluator library so that 0.999999999999999 is considered "close enough" to 1.0.
  2. Modify the PMML document, and re-compute/re-normalize probabilities so that they would sum exactly to 1.0. This could be easily implemented using the Visitor API from the JPMML-Model library.

Of course, the real source of the problem is bad PMML producer software.
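Both options above can be sketched in plain Java. This is a minimal illustration, not the library's actual code: the class name `ProbabilityFixes`, the `maxUlps` parameter and both method names are hypothetical, and wiring `renormalize` into a JPMML-Model visitor is left out. The second method assumes that the consumer sums the probabilities left to right using `double` arithmetic; under that assumption, setting the last value to `1 - (sum of the others)` makes the sum come out as exactly 1.0.

```java
public class ProbabilityFixes {

    // Option 1 (hypothetical): accept a sum that lies within
    // maxUlps units of Math.ulp(1d) from 1.0, instead of demanding equality.
    public static boolean sumsToOne(double[] probabilities, int maxUlps){
        double sum = 0d;
        for(double probability : probabilities){
            sum += probability;
        }
        return Math.abs(1d - sum) <= maxUlps * Math.ulp(1d);
    }

    // Option 2 (hypothetical): rescale the values by their total, then
    // recompute the last value as the remainder, so that a left-to-right
    // double sum of the result is 1.0.
    public static double[] renormalize(double[] probabilities){
        if(probabilities.length == 0){
            throw new IllegalArgumentException("No probabilities");
        }

        double total = 0d;
        for(double probability : probabilities){
            total += probability;
        }
        if(!(total > 0d)){
            throw new IllegalArgumentException("Non-positive total: " + total);
        }

        double[] result = new double[probabilities.length];
        double partialSum = 0d;
        for(int i = 0; i < result.length - 1; i++){
            result[i] = probabilities[i] / total;
            partialSum += result[i];
        }
        result[result.length - 1] = 1d - partialSum;

        return result;
    }

    public static void main(String[] args){
        // The two values from the Node element in this issue
        double[] probabilities = {0.254716981132075, 0.745283018867924};

        System.out.println("strict: " + sumsToOne(probabilities, 0));   // false
        System.out.println("4 ULPs: " + sumsToOne(probabilities, 4));   // true

        double[] fixed = renormalize(probabilities);
        System.out.println("renormalized sum: " + (fixed[0] + fixed[1]));
    }
}
```

A tolerance of 4 ULPs is just enough for this particular document; any real threshold would be a judgment call about how sloppy a producer is allowed to be.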

vruusmann commented 6 years ago

Here's a small command-line application to compute the size of "delta" in terms of ULPs:

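public class Main {

    // Usage: java Main <left> <right>
    public static void main(String[] args){
        if(args.length != 2){
            System.err.println("Usage: java Main <left> <right>");
            System.exit(1);
        }

        double left = Double.parseDouble(args[0]);
        double right = Double.parseDouble(args[1]);

        double sum = left + right;
        System.out.println("sum: " + sum);

        // Distance from 1.0, expressed in multiples of Math.ulp(1d)
        double delta = 1d - sum;
        System.out.println("delta: " + delta);
        System.out.println("delta in ULPs: " + (delta / Math.ulp(1d)));
    }
}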

The first application run shows that there's currently a delta of 4 ULPs:

$ java Main 0.254716981132075 0.745283018867924
sum: 0.9999999999999991
delta: 8.881784197001252E-16
delta in ULPs: 4.0

The second application run shows that if these probability values were represented with one extra decimal place (e.g. by appending 5 to both number literals), then the delta would be 0 ULPs (and there would be no scoring problem):

$ java Main 0.2547169811320755 0.7452830188679245
sum: 1.0
delta: 0.0
delta in ULPs: 0.0

In conclusion, the PMML producer software has been emitting "imprecise" probability values: 15 decimal places are one too few for these two probabilities to sum to exactly 1.0.
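The effect can be reproduced from the producer's side. The sketch below starts from the 16-decimal literals of the second run and cuts them down to 15 decimal places; the result is exactly the pair of literals from the first run. That the producer truncated (rather than rounded) is a guess that happens to match the digits shown here, and the class name `TruncationDemo` and its `truncate` helper are made up for illustration. The last line shows the safer alternative for a producer: `Double.toString` emits enough digits for the value to survive a round trip through `Double.parseDouble`.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class TruncationDemo {

    // Cuts a decimal literal down to the given number of decimal places
    // (no rounding), then parses the shortened literal as a double.
    public static double truncate(String literal, int decimalPlaces){
        BigDecimal truncated = new BigDecimal(literal).setScale(decimalPlaces, RoundingMode.DOWN);
        return Double.parseDouble(truncated.toPlainString());
    }

    public static void main(String[] args){
        String left = "0.2547169811320755";
        String right = "0.7452830188679245";

        double fullSum = Double.parseDouble(left) + Double.parseDouble(right);
        double truncatedSum = truncate(left, 15) + truncate(right, 15);

        System.out.println("16 decimal places: " + fullSum);
        System.out.println("15 decimal places: " + truncatedSum);

        // Writing a double with Double.toString and reading it back is lossless
        double value = Double.parseDouble(left);
        System.out.println("round trip ok: " + (Double.parseDouble(Double.toString(value)) == value));
    }
}
```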