jpmml / jpmml-evaluator

Java Evaluator API for PMML
GNU Affero General Public License v3.0

[Question]: Missing Values in Segment #54

Closed infiton closed 6 years ago

infiton commented 7 years ago

Suppose I have a Segmentation that contains Segments that wrap regression trees. If one of the regression trees returns a missing value (i.e. its missing value strategy is nullPrediction), how should the multiple model method handle the missing value?

It looks like the jpmml implementation will cast the null to 0 (https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/mining/MiningModelEvaluator.java#L639). This makes sense to me, but I can't find where it is outlined in the spec.

Specifically, the case of missing values does not seem to be discussed at http://dmg.org/pmml/v4-3/MultipleModels.html

vruusmann commented 7 years ago

> If one of the regression trees returns a missing value (i.e. its missing value strategy is nullPrediction) how should the multiple model method handle the missing values.

Aggregation functions cannot be applied to missing values.

If a member model returns a missing value, then the evaluation should be terminated abruptly by propagating this missing value to the top level.

> It looks like the jpmml implementation will cast the null to 0

No, it's impossible to cast a missing value to a valid value (such as 0).

The method SegmentResult#getTargetValue(DataType) would throw an org.jpmml.evaluator.TypeCheckException stating that "Expected <DataType>, but got null". This is a hugely confusing exception message for end users.

infiton commented 7 years ago

Verified that the exception is the outcome:

Exception in thread "main" org.jpmml.evaluator.TypeCheckException (at or around line 22): Expected DOUBLE, but got null
    at org.jpmml.evaluator.TypeUtil.toDouble(TypeUtil.java:670)
    at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:453)
    at org.jpmml.evaluator.mining.SegmentResult.getTargetValue(SegmentResult.java:82)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.aggregateValues(MiningModelEvaluator.java:639)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateRegression(MiningModelEvaluator.java:232)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:204)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:185)
    at org.jpmml.evaluator.EvaluationExample.execute(EvaluationExample.java:248)
    at org.jpmml.evaluator.Example.execute(Example.java:85)
    at org.jpmml.evaluator.EvaluationExample.main(EvaluationExample.java:149)

vruusmann commented 7 years ago

In the light of the above comment, perhaps segmentation models should use the following pattern:

for(SegmentResult segmentResult : segmentResults){
  // If the member model returned a missing value, then propagate it safely to the top level
  if(!segmentResult.hasTargetValue()){
    return null;
  }
  Double value = (Double)segmentResult.getTargetValue(DataType.DOUBLE);
  // Proceed as usual
}
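
The propagation pattern above can be sketched without any library dependencies. This is only an illustration, not JPMML-Evaluator code: the class `MissingAwareAverage` and its `average` method are hypothetical names, member predictions are modelled as nullable `Double` values, and returning a missing result for an empty ensemble is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;

// Library-free stand-in for the aggregation loop; not JPMML-Evaluator code.
class MissingAwareAverage {

    // Averages member predictions, or returns null as soon as one of them is missing.
    static Double average(List<Double> memberPredictions) {
        // Assumption of this sketch: an empty ensemble yields a missing result.
        if (memberPredictions.isEmpty()) {
            return null;
        }
        double sum = 0d;
        for (Double prediction : memberPredictions) {
            // A missing member prediction is propagated to the top level.
            if (prediction == null) {
                return null;
            }
            sum += prediction;
        }
        return sum / memberPredictions.size();
    }

    public static void main(String[] args) {
        System.out.println(average(Arrays.asList(1.0, 2.0, 3.0)));  // prints 2.0
        System.out.println(average(Arrays.asList(1.0, null, 3.0))); // prints null
    }
}
```

The caller then has to treat a null result as "no prediction" rather than as a numeric value such as 0.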

ronry commented 7 years ago

But what's the cause of the problem? And what can I do when this happens?

vruusmann commented 7 years ago

Asked the DMG.org to clarify the handling of missing segment scoring results: http://mantis.dmg.org/view.php?id=178

@ronry The exception "Expected DOUBLE, but got null" is typically caused by a missing input value. Have you prepared all your input fields correctly via org.jpmml.evaluator.InputField#prepare(Object)? If you did, and are still getting this exception, then you should "harden" your model schema. For example, you should define the MiningField@missingValueReplacement attribute for all input fields that can contain missing values.

ronry commented 7 years ago

Sorry, my exception is:

Exception in thread "main" org.jpmml.evaluator.TypeCheckException (at or around line 2): Expected org.jpmml.evaluator.HasProbability, but got null
    at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:485)
    at org.jpmml.evaluator.mining.SegmentResult.getTargetValue(SegmentResult.java:92)
    at org.jpmml.evaluator.mining.MiningModelUtil.aggregateProbabilities(MiningModelUtil.java:169)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:302)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:220)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:186)

And my code is:

public void process(Map<String,String> context) {
    Map<FieldName, Object> inputs = new LinkedHashMap<>();
    List<InputField> inputFields = pmmlEvaluator.getActiveFields();
    for (InputField inputField : inputFields) {
        FieldName inputFieldName = inputField.getName();
        final Object rawValue = context.get(inputFieldName.getValue());
        inputs.put(inputFieldName, inputField.prepare(rawValue));
    }
    pmmlEvaluator.evaluate(inputs);
}

Is it the same problem? I have checked the inputs, and all of them have values.

vruusmann commented 7 years ago

These two exceptions - Expected DOUBLE, but got null and Expected org.jpmml.evaluator.HasProbability, but got null - are the same thing. The former happens with regression-type ensemble models (member predictions are double values), whereas the latter happens with classification-type ensemble models (member predictions are probability distributions).

@ronry In your code, you should be invoking Evaluator#getInputFields(), not Evaluator#getActiveFields() (this is a breaking API change between JPMML-Evaluator 1.2.X and 1.3.X versions). The set of "active fields" is a subset of "input fields". It is possible that this code change fixes the problem for you. Otherwise, you should be working on "hardening" the model schema by defining missing value replacement values for all input fields.

nyug commented 6 years ago

Hello VR,

I face the same issue with my RF model. I generated the PMML using r2pmml and am using the "EvaluationExample" to score it, but I get an exception that states "Expected double value, got missing value (null)".

Trace:

Exception in thread "main" org.jpmml.evaluator.TypeCheckException (at or around line 87 of the PMML document): Expected double value, got missing value (null)
    at org.jpmml.evaluator.TypeUtil.toDouble(TypeUtil.java:687)
    at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:466)
    at org.jpmml.evaluator.mining.MiningModelUtil.aggregateValues(MiningModelUtil.java:75)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateRegression(MiningModelEvaluator.java:271)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:233)
    at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:205)
    at org.jpmml.evaluator.EvaluationExample.execute(EvaluationExample.java:298)
    at org.jpmml.evaluator.Example.execute(Example.java:86)
    at org.jpmml.evaluator.EvaluationExample.main(EvaluationExample.java:180)

But when I reviewed my input CSV file (as you suggested above) and removed all the NAs from it, the EvaluationExample ran correctly.

Does this exception occur because of missing input values, and hence missing values in segments of the model? When you say that we could avoid this error by "hardening" the schema, i.e. by adding the MiningField@missingValueReplacement attribute for all input fields (in my case, those containing NA), can you please give an example of where and how it can be added to the code? (I believe it would be in the EvaluationExample; could you please confirm?)

Thanks in advance.

vruusmann commented 6 years ago

The DMG has provided clarification about missing value handling (see the above link), and the suggested behaviour should become available in JPMML-Model/JPMML-Evaluator fairly soon.

@nyug In case of R's randomForest model type, the suggested behaviour would be expressed as Segmentation@missingResultTreatment="returnMissing", which means that when one of the member decision trees returns a missing prediction, then the ensemble as a whole should (immediately) return a missing prediction.

This behaviour would be consistent with R's behaviour - if you invoke predict.randomForest function with missing data, then you'd be getting missing predictions back as well.

> We could avoid this error by "hardening schema" -by adding MiningField@missingValueReplacement attribute for all input fields.

Correct. However, you would need to modify the contents of the PMML file, not tweak some JPMML-Evaluator configuration options. If it's a one-time activity, then it can be done in a text editor. If it's a more frequent activity, then it should be done programmatically. Of course, it would be nice if R2PMML/JPMML-R could auto-generate this attribute when appropriate.
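
As a rough illustration of the programmatic route, the sketch below uses only the JDK's DOM API to add a missingValueReplacement attribute to MiningField elements that do not define one. The class name `SchemaHardener`, the method `harden`, and the single replacement value applied to every field are assumptions made for brevity. A real schema would normally need a per-field replacement value, and this sketch touches every MiningField element in the document, including those of nested member models.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; not part of JPMML-Model or JPMML-Evaluator.
class SchemaHardener {

    // Sets missingValueReplacement on every active MiningField that lacks it.
    static Document harden(String pmml, String replacement) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
        NodeList miningFields = document.getElementsByTagName("MiningField");
        for (int i = 0; i < miningFields.getLength(); i++) {
            Element miningField = (Element) miningFields.item(i);
            // getAttribute returns "" when absent; PMML treats an absent usageType as "active".
            String usageType = miningField.getAttribute("usageType");
            boolean active = usageType.isEmpty() || usageType.equals("active");
            // Existing replacement values are left untouched.
            if (active && !miningField.hasAttribute("missingValueReplacement")) {
                miningField.setAttribute("missingValueReplacement", replacement);
            }
        }
        return document;
    }

    public static void main(String[] args) throws Exception {
        // Simplified fragment for demonstration; a real PMML file has more structure.
        String pmml = "<MiningSchema>"
            + "<MiningField name=\"y\" usageType=\"target\"/>"
            + "<MiningField name=\"x1\"/>"
            + "<MiningField name=\"x2\" missingValueReplacement=\"5\"/>"
            + "</MiningSchema>";
        NodeList miningFields = harden(pmml, "0").getElementsByTagName("MiningField");
        for (int i = 0; i < miningFields.getLength(); i++) {
            Element miningField = (Element) miningFields.item(i);
            System.out.println(miningField.getAttribute("name") + " -> "
                + miningField.getAttribute("missingValueReplacement"));
        }
    }
}
```

In this demonstration only x1 receives the new attribute: y is a target field and x2 already defines its own replacement value.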

@nyug If the above is critical for your use case, then please open a dedicated feature request at one of the R2PMML/JPMML-R projects. There are many ways how such "schema hardening" functionality could be implemented, and it would be nice to have them discussed/documented properly.

nyug commented 6 years ago

Thank you so much for the explanation. I definitely look forward to the DMG-suggested behaviour for these model types. In parallel, it would be extremely helpful to have this kind of "schema hardening" mechanism in place that would handle missing values and NAs, and prevent these tuples from even being evaluated in the first place (please correct me if my understanding is wrong). Sure, I can open a dedicated feature request in the r2pmml project. Many thanks for your support.

wzxiong commented 5 years ago

I get tons of type errors, and I can't trace the cause: expected: 'DOUBLE', got: '4.0'. error: org.jpmml.evaluator.InvalidResultException. I tried changing the input to 4, str(4), float(4) and decimal(4), but none of them passes.

vruusmann commented 5 years ago

> expected: 'DOUBLE', got: '4.0'. error: org.jpmml.evaluator.InvalidResultException

@wzxiong The exception type o.j.e.InvalidResultException is related to the data type of the target field (aka label).

Such type exceptions generally indicate an invalid/badly generated PMML document. Which software did you use to generate your PMML document - must be some non JPMML-family software?

wzxiong commented 5 years ago

> expected: 'DOUBLE', got: '4.0'. error: org.jpmml.evaluator.InvalidResultException
>
> @wzxiong The exception type o.j.e.InvalidResultException is related to the data type of the target field (aka label).
>
> Such type exceptions generally indicate an invalid/badly generated PMML document. Which software did you use to generate your PMML document - must be some non JPMML-family software?

I found the problem, and it is hard to solve. The generated PMML file contains a bound element called Interval; when an input value falls outside this range, the error shows up. However, after I remove all Interval elements, the output prediction seems to differ from the original one. How does LightGBM handle this? Does it just go to the nearest bound? For example, if it gets 100 and the upper bound is 96, does it just use 96?

wzxiong commented 5 years ago

> expected: 'DOUBLE', got: '4.0'. error: org.jpmml.evaluator.InvalidResultException
>
> @wzxiong The exception type o.j.e.InvalidResultException is related to the data type of the target field (aka label).
>
> Such type exceptions generally indicate an invalid/badly generated PMML document. Which software did you use to generate your PMML document - must be some non JPMML-family software?

Interval example: <Interval closure="feature_name" leftMargin="0.0" rightMargin="96.0"/>

vruusmann commented 5 years ago

> However, after I remove all internal, the output prediction seems to differ from original one.

@wzxiong Would you mind opening a new issue with the JPMML-LightGBM project, and providing a fully reproducible example there? Something where LightGBM and LightGBM-converter-to-PMML are giving different predictions?

Our last comments have no relation to the original issue - your MiningModel element is working as expected, the problem is somehow related to input field values (outside of the intended applicability domain of the model).
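
To make the "applicability domain" point concrete, here is a minimal sketch of checking an input value against an Interval-style range before evaluation. The class `IntervalCheck` and its methods are hypothetical and do not reflect the JPMML-Evaluator API. The enum mirrors the four closure values that PMML defines for Interval (openOpen, openClosed, closedOpen, closedClosed). The `clamp` method only shows the "go to the nearest bound" idea raised earlier in the thread; whether clamping reproduces LightGBM's own predictions is not established here.

```java
// Hypothetical pre-evaluation check; not JPMML-Evaluator code.
class IntervalCheck {

    // Mirrors the closure values defined for the PMML Interval element.
    enum Closure { OPEN_OPEN, OPEN_CLOSED, CLOSED_OPEN, CLOSED_CLOSED }

    // Returns true if the value lies inside the interval, honouring the closure type.
    static boolean contains(Closure closure, double leftMargin, double rightMargin, double value) {
        boolean leftClosed = (closure == Closure.CLOSED_OPEN || closure == Closure.CLOSED_CLOSED);
        boolean rightClosed = (closure == Closure.OPEN_CLOSED || closure == Closure.CLOSED_CLOSED);
        boolean leftOk = leftClosed ? (value >= leftMargin) : (value > leftMargin);
        boolean rightOk = rightClosed ? (value <= rightMargin) : (value < rightMargin);
        return leftOk && rightOk;
    }

    // Moves an out-of-range value to the nearest margin (assumes leftMargin <= rightMargin).
    static double clamp(double leftMargin, double rightMargin, double value) {
        return Math.max(leftMargin, Math.min(rightMargin, value));
    }

    public static void main(String[] args) {
        System.out.println(contains(Closure.CLOSED_CLOSED, 0.0, 96.0, 100.0)); // prints false
        System.out.println(clamp(0.0, 96.0, 100.0));                           // prints 96.0
    }
}
```

A caller could use such a check to log or reject out-of-range rows up front, instead of discovering them through an exception during scoring.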