jpmml-evaluator requires terminal classification TreeModel Nodes to have score attributes even if they have ScoreDistributions

jpmml / jpmml-evaluator

Java Evaluator API for PMML

GNU Affero General Public License v3.0


Issue #7 (closed by rriegs 9 years ago)

rriegs commented 9 years ago

From the PMML spec (versions 2.0 and up):

When a Node is selected as the final Node and if this Node has no score attribute, then the highest recordCount in the ScoreDistribution determines which value is selected as the predicted class. If a Node contains a sequence of ScoreDistribution elements such that there is more than one entry where recordCount_i is an upper bound, then the first entry is selected.

Note: If a Node has an attribute score then this attribute value overrides the computation of a predicted value from the ScoreDistribution.

The above suggests that it should be OK for a terminal Node in a TreeModel to omit the score attribute so long as it contains at least one ScoreDistribution element and, further, that including a score attribute may in fact weaken the contribution of the ScoreDistributions (though it is of course always possible to add a score attribute that accurately reflects the behavior specified in the above).
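The selection rule quoted from the spec can be sketched in a few lines of Java. This is only an illustration of the rule as quoted above: the `Dist` and `LeafScoreSketch` classes are hypothetical stand-ins, not jpmml-model or jpmml-evaluator classes, and they do not reflect how jpmml-evaluator is implemented.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a ScoreDistribution element; NOT a jpmml-model class.
final class Dist {
    final String value;
    final double recordCount;

    Dist(String value, double recordCount) {
        this.value = value;
        this.recordCount = recordCount;
    }
}

final class LeafScoreSketch {

    // Returns the predicted class for a final Node under the quoted rule.
    static String predict(String score, List<Dist> distributions) {
        // An explicit score attribute overrides the ScoreDistribution elements.
        if (score != null) {
            return score;
        }
        Dist best = null;
        for (Dist d : distributions) {
            // Strict '>' keeps the first entry when recordCounts tie.
            if (best == null || d.recordCount > best.recordCount) {
                best = d;
            }
        }
        // null means the Node has neither a score nor any ScoreDistribution.
        return best == null ? null : best.value;
    }

    public static void main(String[] args) {
        List<Dist> dists = Arrays.asList(
                new Dist("no", 5), new Dist("yes", 5), new Dist("maybe", 2));
        System.out.println(predict(null, dists));  // tie on recordCount: first entry is chosen
        System.out.println(predict("foo", dists)); // score attribute overrides the distributions
    }
}
```

Under this reading, a score of "foo" wins even though no ScoreDistribution carries that value, which is the kind of override the Note in the spec describes.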

Note that, when using multipleModelMethod="average" for a series of TreeModels, jpmml-evaluator (as of 1.1.17) appears to completely ignore the score attributes (i.e. you can set them all to "foo"), instead relying entirely on the ScoreDistributions to make its prediction. It seems odd to be required to provide an attribute that isn't going to be used at all.

vruusmann commented 9 years ago

I just pushed commit bc19ebce634 that affects class TreeModelEvaluator.

The return type of method TreeModelEvaluator#evaluate(ModelEvaluationContext) depends on the function type:

Classification-type models return an instance of NodeClassificationMap. The method NodeClassificationMap#getResult() shows that a non-null score attribute takes priority over the value attribute of the highest-probability ScoreDistribution element.
Regression-type models return an instance of NodeScore. The method NodeScore#getResult() returns the value of the score attribute (after it has been converted to double data type and post-processed as specified by the Target element).

The score attribute is required. The PMML specification says the following: "it is not possible that the scoring process ends in a Node which does not have a score attribute".

Regarding the multipleModelMethod="average" issue - are you working with a classification- or regression-type tree ensemble? Is it possible to attach a sample PMML file?

rriegs commented 9 years ago

I've also left a comment over at https://groups.google.com/d/msg/jpmml/Du0QMIYyvko/BAq8n9rBgK4J concerning a separate but related question.

Please see attached model and test file at https://groups.google.com/d/msg/jpmml/Du0QMIYyvko/-bnXhyYblFUJ

I'm working with a classification-type tree ensemble. Regression-type tree ensembles do use score with multipleModelMethod="average" as appropriate.

I see the line you've quoted from the PMML spec and can only conclude that the spec is somewhat internally inconsistent. It does claim that the score attribute is required at final Nodes, but also that ScoreDistribution is used to choose the predictedValue if and only if the score attribute is not provided.

vruusmann commented 9 years ago

Thank you for the extra input.

As you probably noticed, classification-type ensemble models perform aggregation using the org.jpmml.evaluator.HasProbability interface. During aggregation, there is no distinction between "winner" and "loser" class labels, so the score attribute can be safely ignored.
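To illustrate why the score attribute drops out of the picture, here is a rough sketch of probability averaging across trees. It is not the jpmml-evaluator implementation and does not use the HasProbability interface; `AverageVoteSketch` and its methods are hypothetical names, and each tree is assumed to contribute a plain map of class label to probability.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of multipleModelMethod="average" for classification trees.
final class AverageVoteSketch {

    // Convenience builder for a two-class probability map (labels are made up).
    static Map<String, Double> tree(double yes, double no) {
        Map<String, Double> probs = new LinkedHashMap<>();
        probs.put("yes", yes);
        probs.put("no", no);
        return probs;
    }

    // Averages per-class probabilities over all trees.
    // The per-tree score attribute never enters the computation.
    static Map<String, Double> average(List<Map<String, Double>> perTree) {
        Map<String, Double> sums = new LinkedHashMap<>();
        for (Map<String, Double> probs : perTree) {
            for (Map.Entry<String, Double> e : probs.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        int n = perTree.size();
        sums.replaceAll((label, sum) -> sum / n);
        return sums;
    }

    // Picks the label with the highest averaged probability (first wins on ties).
    static String winner(Map<String, Double> averaged) {
        String best = null;
        double bestP = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : averaged.entrySet()) {
            if (e.getValue() > bestP) {
                best = e.getKey();
                bestP = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> trees = Arrays.asList(
                tree(0.75, 0.25), tree(0.25, 0.75), tree(0.25, 0.75));
        Map<String, Double> averaged = average(trees);
        System.out.println(averaged + " -> " + winner(averaged));
    }
}
```

Since only the probabilities are combined, setting every score attribute to "foo" would leave the ensemble result unchanged in this sketch, which matches the behaviour reported earlier in the thread.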

For classification-type tree models, it is safe to say that a PMML document is inconsistent if the value of the score attribute does not have a matching ScoreDistribution element. This inconsistency can be discovered using static analysis. In my opinion, it would be too wasteful to perform consistency checks on every NodeClassificationMap instance at runtime.

Static analyzers can be implemented using the Visitor design pattern. Simply create a subclass of org.jpmml.evaluator.visitors.FeatureInspector and apply it to your PMML class model object right after it is unmarshalled from the PMML document.
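The kind of check such an analyzer would perform can be sketched as a plain recursive tree walk. To avoid guessing at the FeatureInspector method signatures, this sketch deliberately does not use the JPMML Visitor API: `SketchNode` and `ScoreConsistencySketch` are hypothetical classes. It also assumes that a leaf with no ScoreDistribution elements at all is acceptable and only flags leaves whose score matches none of the distributions present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified tree node; NOT the org.dmg.pmml Node class.
final class SketchNode {
    final String id;
    final String score;               // may be null
    final List<String> distValues;    // value attributes of ScoreDistribution children
    final List<SketchNode> children;

    SketchNode(String id, String score, List<String> distValues, List<SketchNode> children) {
        this.id = id;
        this.score = score;
        this.distValues = distValues;
        this.children = children;
    }
}

final class ScoreConsistencySketch {

    // Collects ids of leaf nodes whose score has no matching ScoreDistribution value.
    static List<String> findInconsistent(SketchNode root) {
        List<String> bad = new ArrayList<>();
        visit(root, bad);
        return bad;
    }

    private static void visit(SketchNode node, List<String> bad) {
        boolean leaf = node.children.isEmpty();
        // Assumption: leaves without any ScoreDistribution are not flagged.
        if (leaf && node.score != null && !node.distValues.isEmpty()
                && !node.distValues.contains(node.score)) {
            bad.add(node.id);
        }
        for (SketchNode child : node.children) {
            visit(child, bad);
        }
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        List<SketchNode> noKids = Collections.emptyList();
        SketchNode ok = new SketchNode("1", "yes", Arrays.asList("yes", "no"), noKids);
        SketchNode bad = new SketchNode("2", "foo", Arrays.asList("yes", "no"), noKids);
        SketchNode root = new SketchNode("0", null, none, Arrays.asList(ok, bad));
        System.out.println(findInconsistent(root)); // ids of inconsistent leaves
    }
}
```

Running the check once after unmarshalling, rather than on every evaluation, is the trade-off the comment above argues for.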

As for the quality of the PMML specification, it is good and unambiguous enough 99% of the time. The remaining 1% represents various edge and corner cases that surface only when the spec is implemented in actual application code. I hope that the JPMML implementation of such cases agrees with proprietary implementations.

rriegs commented 9 years ago

Thank you for your responses, Villu. I am satisfied with this explanation and resolution. I will modify my code to handle the required score attributes.

vruusmann commented 9 years ago

Actually, class NodeClassificationMap needs some modification, because it does not implement the method org.jpmml.evaluator.CategoricalResultFeature#getCategoryValues() correctly when there are no ScoreDistribution elements available.

The correct behaviour would be to return a singleton set that contains the score attribute value.
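A minimal sketch of that fallback, assuming category values are plain strings. `CategoryValuesSketch` is a hypothetical class and this is not the code from the actual commit; it only illustrates the described behaviour.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the described getCategoryValues() behaviour.
final class CategoryValuesSketch {

    // Returns the known class labels for a Node.
    static Set<String> categoryValues(String score, List<String> distributionValues) {
        if (distributionValues.isEmpty()) {
            // No ScoreDistribution elements: the score is the only known label.
            return Collections.singleton(score);
        }
        // Otherwise report the ScoreDistribution values, in document order.
        return new LinkedHashSet<>(distributionValues);
    }

    public static void main(String[] args) {
        System.out.println(categoryValues("yes", Collections.<String>emptyList()));
        System.out.println(categoryValues("yes", Arrays.asList("yes", "no")));
    }
}
```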

This fix is in the works. I will push it to the repository later in the evening.

vruusmann commented 9 years ago

Commit d0b1dd8e04 makes sure that class NodeClassificationMap implements the interface HasProbability (and its superinterface CategoricalResultFeature) correctly in a situation where the Node element does not have any ScoreDistribution child elements.