Closed rriegs closed 9 years ago
I just pushed commit bc19ebce634 that affects class TreeModelEvaluator.
The return type of method TreeModelEvaluator#evaluate(ModelEvaluationContext) depends on the function type:
NodeClassificationMap. If you analyze the method NodeClassificationMap#getResult(), then it is easy to see that a non-null score attribute takes priority over the value attribute of the highest-probability ScoreDistribution element.
NodeScore. The method NodeScore#getResult() returns the value of the score attribute (after it has been converted to double data type and post-processed as specified by the Target element).
The score attribute is required. The PMML specification says the following: "it is not possible that the scoring process ends in a Node which does not have a score attribute".
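To make the priority rule concrete, here is a minimal sketch of it. It is not the actual NodeClassificationMap code: LeafSketch, its fields and result() are hypothetical names, and the ScoreDistribution elements are reduced to a map from value to probability.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a terminal Node; not a JPMML class.
class LeafSketch {

    final String score;                        // value of the score attribute, may be null
    final Map<String, Double> probabilities;   // ScoreDistribution value -> probability

    LeafSketch(String score, Map<String, Double> probabilities) {
        this.score = score;
        this.probabilities = probabilities;
    }

    // A non-null score wins; otherwise fall back to the
    // highest-probability ScoreDistribution value.
    String result() {
        if (score != null) {
            return score;
        }
        String best = null;
        double bestProbability = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : probabilities.entrySet()) {
            if (entry.getValue() > bestProbability) {
                best = entry.getKey();
                bestProbability = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> probabilities = new LinkedHashMap<>();
        probabilities.put("yes", 0.3);
        probabilities.put("no", 0.7);
        // The score attribute overrides the more probable "no".
        System.out.println(new LeafSketch("yes", probabilities).result()); // yes
        // Without a score, the most probable value is returned.
        System.out.println(new LeafSketch(null, probabilities).result());  // no
    }
}
```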
Regarding the multipleModelMethod="average" issue - are you working with a classification- or regression-type tree ensemble? Is it possible to attach a sample PMML file?
I've also left a comment over at https://groups.google.com/d/msg/jpmml/Du0QMIYyvko/BAq8n9rBgK4J concerning a separate but related question.
Please see attached model and test file at https://groups.google.com/d/msg/jpmml/Du0QMIYyvko/-bnXhyYblFUJ
I'm working with a classification-type tree ensemble. Regression-type tree ensembles do use score with multipleModelMethod="average" as appropriate.
I see the line you've quoted from the PMML spec and can only conclude that the spec is somewhat internally inconsistent. It does claim that the score attribute is required at final Nodes, but also that ScoreDistribution is used to choose the predictedValue if and only if the score attribute is not provided.
Thank you for the extra input.
As you probably noticed, classification-type ensemble models perform aggregation using the org.jpmml.evaluator.HasProbability interface. During aggregation, there is no distinction between "winner" and "loser" class labels, so the score attribute can be safely ignored.
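A rough sketch of why the score attribute drops out of the aggregation: only the per-class probabilities of each member tree enter the average. This is an illustration only, not the JPMML implementation; AverageSketch, average() and winner() are hypothetical names, and each tree's result is reduced to a map from class label to probability.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are JPMML classes.
class AverageSketch {

    // Average the per-class probabilities reported by each member tree.
    // A label that a tree does not report contributes 0 for that tree.
    static Map<String, Double> average(List<Map<String, Double>> perTree) {
        Map<String, Double> sums = new LinkedHashMap<>();
        for (Map<String, Double> probabilities : perTree) {
            for (Map.Entry<String, Double> entry : probabilities.entrySet()) {
                sums.merge(entry.getKey(), entry.getValue(), Double::sum);
            }
        }
        Map<String, Double> averages = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : sums.entrySet()) {
            averages.put(entry.getKey(), entry.getValue() / perTree.size());
        }
        return averages;
    }

    // The ensemble prediction is the label with the highest averaged probability.
    static String winner(Map<String, Double> averages) {
        String best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : averages.entrySet()) {
            if (entry.getValue() > bestValue) {
                best = entry.getKey();
                bestValue = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> first = new LinkedHashMap<>();
        first.put("yes", 0.75);
        first.put("no", 0.25);
        Map<String, Double> second = new LinkedHashMap<>();
        second.put("yes", 0.5);
        second.put("no", 0.5);

        List<Map<String, Double>> perTree = new ArrayList<>();
        perTree.add(first);
        perTree.add(second);

        Map<String, Double> averages = average(perTree);
        System.out.println(averages);         // {yes=0.625, no=0.375}
        System.out.println(winner(averages)); // yes
    }
}
```

Nothing in this sketch ever reads a score value, which matches the observation that the score attributes can be set to anything without changing the averaged prediction.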
When it comes to classification-type tree models, it is safe to say that a PMML document is inconsistent if the value of the score attribute does not have a matching ScoreDistribution element. This inconsistency can be discovered using static analysis. In my opinion, it would be too wasteful to perform consistency checks on every NodeClassificationMap instance at runtime.
Static analyzers can be implemented using the Visitor design pattern. Simply create a subclass of org.jpmml.evaluator.visitors.FeatureInspector and apply it to your PMML class model object right after it is unmarshalled from the PMML document.
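The check itself could look roughly like the sketch below. It deliberately avoids the JPMML class model: NodeSketch and ScoreConsistencyCheck are hypothetical names, and plain recursion stands in for the traversal that a Visitor would provide. It also assumes that a Node without any ScoreDistribution elements is not flagged, since there is nothing to compare the score against.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a TreeModel Node; not a JPMML class.
class NodeSketch {

    final String id;
    final String score;                                         // may be null
    final List<String> distributionValues = new ArrayList<>();  // ScoreDistribution value attributes
    final List<NodeSketch> children = new ArrayList<>();

    NodeSketch(String id, String score) {
        this.id = id;
        this.score = score;
    }
}

// Hypothetical one-off static check, run once after the model is loaded.
class ScoreConsistencyCheck {

    // Returns the ids of Nodes whose score has no matching ScoreDistribution value.
    static List<String> findInconsistentNodes(NodeSketch root) {
        List<String> ids = new ArrayList<>();
        collect(root, ids);
        return ids;
    }

    private static void collect(NodeSketch node, List<String> ids) {
        boolean hasDistributions = !node.distributionValues.isEmpty();
        if (node.score != null && hasDistributions
                && !node.distributionValues.contains(node.score)) {
            ids.add(node.id);
        }
        for (NodeSketch child : node.children) {
            collect(child, ids);
        }
    }

    public static void main(String[] args) {
        NodeSketch root = new NodeSketch("root", null);

        NodeSketch consistent = new NodeSketch("A", "yes");
        consistent.distributionValues.add("yes");
        consistent.distributionValues.add("no");

        NodeSketch inconsistent = new NodeSketch("B", "foo");
        inconsistent.distributionValues.add("yes");
        inconsistent.distributionValues.add("no");

        NodeSketch scoreOnly = new NodeSketch("C", "no"); // no ScoreDistribution elements

        root.children.add(consistent);
        root.children.add(inconsistent);
        root.children.add(scoreOnly);

        System.out.println(findInconsistentNodes(root)); // [B]
    }
}
```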
As for the quality of the PMML specification, it is good/unambiguous enough 99% of the time. The remaining 1% represents various edge and corner cases that surface only when the spec is implemented in actual application code. I hope that the JPMML implementation of such cases agrees with proprietary implementations.
Thank you for your responses, Villu. I am satisfied with this explanation and resolution. I will modify my code to handle the required score attributes.
Actually, class NodeClassificationMap needs some modification, because it does not implement the method org.jpmml.evaluator.CategoricalResultFeature#getCategoryValues() correctly when there are no ScoreDistribution elements available.
The correct behaviour would be to return a singleton set that contains the score attribute value.
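In other words, something along these lines. This is only a sketch of the intended behaviour, not the code of the fix: CategoryValuesSketch and categoryValues() are hypothetical names, and the ScoreDistribution elements are again reduced to a map from value to probability.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the actual NodeClassificationMap code.
class CategoryValuesSketch {

    // With no ScoreDistribution elements, the score is the only known category.
    static Set<String> categoryValues(String score, Map<String, Double> probabilities) {
        if (probabilities.isEmpty()) {
            return Collections.singleton(score);
        }
        return probabilities.keySet();
    }

    public static void main(String[] args) {
        Map<String, Double> none = new LinkedHashMap<>();
        System.out.println(categoryValues("yes", none)); // [yes]

        Map<String, Double> some = new LinkedHashMap<>();
        some.put("yes", 0.3);
        some.put("no", 0.7);
        System.out.println(categoryValues("yes", some)); // [yes, no]
    }
}
```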
This fix is in the works. I will push it to the repository later in the evening.
Commit d0b1dd8e04 makes sure that class NodeClassificationMap implements the interface HasProbability (and its superinterface CategoricalResultFeature) correctly in a situation where the Node element does not have any ScoreDistribution child elements.
From the PMML spec (versions 2.0 and up):
The above suggests that it should be OK for a terminal Node in a TreeModel to omit the score attribute so long as it contains at least one ScoreDistribution element and, further, that including a score attribute may in fact weaken the contribution of the ScoreDistributions (though it is of course always possible to add a score attribute that accurately reflects the behavior specified in the above).
Note that, when using multipleModelMethod="average" for a series of TreeModels, jpmml-evaluator (as of 1.1.17) appears to completely ignore the score attributes (i.e. you can set them all to "foo"), instead relying entirely on the ScoreDistributions to make its prediction. It seems odd to be required to provide an attribute that isn't going to be used at all.