jpmml / r2pmml

R library for converting R models to PMML
GNU Affero General Public License v3.0

ranger converter sets recordCount instead of probability #59

Closed chmielot closed 4 years ago

chmielot commented 4 years ago

In RangerConverter.java, in encodeProbabilityForest, a new ScoreDistribution object is created. The probability is passed as the second parameter to the constructor. However, ScoreDistribution expects the recordCount as the second parameter. This leads to an invalid PMML file, where the probability value is in the recordCount attribute.

  1. Use setProbability on the newly created object to set the probability
  2. Maybe fix the typo scoreDisctibution => scoreDistribution

Thank you!

vruusmann commented 4 years ago

It appears to be the case that probabilistic ranger classifiers report "normalized record counts" rather than "absolute (aka raw) record counts".

A "normalized record count" is functionally equivalent to a probability. Perhaps they should be de-normalized to "absolute record counts" to reduce the confusion, but we need to know the total number of (training-) data records at each leaf node first. I can't recall if this information is available in the ranger object or not. Perhaps you can investigate it further.

However, ScoreDistribution expects the recordCount as the second parameter.

The Java local variable could be renamed from (max)probability to (max)recordCount. However, it holds the correct value, and is passed as a correct argument to the ScoreDistribution constructor.

The underlying JPMML-R library contains two integration tests for probabilistic ranger models:

  1. RangerProbAudit: https://github.com/jpmml/jpmml-r/blob/master/src/test/R/ranger.R#L24-L34
  2. RangerProbIris: https://github.com/jpmml/jpmml-r/blob/master/src/test/R/ranger.R#L75-L85

They both pass cleanly, which is sufficient proof that they are at least functionally correct (i.e. (J)PMML is able to reproduce ranger probabilities).

This leads to an invalid PMML file, where the probability value is in the recordCount attribute.

What software do you use for PMML validation?

According to PMML XSD, the ScoreDistribution@recordCount attribute is required, whereas the ScoreDistribution@probability attribute is optional.

If one only has "normalized record counts", then it seems justified to assign them to the former attribute (and leave the latter attribute missing).

chmielot commented 4 years ago

However, it holds the correct value, and is passed as a correct argument to the ScoreDistribution constructor.

I don't agree with you. When you look at the PMML 4.3 specification of the Node element it says:

recordCount: The value of recordCount in a Node serves as a base size for recordCount values in ScoreDistribution elements. These numbers do not necessarily determine the number of records which have been used to build/train the model. Nevertheless, they allow to determine the relative size of given values in a ScoreDistribution as well as the relative size of a Node when compared to the parent Node.

IMHO, the first sentence alone is reason enough to assume that ScoreDistribution/@recordCount cannot be a "normalized record count".

The explanation for recordCount on ScoreDistribution also doesn't leave any room for interpretation as to whether this is absolute or relative:

recordCount: This attribute of ScoreDistribution is the size (in number of records) associated with the value attribute.

It is supposed to be the "number of records", it doesn't mention "relative" anywhere.

What software do you use for PMML validation?

In my case I am writing a general parser for PMML; I have no experience with ranger or R itself, so I'm looking at it from a pure programming perspective. Putting probabilities (aka "normalized record counts") into the recordCount attribute makes it ambiguous: I'm not able to differentiate what kind of values I am dealing with. Normally I'd take the ScoreDistribution/@recordCount and divide it by Node/@recordCount to get the probabilities, if the probability attribute is not set on the ScoreDistribution.
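The division described in the previous paragraph can be sketched as follows (a hypothetical Python helper using ElementTree; the XML fragment and the function name are illustrative, not part of any library):

```python
import xml.etree.ElementTree as ET

# Hypothetical Node fragment with absolute record counts, as most
# PMML producers (other than r2pmml's ranger converter) emit them.
NODE_XML = """
<Node recordCount="10">
  <ScoreDistribution value="yes" recordCount="7"/>
  <ScoreDistribution value="no" recordCount="3"/>
</Node>
"""

def class_probabilities(node):
    """Prefer the optional probability attribute; otherwise divide
    ScoreDistribution@recordCount by the Node@recordCount base size."""
    base = float(node.get("recordCount"))
    result = {}
    for sd in node.findall("ScoreDistribution"):
        prob = sd.get("probability")
        result[sd.get("value")] = (float(prob) if prob is not None
                                   else float(sd.get("recordCount")) / base)
    return result

node = ET.fromstring(NODE_XML)
print(class_probabilities(node))  # {'yes': 0.7, 'no': 0.3}
```

This is exactly the convention that breaks down when recordCount holds normalized values and Node@recordCount is absent, which is what the rest of the thread argues about.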

IMHO a solution is to call setProbability with the "normalized record count" and (as a bonus) also add the absolute recordCount with the correct value. As you have pointed out already, recordCount is a mandatory attribute. Ideally, the recordCount for the Node would also be set, for the sake of completeness.

I wish I was able to help to find out if and where the information is available in the ranger object, but I don't even know how and where to start, I'm afraid. I'm not working with these tools myself, just looking at the generated outputs.

vruusmann commented 4 years ago

When you look at the PMML 4.3 specification of the Node element it says: "they allow to determine the relative size of given values in a ScoreDistribution"

Looks like normalized record counts are allowed after all?

It is supposed to be the "number of records", it doesn't mention "relative" anywhere.

Due to the very relaxed nature of PMML, everything that's not explicitly prohibited, is allowed.

"Normalized record counts" are a proper subset of "record counts"; there's nothing in the PMML specification saying that ScoreDistribution@recordCount must hold an integer value. In fact, its XSD type is NUMBER:

<xs:element name="ScoreDistribution">
  <xs:attribute name="recordCount" type="NUMBER" use="required"/>
  <xs:attribute name="probability" type="PROB-NUMBER"/>
</xs:element>

IMHO a solution is to call setProbability with the "normalized record count" (bonus) and add the absolute recordCount with the correct value.

It's impossible to set the absolute record count, if the size of the training set (as well as the sizes of leaf nodes) is unknown.

I'm not able to differentiate what kind of values I am dealing with

See above comment - you're trying to force an arbitrary constraint yourself (integer record counts).

TLDR: Looks like you should ignore ranger-based PMML documents. They are 100% syntactically/semantically valid PMML, but if your application thinks otherwise, then feel free to disagree.

chmielot commented 4 years ago

Hi Villu,

it's unfortunate that the specification is not specific enough on this topic, leaving room for different interpretations.

My target is to get the probability of a ScoreDistribution in an automated and straightforward way. The bottom line of this discussion is that I could achieve this easily if the probability attribute were set; I wouldn't need to worry about the recordCount anymore. And because the probability is available as the "normalized record count", could you kindly add the probability attribute to the ScoreDistribution with the same value?

I think that's 100% covered by the spec and by what you have written so far. Are you ok with this?

vruusmann commented 4 years ago

My target is to get the probability of a ScoreDistribution in an automated and straightforward way.

In pseudocode:

if(scoreDistribution has probabilities){
  // Use provided probabilities
} else {
  // Calculate probabilities from record counts
}

could you kindly add the probability attribute to the ScoreDistribution with the same value?

Duplicating the same information makes no sense. Even worse, it adds to memory consumption, which is already rather high for classification-type random forest models.

You can always post-process r2pmml generated PMML documents, and add duplicate attribute values if your application can't figure it out itself on-the-fly.

My point is that the record count is mandatory, whereas probability is optional. As they are identical in the current case, I'm only filling in the required field.

chmielot commented 4 years ago

Hi, that's really unfortunate. May I ask for your advice on how to determine the probabilities then? For all other PMML models I have worked with so far I have an absolute record count in Node and ScoreDistribution and I divide them to get the probability. I am basically following what is written in the spec:

The value of recordCount in a Node serves as a base size for recordCount values in ScoreDistribution elements.

As I cannot be sure whether the PMML is coming from ranger or a different library, should I rely on the recordCount being missing on the Node and take the recordCount of the ScoreDistribution as the probability?

In pseudo-code:

if(scoreDistribution has probabilities){
  probability = ScoreDistribution.probability;
} else if (Node has recordCount) {
  probability = ScoreDistribution.recordCount / Node.recordCount;
} else {
  probability = ScoreDistribution.recordCount;
}
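The pseudo-code above can be turned into a small sketch (hypothetical helper, assuming ElementTree parsing of a PMML fragment):

```python
import xml.etree.ElementTree as ET

def probability(node, sd):
    """Three-branch fallback from the pseudo-code above (sketch)."""
    if sd.get("probability") is not None:
        return float(sd.get("probability"))
    if node.get("recordCount") is not None:
        return float(sd.get("recordCount")) / float(node.get("recordCount"))
    # No base size: assume normalized record counts, i.e. the
    # recordCount value already is the probability.
    return float(sd.get("recordCount"))

# A ranger-style Node: no recordCount attribute, normalized child counts.
node = ET.fromstring(
    '<Node>'
    '<ScoreDistribution value="A" recordCount="0.25"/>'
    '<ScoreDistribution value="B" recordCount="0.75"/>'
    '</Node>')
print([probability(node, sd) for sd in node.findall("ScoreDistribution")])
# [0.25, 0.75]
```
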
vruusmann commented 4 years ago

I have an absolute record count in Node

That's a wrong assumption? The Node@recordCount attribute value uses the same scale as its child ScoreDistribution@recordCount attribute values.

and ScoreDistribution and I divide them to get the probability.

If the Node@recordCount attribute value is missing, then you can simply calculate it by summing the child ScoreDistribution@recordCount attribute values.

In case of normalized record counts, the ScoreDistribution@recordCount attribute values of a specific node should sum exactly to 1.0.
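The derivation suggested above can be sketched like this (hypothetical helper, assuming ElementTree; with normalized counts the derived base size comes out as 1.0):

```python
import xml.etree.ElementTree as ET

def node_record_count(node):
    """Use Node@recordCount if declared; otherwise derive it by summing
    the child ScoreDistribution@recordCount values, as suggested above."""
    explicit = node.get("recordCount")
    if explicit is not None:
        return float(explicit)
    return sum(float(sd.get("recordCount"))
               for sd in node.findall("ScoreDistribution"))

# Normalized (ranger-style) counts should sum to exactly 1.0:
normalized = ET.fromstring(
    '<Node>'
    '<ScoreDistribution value="A" recordCount="0.25"/>'
    '<ScoreDistribution value="B" recordCount="0.75"/>'
    '</Node>')
print(node_record_count(normalized))  # 1.0
```
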

chmielot commented 4 years ago

I have an absolute record count in Node

That's a wrong assumption? The Node@recordCount attribute value uses the same scale as its child ScoreDistribution@recordCount attribute values.

In fact, my assumption is not important. No matter if it's absolute or relative, if the scale is the same and if the Node@recordCount is set, then ScoreDistribution@recordCount divided by Node@recordCount will always give me the probability I'm looking for.

For what it's worth, I've never seen a PMML in the wild where the record counts were not absolute. The samples in the PMML spec also only show integers. That's where my surprise and confusion about these normalized record counts comes from. From my perspective ranger PMMLs are an exception to the rule.

If the Node@recordCount attribute value is missing, then you can simply calculate it by summing the recordCounts of child ScoreDistribution@recordCount attribute values.

Yes, that's clear.

In case of normalized record counts, the ScoreDistribution@recordCount attribute values of a specific node should sum exactly to 1.0.

In an ideal world yes, in the real world summing up float numbers might give you slightly different results due to rounding (no offense, for me this is just another reason why relative record counts have a negative connotation).

Thank you for participating so comprehensively in this discussion. My conclusion is that there is only one bulletproof way to find out whether I'm dealing with absolute or relative record counts when Node@recordCount is missing and I therefore can't calculate the probability ad hoc:

If the first ScoreDistribution@recordCount I am parsing is <1, I'm seeing relative record counts and can take this as a probability immediately. If it's >1, I have absolute recordCounts. But if ScoreDistribution@recordCount is exactly 1.0, I have a little more parsing effort to first see what the other recordCounts are.

Thanks again!

vruusmann commented 4 years ago

My conclusion is that there is only one bulletproof way to find out if I'm dealing with absolute or relative record counts when the Node@recordCount is missing.

Your algorithm doesn't seem bullet-proof at all.

Consider this workflow instead:

if(first scoreDistribution has both recordCount and probability){
  // Switch to "probabilistic mode". Ignore record counts, and assume that all ScoreDistribution elements will have the probability attribute set. 
} else if(first scoreDistribution has only recordCount){
  // Switch to "record count mode". Calculate probabilities locally using record counts. If the parent Node element has a recordCount attribute, use that as the denominator. Otherwise, calculate the total record count yourself. 
} else {
  // Raise exception - invalid PMML, because ScoreDistribution@recordCount attribute is missing.
}
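The workflow above can be sketched as a runnable helper (hypothetical names, assuming ElementTree; it dispatches on which attributes the first ScoreDistribution declares, not on their values):

```python
import xml.etree.ElementTree as ET

def detection_mode(node):
    """Declaration-based dispatch from the workflow above: inspect which
    attributes the first ScoreDistribution declares, not their values."""
    first = node.find("ScoreDistribution")
    if first is None or first.get("recordCount") is None:
        raise ValueError("invalid PMML: ScoreDistribution@recordCount missing")
    if first.get("probability") is not None:
        return "probabilistic"
    return "record-count"

with_prob = ET.fromstring(
    '<Node><ScoreDistribution value="A" recordCount="2" probability="0.4"/></Node>')
counts_only = ET.fromstring(
    '<Node><ScoreDistribution value="A" recordCount="2"/></Node>')
print(detection_mode(with_prob), detection_mode(counts_only))
# probabilistic record-count
```
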

in the real world summing up float numbers might give you slightly different results due to rounding

The PMML specification states that probabilities must sum exactly to 1.0 (in the chosen math context - either 32bit or 64bit). If they sum to something else (such as 0.999 or 1.001) then it's an invalid PMML.

chmielot commented 4 years ago

// [...] and assume that all ScoreDistribution elements will have the probability attribute set.

I believe I can safely assume that, or otherwise reject the document as invalid PMML. The PMML spec says:

If defined for any class label, it must be defined for all

The PMML specification states that probabilities must sum exactly to 1.0

You'll find this statement only for the probability attribute (which, btw, is of type PROB-NUMBER and not NUMBER). So I must not assume that they add up to exactly 1.0 for the recordCount, and I cannot discard the PMML if they don't. You said it yourself: "everything that's not explicitly prohibited, is allowed". To be fair, you have to apply this statement here as well. And are you actually validating and guaranteeing that for the ranger data?

Your algorithm doesn't seem bullet-proof at all.

Can you please elaborate on what can go wrong in my algorithm? For the above reason I cannot use yours and I don't see the loophole in mine.

vruusmann commented 4 years ago

Can you please elaborate on what can go wrong in my algorithm?

Your algorithm fails if there's a ScoreDistribution@recordCount="0" (ie. when a leaf node does not contain any samples for a particular class).
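A minimal illustration of that failure mode (hypothetical node, assuming ElementTree): with the value-based heuristic from the earlier comment, absolute counts whose first class has zero records are misread as normalized, so the remaining counts would wrongly be taken as probabilities.

```python
import xml.etree.ElementTree as ET

def looks_normalized(node):
    """The value-based heuristic discussed above: treat the counts as
    normalized when the first ScoreDistribution@recordCount is below 1."""
    first = node.find("ScoreDistribution")
    return float(first.get("recordCount")) < 1.0

# Absolute counts without Node@recordCount, but the first class
# happens to have zero records:
node = ET.fromstring(
    '<Node>'
    '<ScoreDistribution value="A" recordCount="0"/>'
    '<ScoreDistribution value="B" recordCount="5"/>'
    '</Node>')
print(looks_normalized(node))  # True, i.e. absolute counts misread as normalized
```
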

vruusmann commented 4 years ago

My algorithm checks for the presence/absence of attribute DECLARATIONS. Your algorithm checks attribute VALUES.

Clearly, my algorithm is more robust than yours.

chmielot commented 4 years ago

Your algorithm fails if there's a ScoreDistribution@recordCount="0" (ie. when a leaf node does not contain any samples for a particular class).

I assumed it's obvious that the probability of this leaf node is 0. Still not seeing the loophole...

My algorithm checks for the presence/absence of attribute DECLARATIONS. Your algorithm checks attribute VALUES.

Yes, because I'm not validating the PMML. The only thing I care about is getting the probabilities in a clear and concise way: in a way that is agnostic to where the PMML comes from, in a way that always works, no matter how people interpret ambiguities in the PMML spec. I'm being opportunistic, not idealistic.

Clearly, my algorithm is more robust than yours.

Clearly, you refuse to apply your own line of thought to your algorithm. As I pointed out in my earlier post, it is unambiguously based on false assumptions.

vruusmann commented 4 years ago

I assumed it's obvious that the probability of this leaf node is 0

Another assumption you're making.

Yes, because I'm not validating the PMML.

But you were making claims about the validity of r2pmml generated PMML documents.

you refuse to apply your own line of thought to your algorithm.

My algorithm works in practice: https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/tree/TreeModelEvaluator.java#L357-L451

I'd say that it's you who refuses to change an incorrect line of thought. Just relax, and assume that ScoreDistribution@recordCount can hold non-integer values.

For example, the following two declarations are functionally identical (i.e. they lead to the same predicted probability distribution).

First:

<Node>
  <ScoreDistribution value="A" recordCount="2"/>
  <ScoreDistribution value="B" recordCount="5"/>
</Node>

Second:

<Node>
  <ScoreDistribution value="A" recordCount="0.2"/>
  <ScoreDistribution value="B" recordCount="0.5"/>
</Node>

Please explain what makes the first example "more valid" than the second example?
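Normalizing both examples confirms the claim (a sketch, assuming ElementTree; the comparison uses a float tolerance, since 2/7 and 0.2/0.7 can differ in the last bit of a double):

```python
import math
import xml.etree.ElementTree as ET

def distribution(xml):
    """Turn ScoreDistribution@recordCount values into a probability
    distribution by dividing each by their sum."""
    node = ET.fromstring(xml)
    counts = {sd.get("value"): float(sd.get("recordCount"))
              for sd in node.findall("ScoreDistribution")}
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

first = ('<Node>'
         '<ScoreDistribution value="A" recordCount="2"/>'
         '<ScoreDistribution value="B" recordCount="5"/>'
         '</Node>')
second = ('<Node>'
          '<ScoreDistribution value="A" recordCount="0.2"/>'
          '<ScoreDistribution value="B" recordCount="0.5"/>'
          '</Node>')
d1, d2 = distribution(first), distribution(second)
print(all(math.isclose(d1[v], d2[v]) for v in d1))  # True
```
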

chmielot commented 4 years ago

Even though your second example doesn't add up to 1.0 (which I assume is unintended, but maybe that's another assumption I'm making :wink:), it is technically perfectly valid. However, you're asking the validity question again, so you didn't get my point.

Let me summarize: I created this issue stating that the library produces invalid PMML. It turns out that absolute values for the record counts are not available from ranger, and that recordCount is a mandatory attribute. So you can either decide that ranger models cannot be converted to PMML due to missing mandatory values, or introduce "relative record counts", exploiting the fact that PMML's XSD allows floats for recordCounts. You went (somewhat understandably) with the latter; at least it's better than having nothing, right? So thank you for your work! The PMML the library creates is technically valid!

Still, in my opinion "relative record counts" are a far-fetched overinterpretation/exploitation of the XSD, not sustained by the PMML spec. Having them brings ambiguities and parsing issues, as we've discussed in this GH issue. This becomes very clear when you compare your algorithm in practice:

https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/tree/TreeModelEvaluator.java#L357-L451

to the workflow you've recommended earlier:

me: in the real world summing up float numbers might give you slightly different results due to rounding

you: The PMML specification states that probabilities must sum exactly to 1.0 (in the chosen math context - either 32bit or 64bit). If they sum to something else (such as 0.999 or 1.001) then it's an invalid PMML.

  1. In your code you are not checking if the values in the probability attribute of ScoreDistribution sum up to exactly 1 (this can be easily fixed)
  2. In case the "relative record counts" in the recordCount attribute don't sum up to exactly 1 (due to rounding errors), you are actually altering the probabilities (by dividing by e.g. 1.00001)

Contrary to what you've recommended to me, you shouldn't (and actually aren't) discarding the PMML just because the recordCounts don't sum up to 1. It is nowhere written that they have to. The PMML spec only mentions this for the probability attribute, and you cannot apply it to recordCounts as well just because you claim you've put probabilities there. Your code is therefore correct in this regard, and the only way to solve the second issue is to tediously apply the algorithm I've described, to prevent dividing by a very small deviation from 1.

Just to make my and other people's lives easier, I've kindly asked you to add a probability attribute. You don't want to do this, because you don't want to duplicate values. Another option is to add recordCount="1" to Node. I expect that you'll refuse this proposition as well, because:

If the Node@recordCount attribute value is missing, then you can simply calculate it by summing the recordCounts of child ScoreDistribution@recordCount attribute values.

Both of these options are not hard to implement and would allow to apply the same parsing (not validation!!) algorithm that works for probably 99% of all other similar PMMLs out there. And that is my point.

There are other options to get rid of these inelegant relative record counts:

  1. Ask ranger to provide absolute record counts (I've asked there if this is possible)
  2. Ask the PMML group to make either recordCount OR probability mandatory per ScoreDistribution. This would allow setting only the probability attribute.

Until we have "nicer" and more explicit PMMLs, I'm happy we can profit from ranger PMML at all 👍 and I'll work around any issues. Thank you again.

vruusmann commented 4 years ago

Despite that your second example doesn't add up to 1.0

This was very much intended. Where is it stated in the PMML specification that record counts must sum to 1.0?

Still, in my opinion "relative record counts" are a far-fetched overinterpretation/exploitation of the XSD, not sustained by the PMML spec.

Your opinion is not relevant.

In your code you are not checking if the values in the probability attribute of ScoreDistribution sum up to exactly 1 (this can be easily fixed)

This check existed in the JPMML-Evaluator library, but was removed, because there are PMML producer applications (other than the JPMML conversion libraries) that generate probability distributions that don't sum to 1.0 (they sum to 0.999999998 or something like that).

Another option is to add a recordCount="1" to Node

That's another duplication of information. You can derive this value independently by summing the record counts of the child ScoreDistribution elements.

TLDR: You have no idea how PMML works in practice.

chmielot commented 4 years ago

Don't feed the trolls.