Generalized Regression Model: Output not a valid pmf (probability mass function)

sonix022 commented 9 years ago

Hello,

My issue pertains to GeneralRegressionModelEvaluator.java and specifically to Generalized Linear Model.

If we take into consideration a two-class (here, +1 and -1) classification problem, then the generalized linear model would estimate the Pr(class = class1) and Pr(class = class2) for any data point given it's feature vector. This is done by modeling the distributions with a logit function. Since we're estimating a pmf, we will have Pr(class = class1) + Pr(class = class2) = 1.

If we look at the loop starting at line 337, it is basically supposed to do the same thing -- iterate our the different classes/categories and compute its probability. Everything goes well for class1, but when the code does the computation for class2 (which is the last category), it assigns value = 0 in line 417 and passes that through the logit function. This will always give the probability of last category to be 0.5, no matter how many categories are there.

For a two-category problem, say the probability we compute in the first iteration of the for loop starting at line 337 for category 1 is value1, then the probability of the other class should be simply (1 - value1). This is not achieved by the code. In fact it would always assign the probability for the last category to be equal to 0.5.

If I'm right, a quick fix could be that for the last category, the probability should be just 1 - sum(all the rest probabilities).

Thanks Akshay

vruusmann commented 9 years ago

The for loop (lines 337 to 455) is followed by a switch statement, which invokes ProbabilityDistribution#normalizeValues() (line 460).

In your two-category problem, the probability of the second category is not 0.5, but depends on the probability of the first category. Eventually, the sum of all probabilities equals 1.0.

sonix022 commented 9 years ago

Yes, that is true. In fact, that is where I was confused the most. We can take any two-dimensional positive vector and normalize it with the sum of its entries to get a valid pdf. But that is not what we want, I think!

I checked the pmf estimates using R (glm using binomial for 2 classes), and they are different from what I get from this code. Everything matches up to the point where we estimate the probability of last category. After that, we estimate 0.5 for last category, and normalize it by the sum of all other probabilities to make it sum to 1.

Thanks

ankitcha commented 9 years ago

+1 to this issue.

@vruusmann It looks like normalize would NOT solve this issue. For instance, lets say probability of class_1 = 0.11, and because of this issue probability of class_2 = 0.5. After normalization, these values become 0.18 & 0.82. You are right that the sum of these normalized probabilities = 1 but the probabilities themselves are different since the actual probabilities should be 0.11 & 0.89.

In this particular case, the variance is small but there are cases where these variances could be substantial.

I ran this evaluator & R for 75000 random samples, and compared the result from these two. The predicted class is same for all those values but the probabilities differ everytime.

vruusmann commented 9 years ago

@sonix022 and/or @ankitcha - Can you provide me with a test case (a PMML file together with input and expected output CSV files) that I could investigate?

My R/Rattle testing module pmml-rattle/src/test includes test cases for Iris ("is versicolor" vs. "is not versicolor") and Audit datasets. Please see the file R/ClassificationTest.R for the definitions of functions generateGeneralRegressionIris() and generateGeneralRegressionAudit(). The JPMML-Evaluator reproduces probability values that are first calculated in those tests. It is possible that my R code has issues - can you spot them?

sonix022 commented 9 years ago

Thanks, @vruusmann .

I took a look at the R code but could not immediately figure out if there is some mistake. Can you please elaborate on the use of following function

categoricalLogitProbabilities = function(probabilities){
    return (probabilities / ((1.0 / (1.0 + exp(0))) + probabilities))
}

Specifically, I'm not able to understand the use of exp(0) in the above function.

If I'm not able to decode what's happening here, then I can send you a test case as suggested by you.

Thanks

vruusmann commented 9 years ago

The expression (1.0 / (1.0 + exp(0))) corresponds to the probability of the last category, when the link function is logit (see line 580 in GeneralRegressionModelEvaluator.java).

Please disregard this test case and work out your own from scratch.

If I recall correctly, then I did not invent this last 0-term on my own - it comes from the PMML specification.

ankitcha commented 9 years ago

@vruusmann as far as I can understand - this expression (1.0 / (1.0 + exp(0))) would always be 0.5, so in a way its just a hard-coded constant. If this is the probability of last category, that means no matter how many categories or the values of features, probability of last category would always be 0.5.

Am I missing something?

Also, the pmml & dataset that I am working is pretty customized to our needs, but I will work on getting a sample dataset & R code (& pmml) tonight, so I can post it here.

Also, could you please provide me the reference to the PMML specification that you are referring, I tried but could not find anything related to it.

Thanks!

sonix022 commented 9 years ago

@vruusmann I ran a very simple 2-class classification example in R using iris dataset. The code, train/test data, and the pmml can be found at https://gist.github.com/sonix022/1376a4a4ecf0c2b999fd

Below I have provided two data points from the test dataset along with the values that we get from R's predict function and from the evaluator. It can be seen that there is discrepancy in the results. Specifically for the first case, R predicts the prob(class = 1) = 0.252927 while the evaluator predicts it to be 0.335924827247574.

This is happening specifically for the reason I mentioned in my first post. Evaluator makes a vector like [0.252927, 0.5] and then normalizes it by dividing each value in this vector with 0.252927 + 0.5 = 0.752927 which gives a valid pmf vector [0.335924..., 0.6640750033], but that is not what is desired. In fact, the output pmf should be [0.252927, 0.747073].

Here is what is happening:

Evaluator computes the value for class 1 correctly as 0.252927
It then takes the value for class 2 to be equal to 0.5
Then it normalizes to get [0.335924..., 0.6640750033]
The correct result should be [0.252927, 0.747073]

6.1,2.6,5.6,1.4,"virginica"
    expected from R - 0.252927 (rounded to 6 decimal places)
    evaluator - 0.335924827247574

6,3,4.8,1.8,"virginica"
    expected from R - 0.330669 (rounded to 6 decimal places)
    evaluator - 0.3980755541466462

It is to be noted that if the expected values are very close to 0 or 1 then normalization would not have much impact and the expected values will be very close to what we get from the evaluator. For that reason, I explicitly removed a feature (Septal.Length) and reduced the training data, so that we see some intermediate values.

Thanks

sonix022 commented 9 years ago

Thanks @vruusmann . This would really help me.

vruusmann commented 9 years ago

This issue was "caused" by the following section (see the description of GeneralRegression models): Now for each response category (value of the target variable) j, let βj be the vector of Parameter estimates for that response category. (If k is the last response category, remember that by convention β k= 0.) Set r j= <x,βj > and s j= exp ri. The probability that our case falls into category j is then p j= sj/ (s1 + ... + s k).

This section applies to the multinomialLogistic model type. Since the evaluation algorithms for multinomialLogistic and generalizedLinear model types are otherwise rather similar, then it was assumed that this approach is applicable for the calculation of latter probabilities as well. However, as you have just demonstrated, this was a bad assumption.

Thanks again. The fixed JPMML-Evaluator version 1.2.1 will be released early next week. Until then you need to work with the GitHub trunk.

ankitcha commented 9 years ago

Thanks @vruusmann

Also, I commented about a minor fix on the commit. Please take a look. Thanks!

vruusmann commented 9 years ago

@ankitcha I think there's a specialization for model types: generalizedLinear for binary classification problems and multinomialLogistic for multi-class classification problems.

Can you generate a PMML file that employs generalizedLinear model type with a multi-classification problems? AFAIK, R/Rattle cannot do this, but maybe some commercial software (SAS, SPSS?) can.

Depending on your success with this, I will either add a sanity check (i.e. the generalizedLinear will assert that the number of target categories is exactly two) or implement the fix as suggested by you.

ankitcha commented 9 years ago

Ok....if thats the case, then I think a sanity check should be good enough.

Thanks!

jpmml / jpmml-evaluator

Generalized Regression Model: Output not a valid pmf (probability mass function) #10