gnu-octave / statistics

The Statistics package for GNU Octave
GNU General Public License v3.0

logistic_regression issues #44

Closed NRJank closed 1 year ago

NRJank commented 1 year ago

as originally discussed in bug #60348, and mainly fixed in https://github.com/gnu-octave/statistics/pull/43,

a few continued concerns with logistic_regression:

  1. the help text indicates that Intercept = logistic_regression(Y) should be a valid input, but it currently fails on an empty x variable at line 117.

    117    missing = (isnan (y) | any (isnan (x), 2));

    suggest either adding some varargin input logic or putting something like the

    if (nargin < 2)
      x = zeros (my, 0);
    endif

    back in ahead of line 117. I think that will still work for everything else.

  2. Unless I've missed it being implemented, missing is the name of a special Matlab missing-data type that is on the pending addition list for Octave. It might be best to choose a different variable name on line 117, maybe xymissing or similar.

  3. what I was hung up on when looking at this function over at Savannah: the intercept and slope values still appear to be the negative of the expected values according to the definition in the help. The help still states that logit (P_i (X)) = X * SLOPE + INTERCEPT_i. If that's the case, then the slope and intercept should produce an increasing logistic regression curve when used as the argument h(x) of the logistic function g(x) = exp(h(x)) / (1 + exp(h(x))). See the image below of a standard logistic function with that linear function superimposed, pulled from a Python example of computing logistic regression (realpython.com). For most logistic regressions with the x axis running from 0 to n, you expect a negative intercept and a positive slope, like what's produced in the second column of the output p. Our coefficients are still the opposite of that.

using the data on that same page, and running it in octave:

[intercept, slope] = logistic_regression ([0 0 0 0 1 1 1 1 1 1]', [0:9]')
intercept = 196.19
slope = -56.047

whereas they compute the intercept and slope as [-1.04608067] and [0.51491375] respectively.
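As a quick sanity check (a minimal sketch using only the numbers above, nothing from the function internals), plugging both coefficient pairs into the logistic function shows that the realpython coefficients give a probability curve that increases with x the way the data do, while ours give the mirrored, decreasing curve:

    % data from the realpython.com example
    x = (0:9)';
    y = [0 0 0 0 1 1 1 1 1 1]';

    % logistic function, written in the numerically stable form 1/(1+exp(-h)),
    % equivalent to exp(h)/(1+exp(h))
    g = @(h) 1 ./ (1 + exp (-h));

    p_ours   = g (196.19 + (-56.047) * x);          % decreasing in x
    p_theirs = g (-1.04608067 + 0.51491375 * x);    % increasing in x, tracks y

    disp ([x, y, p_ours, p_theirs]);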

looking at another example that uses the positive form logit(P_i(x)) = x * slope + intercept_i, the first example on wikipedia:

Hours (x_k) | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 1.75 | 2.00 | 2.25 | 2.50 | 2.75 | 3.00 | 3.25 | 3.50 | 4.00 | 4.25 | 4.50 | 4.75 | 5.00 | 5.50
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Pass (y_k) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1

According to that example, a logistic fit with a definition matching ours should give intercept ~ -4.1 and slope ~ 1.5. Running it with the current code gives:

pass = [0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1]';
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]';
[intercept, slope]=logistic_regression(pass,hours)
intercept = 4.0777
slope = -1.5046

So we're currently getting slope and intercept values that are the negative of what our definition implies, which was one of the original bug triggers.

originally, only one of the two was inverted. There was an arbitrary x = -x in the Initial Calculations section that never made sense. I see you removed that, and as I had also found, this resulted in both terms being negated. The wikipedia example had me tempted to just flip the sign on the output, but the realpython.com example makes me think it's setting some incorrect target for Newton's method, and I never quite teased that out. All of the definitions within the main function and logistic_regression_likelihood appear to use the correct positive definition for gamma = exp(b1*x + b0) / (1 + exp(b1*x + b0)), so unless I'm missing a flipped definition in the derivative function that sends it the wrong direction, I'm not sure where the cause is.
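For what it's worth, a bare-bones Newton-Raphson fit of the plain binary model (a quick cross-check written from the textbook update, not the Newton step logistic_regression actually implements) lands on the glmfit/wikipedia signs for the hours/pass data:

    % minimal Newton-Raphson (IRLS) fit of logit(P(y=1|x)) = b0 + b1*x
    % on the wikipedia pass/hours data; cross-check only
    y = [0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1]';
    x = [0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 ...
         2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50]';

    X = [ones(size (x)), x];        % design matrix with intercept column
    b = zeros (2, 1);               % start from b0 = b1 = 0
    for iter = 1:25
      eta  = X * b;
      p    = 1 ./ (1 + exp (-eta)); % gamma, positive definition
      W    = diag (p .* (1 - p));
      grad = X' * (y - p);          % gradient of the log-likelihood
      H    = X' * W * X;            % observed information (minus the Hessian)
      step = H \ grad;
      b   += step;
      if (norm (step) < 1e-10)
        break;
      endif
    endfor
    b    % -> approximately [-4.0777; 1.5046]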

pr0m1th3as commented 1 year ago

Fixed 1 and 2, but I can't dig deep enough into 3. Perhaps @acp29 can see it through.

acp29 commented 1 year ago

It is strange, @NRJank. The behaviour of logistic_regression that you raise in point 3 is the same as the behaviour of mnrfit in MATLAB, although glmfit in MATLAB does give the same answer as the wikipedia page.

pass = [0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1]';
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]';

b = mnrfit(hours,pass+1,'model','ordinal')

b =

      4.07771343108764
     -1.50464542837334

b = glmfit(hours,pass,'binomial')

b =

     -4.07771343108763
      1.50464542837333
NRJank commented 1 year ago

That is interesting. Either could be correct, depending on how the fit equations are defined. What equations are mnrfit and glmfit putting those constants into?

and what do they give for the python example?

acp29 commented 1 year ago

@NRJank, RE the sign flipping of the regression coefficients, you can achieve the glmfit ('correct'?) result with the logistic_regression function simply by using y = -y instead of x = -x:

[intercept, slope]=logistic_regression(pass,hours)
intercept = 4.077713366075039
slope = -1.50464540136909

[intercept, slope]=logistic_regression(-pass,hours)
intercept = -4.077713366075039
slope = 1.50464540136909

That is a binary-outcome example, though. I'm not sure the result with that fix on the carbig test case actually looks right; I was expecting the signs of the coefficients to just flip...

[INTERCEPT, SLOPE, DEV, DL, D2L, P] = logistic_regression (miles, X, true);

Logistic Regression Results (same as for mnrfit, see BIST for logistic_regression):

Number of Iterations: 7
Deviance: 433.197174
Parameter Estimates:
     Intercept      S.E.
      -16.6895      1.9411
      -11.7208      1.7564
       -8.0606      1.7137
         Slope      S.E.
        0.1048      0.0808
        0.0103      0.0050
        0.0645      0.0148
        0.0017      0.0007

[INTERCEPT, SLOPE, DEV, DL, D2L, P] = logistic_regression (-miles, X, true);

Logistic Regression Results:

Number of Iterations: 7
Deviance: 433.197174
Parameter Estimates:
     Intercept      S.E.
        8.0606      1.7137
       11.7208      1.7564
       16.6895      1.9411
         Slope      S.E.
       -0.1048      0.0808
       -0.0103      0.0050
       -0.0645      0.0148
       -0.0017      0.0007
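That said, looking just at the numbers printed above, the flipped run's intercepts are the original intercepts negated and taken in reverse order, and the slopes are plainly negated, which is consistent with negating the response simply reversing the ordinal category order (a quick check on those printed values, not anything pulled from the function):

    % intercepts copied from the two runs above
    b_orig = [-16.6895; -11.7208;  -8.0606];   % logistic_regression (miles, X, true)
    b_flip = [  8.0606;  11.7208;  16.6895];   % logistic_regression (-miles, X, true)
    isequal (b_flip, -flipud (b_orig))         % -> 1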

acp29 commented 1 year ago

As for the python example in matlab, very funky...

b = mnrfit ([0:9]', [0 0 0 0 1 1 1 1 1 1]'+1,'model','ordinal')

b =

      133.708041489432
     -38.1857497061417

b = glmfit ([0:9]', [0 0 0 0 1 1 1 1 1 1]','binomial')

Warning: Iteration limit reached.

In glmfit (line 338)
Warning: The estimated coefficients perfectly separate failures from successes.
         This means the theoretical best estimates are not finite. For the fitted
         linear combination XB of the predictors, the sample proportions P of Y=N
         in the data satisfy:
             XB<-4.18928: P=0
             XB>-4.18928: P=1
In glmfit>diagnoseSeparation (line 558)
In glmfit (line 344)

b =

     -299.880172649815
      84.4831111128196
NRJank commented 1 year ago

> As for the python example in matlab, very funky...

to be fair, it is a fairly lousy test model

pr0m1th3as commented 1 year ago

It's been a while since this topic had any updates. The way I understand it, the discrepancy between logistic_regression and mnrfit is a matter of how the equations are defined within each function, hence the sign flip.

It will take some time before mnrfit and mnrval become available in the statistics package. And even then, we could keep logistic_regression as a separate function or perhaps deprecate it, but that's a decision for the future.

Should we consider this bug fixed for the time being and close this issue?

acp29 commented 1 year ago

As far as I'm concerned, the logistic_regression function gives results like Matlab's mnrfit, so the bug is fixed, and yes, it is just a sign flip: for binomial problems mnrfit and glmfit give results with opposite signs, but glmfit can only solve binomial problems, whereas mnrfit works for a broader range of multinomial problems (someone can correct me if I'm wrong!). I'm concerned that trying to switch the sign logistic_regression gives would make its results differ from mnrfit, and may not be a sufficient solution for more complicated multinomial problems (this would need testing).

I think the funky result in the python example is because it is an ill-conditioned problem. Neither mnrfit nor glmfit works with it, but at least glmfit gives a warning. Ideally logistic_regression should give a warning too.
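A rough sketch of how such a check could look for the simple binary case (the helper name and placement are just illustrative, not a worked-out proposal for logistic_regression): after computing the linear predictor, test whether some threshold on it splits the outcomes perfectly.

    1;  % mark this as a script file if saved to disk

    % true if some threshold on the linear predictor eta splits the binary
    % outcome y into all 0s on one side and all 1s on the other
    % (illustrative helper, not part of logistic_regression)
    function tf = separates_perfectly (y, eta)
      [~, idx] = sort (eta);
      ys = y(idx);
      tf = issorted (ys) || issorted (flipud (ys));
    endfunction

    % the python example above is perfectly separated by x itself
    separates_perfectly ([0 0 0 0 1 1 1 1 1 1]', (0:9)')   % -> 1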

NRJank commented 1 year ago

I hadn't had time to revisit this in a while. My main concern, going back to (I think) the original bug report on Savannah, is that the behavior doesn't match the help text. If we're going to preserve the current behavior, I think the only change required for consistency would be changing:

logit (P_i (@var{x})) = @var{x} * @var{slope} + @var{intercept}_i,   i = 1 @dots{} k-1

-->

logit (P_i (@var{x})) = -(@var{x} * @var{slope} + @var{intercept}_i),   i = 1 @dots{} k-1

In that case, it would fit the generally expected definition of g(x) = exp(h(x)) / (1 + exp(h(x))).
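As a quick check that the negated form matches what the function currently returns (a minimal sketch using the wikipedia numbers from above):

    % with the values the current code returns for the wikipedia data
    % (intercept = 4.0777, slope = -1.5046), the negated linear predictor
    % h(x) = -(x * slope + intercept) gives the expected increasing curve
    hours = [0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 ...
             2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50]';
    h = -(hours * (-1.5046) + 4.0777);
    p = exp (h) ./ (1 + exp (h));   % rises from ~0.03 at 0.5 h to ~0.98 at 5.5 h
    [hours, p]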