grantjenks / python-runstats

Python module for computing statistics and regression in a single pass.
http://grantjenks.com/docs/runstats/

Fix ddof handling in Regression #19

Closed jbaum-cmcrc closed 7 years ago

jbaum-cmcrc commented 7 years ago

Various methods in the Regression class appear to handle ddof inflexibly, inconsistently or incorrectly.

Proposed solution:

Alternative solution:

grantjenks commented 7 years ago

The code for Regression is intended to match the C++ version at https://www.johndcook.com/blog/running_regression/ and I believe it does so. I have noticed the discrepancies myself and thought it strange, but I did it to match Cook's code. He derived the math, not me.

Would you do two things:

  1. Verify that the Python source matches John Cook's C++ source.
  2. If you suspect that John made a mistake then send him an email and let me know what he says.

The ddof parameter is a recently added feature. I wasn't sure how it would impact the regression calculations, so it was left out as a parameter there.

jbaum-cmcrc commented 7 years ago

Ah, the discrepancy is between the Python code and John Cook's C++.

grantjenks commented 7 years ago

Yes, when runstats introduced the ddof parameter we chose to default to 0.0 to match numpy: https://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html
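
For reference, this is the numpy convention being matched; numpy.var takes a ddof argument defaulting to 0 (the data here is just an arbitrary illustration):

import numpy as np

data = [2, 3, 4, 5, 6]
print(np.var(data))          # ddof defaults to 0 -> population variance: 2.0
print(np.var(data, ddof=1))  # ddof=1 -> sample (Bessel-corrected) variance: 2.5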

Though we introduced the parameter and changed the default, I believe the code for calculating regressions is still the same. Is it not?

jimmo commented 7 years ago

I think the difference is:

python-runstats:

def correlation(self):
    """Correlation of values."""
    term = self._xstats.stddev() * self._ystats.stddev()
    return self._sxy / ((self._count - 1) * term)

In this case, self._xstats.stddev() with ddof defaulting to 0.0 evaluates to (inlining the call to variance()):

(self._rho / (self._count - 0.0))**0.5
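
For context, a rough reconstruction of the Statistics methods being inlined here (inferred from the expression above, not copied verbatim from the runstats source); _rho accumulates the running sum of squared deviations from the mean (Cook's M2):

def variance(self, ddof=0.0):
    # Population variance when ddof=0.0, sample variance when ddof=1.0.
    return self._rho / (self._count - ddof)

def stddev(self, ddof=0.0):
    return self.variance(ddof) ** 0.5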

The C++ version:

double RunningRegression::Correlation() const
{
    double t = x_stats.StandardDeviation() * y_stats.StandardDeviation();
    return S_xy / ( (n-1) * t );
}

In this case, x_stats.StandardDeviation() evaluates to (again inlining the call to Variance()):

sqrt(M2/(n-1.0));

So this looks like a difference between the two.

>>> import runstats
>>> r = runstats.Regression()
>>> r.push(2, 5)
>>> r.push(3, 7.1)
>>> r.push(4, 8.9)
>>> r.push(5, 11.2)
>>> r.push(6, 13.1)
>>> r.slope()
2.03
>>> r.intercept()
0.9400000000000013
>>> r.correlation()
1.2493483465746922

The equivalent C++:
RunningRegression r;
r.Push(2, 5);
r.Push(3, 7.1);
r.Push(4, 8.9);
r.Push(5, 11.2);
r.Push(6, 13.1);
printf("%f %f %f\n", r.Slope(), r.Intercept(), r.Correlation());

Outputs:
2.030000 0.940000 0.999479
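
Both numbers can be reproduced from scratch, and they differ by exactly n / (n - 1). A quick sanity check (not runstats code; the variable names here are ad hoc):

import math

x = [2, 3, 4, 5, 6]
y = [5, 7.1, 8.9, 11.2, 13.1]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # Cook's S_xy
m2x = sum((a - mx) ** 2 for a in x)                   # M2 for x (runstats _rho)
m2y = sum((b - my) ** 2 for b in y)                   # M2 for y

# C++ convention: sample stddev (ddof=1) inside, n - 1 outside.
sample = sxy / ((n - 1) * math.sqrt(m2x / (n - 1)) * math.sqrt(m2y / (n - 1)))
# python-runstats at the time: population stddev (ddof=0) inside, n - 1 outside.
mixed = sxy / ((n - 1) * math.sqrt(m2x / n) * math.sqrt(m2y / n))

print(sample)          # ~0.999479, matches the C++ output
print(mixed)           # ~1.249348, matches the Python output
print(mixed / sample)  # 1.25, i.e. n / (n - 1)

Each population stddev is sqrt((n - 1) / n) times the sample value, so the product of the two inflates the correlation by n / (n - 1), which is 5/4 = 1.25 here.
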
jbaum-cmcrc commented 7 years ago

As @jimmo writes, because we changed the ddof that stddev uses from a hard-coded 1.0 to a default of 0.0, the regression function now effectively calls stddev(..., ddof=0) rather than stddev(..., ddof=1).

This difference is not directly visible in the regression source, since it's an implied/default parameter, but it affects results, particularly when the number of samples is small.

As an initial part of this ticket, we should probably write a unit test for regression that would have passed before the ddof changes and fails now.
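
A minimal sketch of such a test (assuming pytest-style asserts and the push/slope/intercept/correlation API shown above; expected values taken from the C++ output in this thread):

import runstats

def test_regression_matches_cook_cpp():
    # Same five points as the example above.
    regr = runstats.Regression()
    for x, y in [(2, 5), (3, 7.1), (4, 8.9), (5, 11.2), (6, 13.1)]:
        regr.push(x, y)
    # Expected values from John Cook's C++ RunningRegression (ddof=1 throughout).
    assert abs(regr.slope() - 2.03) < 1e-6
    assert abs(regr.intercept() - 0.94) < 1e-6
    assert abs(regr.correlation() - 0.999479) < 1e-5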

grantjenks commented 7 years ago

I made ddof=0.0 the default to match numpy, but as I think about it, I don't think it was the right choice. John Cook chose ddof=1.0, which aligns with observing a sample rather than the whole population (which I think is the main scenario).

In 030be257a67ad5329542546f42255f80b1294396 I changed the code to use ddof=1.0 by default. I also adopted the changes from the merge request (#20) so that ddof=0.0 is possible and should work with the regression objects.

I also included a copy of John's code in the tests directory and have a little program there for running the code and comparing results. Going forward, I'll make sure the output of the Python code matches the C++ code.
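
Assuming the regression methods now accept a ddof keyword the way the Statistics methods do (my reading of #20; I haven't verified the merged signature), usage would look roughly like:

import runstats

regr = runstats.Regression()
for x, y in [(2, 5), (3, 7.1), (4, 8.9), (5, 11.2), (6, 13.1)]:
    regr.push(x, y)

print(regr.correlation())          # default ddof=1.0 -> ~0.999479, matches the C++ code
print(regr.correlation(ddof=0.0))  # population stddevs; with the outer n - 1
                                   # unchanged this gives ~1.249348, the old behaviour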

Changes released to PyPI in v1.7.1.