grantjenks / python-runstats

Python module for computing statistics and regression in a single pass.
http://grantjenks.com/docs/runstats/

Fix ddof handling in Regression #19

Closed jbaum-cmcrc closed 7 years ago

jbaum-cmcrc commented 7 years ago

Various methods in the Regression class appear to handle ddof inflexibly, inconsistently or incorrectly.

Proposed solution:

Alternative solution:

grantjenks commented 7 years ago

The code for Regression is intended to match the C++ version at https://www.johndcook.com/blog/running_regression/ and I believe it does so. I have noticed the discrepancies myself and thought it strange, but I did it to match Cook's code. He derived the math, not me.

Would you do two things:

  1. Verify that the Python source matches John Cook's C++ source.
  2. If you suspect that John made a mistake then send him an email and let me know what he says.

The ddof parameter is a recently added feature. I wasn't sure how it would impact the regression calculations, so it was left out as a parameter there.

jbaum-cmcrc commented 7 years ago

Ah, the discrepancy is between the Python code and John Cook's C++.

grantjenks commented 7 years ago

Yes, when runstats introduced the ddof parameter we chose to default to 0.0 to match numpy: https://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html
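
For reference, this is the numpy convention being matched; numpy.var takes a ddof argument defaulting to 0 (the data here is just an arbitrary illustration):

import numpy as np

data = [2, 3, 4, 5, 6]
print(np.var(data))          # ddof defaults to 0 -> population variance: 2.0
print(np.var(data, ddof=1))  # ddof=1 -> sample (Bessel-corrected) variance: 2.5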

Though we introduced the parameter and changed the default, I believe the code for calculating regressions is still the same. Is it not?

jimmo commented 7 years ago

I think the difference is:

python-runstats:

def correlation(self):
    """Correlation of values."""
    term = self._xstats.stddev() * self._ystats.stddev()
    return self._sxy / ((self._count - 1) * term)

In this case, self._xstats.stddev() with ddof defaulting to 0.0 evaluates to (inlining the call to variance()):

(self._rho / (self._count - 0.0))**0.5
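
For context, a rough reconstruction of the Statistics methods being inlined here (inferred from the expression above, not copied verbatim from the runstats source); _rho accumulates the running sum of squared deviations from the mean (Cook's M2):

def variance(self, ddof=0.0):
    # Population variance when ddof=0.0, sample variance when ddof=1.0.
    return self._rho / (self._count - ddof)

def stddev(self, ddof=0.0):
    return self.variance(ddof) ** 0.5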

The C++ version:

double RunningRegression::Correlation() const
{
    double t = x_stats.StandardDeviation() * y_stats.StandardDeviation();
    return S_xy / ( (n-1) * t );
}

In this case, x_stats.StandardDeviation() evaluates to (again inlining the call to Variance()):

sqrt(M2/(n-1.0));

So this looks like a difference between the two.

>>> import runstats
>>> r = runstats.Regression()
>>> r.push(2, 5)
>>> r.push(3, 7.1)
>>> r.push(4, 8.9)
>>> r.push(5, 11.2)
>>> r.push(6, 13.1)
>>> r.slope()
2.03
>>> r.intercept()
0.9400000000000013
>>> r.correlation()
1.2493483465746922

The equivalent C++:
RunningRegression r;
r.Push(2, 5);
r.Push(3, 7.1);
r.Push(4, 8.9);
r.Push(5, 11.2);
r.Push(6, 13.1);
printf("%f %f %f\n", r.Slope(), r.Intercept(), r.Correlation());

Outputs:
2.030000 0.940000 0.999479
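
Both numbers can be reproduced from scratch, and they differ by exactly n / (n - 1). A quick sanity check (not runstats code; the variable names here are ad hoc):

import math

x = [2, 3, 4, 5, 6]
y = [5, 7.1, 8.9, 11.2, 13.1]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # Cook's S_xy
m2x = sum((a - mx) ** 2 for a in x)                   # M2 for x (runstats _rho)
m2y = sum((b - my) ** 2 for b in y)                   # M2 for y

# C++ convention: sample stddev (ddof=1) inside, n - 1 outside.
sample = sxy / ((n - 1) * math.sqrt(m2x / (n - 1)) * math.sqrt(m2y / (n - 1)))
# python-runstats at the time: population stddev (ddof=0) inside, n - 1 outside.
mixed = sxy / ((n - 1) * math.sqrt(m2x / n) * math.sqrt(m2y / n))

print(sample)          # ~0.999479, matches the C++ output
print(mixed)           # ~1.249348, matches the Python output
print(mixed / sample)  # 1.25, i.e. n / (n - 1)

Each population stddev is sqrt((n - 1) / n) times the sample value, so the product of the two inflates the correlation by n / (n - 1), which is 5/4 = 1.25 here.
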
jbaum-cmcrc commented 7 years ago

As @jimmo writes, because we changed the ddof that stddev uses from a hard-coded 1.0 to a default of 0.0, the regression function now effectively calls stddev(..., ddof=0) rather than stddev(..., ddof=1).

This difference is not directly visible in the regression source, since it's an implied/default parameter, but it affects results, particularly when the number of samples is small.

As an initial part of this ticket, we should probably write a unit test for regression that would have passed before the ddof changes and fails now.
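
A minimal sketch of such a test (assuming pytest-style asserts and the push/slope/intercept/correlation API shown above; expected values taken from the C++ output in this thread):

import runstats

def test_regression_matches_cook_cpp():
    # Same five points as the example above.
    regr = runstats.Regression()
    for x, y in [(2, 5), (3, 7.1), (4, 8.9), (5, 11.2), (6, 13.1)]:
        regr.push(x, y)
    # Expected values from John Cook's C++ RunningRegression (ddof=1 throughout).
    assert abs(regr.slope() - 2.03) < 1e-6
    assert abs(regr.intercept() - 0.94) < 1e-6
    assert abs(regr.correlation() - 0.999479) < 1e-5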

grantjenks commented 7 years ago

I made ddof=0.0 the default to match numpy, but as I think about it, I don't think it was the right choice. John Cook chose ddof=1.0, which aligns with observing a sample rather than the whole population (which I think is the main scenario).

In 030be257a67ad5329542546f42255f80b1294396 I changed the code to use ddof=1.0 by default. I also adopted the changes from the merge request (#20) so that ddof=0.0 is possible and should work with the regression objects.

I also included a copy of John's code in the tests directory and have a little program there for running the code and comparing results. Going forward, I'll make sure the output of the Python code matches the C++ code.
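
Assuming the regression methods now accept a ddof keyword the way the Statistics methods do (my reading of #20; I haven't verified the merged signature), usage would look roughly like:

import runstats

regr = runstats.Regression()
for x, y in [(2, 5), (3, 7.1), (4, 8.9), (5, 11.2), (6, 13.1)]:
    regr.push(x, y)

print(regr.correlation())          # default ddof=1.0 -> ~0.999479, matches the C++ code
print(regr.correlation(ddof=0.0))  # population stddevs; with the outer n - 1
                                   # unchanged this gives ~1.249348, the old behaviour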

Changes released to PyPI in v1.7.1.