The code for Regression is intended to match the C++ version at https://www.johndcook.com/blog/running_regression/ and I believe it does so. I noticed the discrepancies myself and thought them strange, but I kept them to match Cook's code. He derived the math, not me.
Would you do two things:
The ddof parameter is a recently added feature. I wasn't sure how it would impact the regression calculations, so it was left out as a parameter there.
Ah, the discrepancy is between the Python code and John Cook's C++:

- Statistics.variance takes a ddof parameter which defaults to 0.0.
- RunningStats.Variance uses a hard-coded 1 for ddof.

Yes, when runstats introduced the ddof parameter we chose to default to 0.0 to match numpy: https://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html

Though we introduced the parameter and changed the default, I believe the code for calculating regressions is still the same. Is it not?
I think the difference is:
python-runstats:
def correlation(self):
    """Correlation of values."""
    term = self._xstats.stddev() * self._ystats.stddev()
    return self._sxy / ((self._count - 1) * term)
In this case, self._xstats.stddev() with ddof defaulting to 0.0 evaluates to (inlining the call to variance()):

(self._rho / (self._count - 0.0))**0.5
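For reference, the Statistics methods being inlined here look roughly like this (a paraphrased sketch reconstructed from the expression above, not the verbatim runstats source):

def variance(self, ddof=0.0):
    """Variance of values; ddof=0.0 gives the population variance."""
    return self._rho / (self._count - ddof)

def stddev(self, ddof=0.0):
    """Standard deviation of values."""
    return self.variance(ddof) ** 0.5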
The C++ version:
double RunningRegression::Correlation() const
{
    double t = x_stats.StandardDeviation() * y_stats.StandardDeviation();
    return S_xy / ( (n-1) * t );
}
In this case, x_stats.StandardDeviation() evaluates to (again inlining the call to Variance()):

sqrt(M2/(n-1.0));
So this looks like the difference between the two: the Python standard deviation divides by n (ddof=0) where the C++ divides by n - 1 (ddof=1).
>>> import runstats
>>> r = runstats.Regression()
>>> r.push(2, 5)
>>> r.push(3, 7.1)
>>> r.push(4, 8.9)
>>> r.push(5, 11.2)
>>> r.push(6, 13.1)
>>> r.slope()
2.03
>>> r.intercept()
0.9400000000000013
>>> r.correlation()
1.2493483465746922
>>>
RunningRegression r;
r.Push(2, 5);
r.Push(3, 7.1);
r.Push(4, 8.9);
r.Push(5, 11.2);
r.Push(6, 13.1);
printf("%f %f %f\n", r.Slope(), r.Intercept(), r.Correlation());
Outputs:
2.030000 0.940000 0.999479

The slope and intercept agree, but the correlations differ by exactly the n / (n - 1) factor introduced by the ddof mismatch: 1.2493483465746922 * 4 / 5 ≈ 0.999479.
As @jimmo writes, because we changed the ddof that stddev uses from hard-coded 1.0 to a default of 0.0, the regression functions now effectively call stddev(..., ddof=0) rather than stddev(..., ddof=1).

This difference is not directly visible in the regression source, since it's an implied/default parameter, but it affects results, particularly when the number of samples is small.

Probably as an initial part of this ticket we should write a unit test for regression that passes before the ddof changes and fails now.
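A minimal sketch of such a test, written pytest-style and hard-coding the expected values from John Cook's C++ output above (the 1e-6 tolerance is my assumption):

import runstats

def test_regression_matches_cook_cpp():
    """Regression should match John Cook's C++ RunningRegression (ddof=1)."""
    regr = runstats.Regression()
    for x, y in [(2, 5), (3, 7.1), (4, 8.9), (5, 11.2), (6, 13.1)]:
        regr.push(x, y)
    assert abs(regr.slope() - 2.03) < 1e-6
    assert abs(regr.intercept() - 0.94) < 1e-6
    # Fails with ddof=0 semantics (which give ~1.2493), passes with ddof=1.
    assert abs(regr.correlation() - 0.999479) < 1e-6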
I made ddof=0.0 the default to match numpy but, as I think about it, I don't think it was the right choice. John Cook chose ddof=1.0, which aligns with observing a sample rather than the whole population (which I think is the main scenario).
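To illustrate the difference on a small sample (a quick sketch using the ddof parameter discussed above; the sum of squared deviations for these values is 10):

from runstats import Statistics

stats = Statistics()
for value in [2, 3, 4, 5, 6]:
    stats.push(value)

print(stats.variance(ddof=1.0))  # sample variance: 10 / (5 - 1) = 2.5
print(stats.variance(ddof=0.0))  # population variance: 10 / 5 = 2.0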
In 030be257a67ad5329542546f42255f80b1294396 I changed the code to use ddof=1.0 by default. I also adopted the changes from the merge request (#20) so that ddof=0.0 is possible and should work with the regression objects.
I also included a copy of John's code in the tests directory and have a little program there for running the code and comparing results. Going forward, I'll make sure the output of the Python code matches the C++ code.
Changes released to PyPI at v1.7.1
Various methods in the Regression class appear to handle ddof inflexibly, inconsistently, or incorrectly:

- slope and intercept are hard-coded to use ddof=1
- correlation uses the default ddof=0 when it calls stddev but then uses ddof=1 internally

Proposed solution: change the Regression methods to use ddof correctly, consistently using a passed ddof parameter (see the sketch below).

Alternative solution: change the Regression methods to use ddof correctly, consistently using 1.0. However, this would be less flexible than a parameter.
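A minimal sketch of the proposed solution's shape, assuming Statistics.variance and Statistics.stddev accept a ddof argument as discussed above; the attribute names _xstats, _ystats, _sxy, and _count are taken from the correlation source quoted earlier:

def slope(self, ddof=1.0):
    """Slope of the regression line, applying ddof consistently."""
    sxx = self._xstats.variance(ddof) * (self._count - ddof)
    return self._sxy / sxx

def correlation(self, ddof=1.0):
    """Correlation of values, applying ddof consistently."""
    term = self._xstats.stddev(ddof) * self._ystats.stddev(ddof)
    return self._sxy / ((self._count - ddof) * term)

Worth noting: when ddof is applied consistently it cancels algebraically in both methods, so the real bug is the mixed ddof values, not the particular default.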