Open dokteurwho opened 7 years ago
cc @TomAugspurger
Hi guys. I published this jupyter notebook in which I suggest improvements to autocorrelation_plot
along the lines of what is suggested in this issue. It would be great to hear your feedback before going through with a PR.
From my point of view the current implementation is 1) wrong, since it doesn't meet the definition of the autocorrelation, and 2) inconsistent with other implementations.
This may also cause a lot of confusion since it is totally unexpected. My plea is to calculate the "real" mean of the products, i.e. to divide by the number of terms in the sum (n - h) instead of by n. I'm not sure whether a +-1 has to be added.
    def r(h):
        # return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / n / c0
        return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / (n - h) / c0
Maybe it's a bug, or maybe it was implemented this way to dampen the noisy tail of the plot.
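To make the difference between the two denominators concrete, here is a minimal, self-contained sketch. It is not the pandas source: the function name `acf`, the flag `divide_by_overlap`, the seed, and the white-noise test series are all assumptions made for illustration. Only the expression inside the loop is taken from the `r(h)` snippet above.

```python
import numpy as np


def acf(data, divide_by_overlap=False):
    """Sample autocorrelation for lags 0..n-1 (illustrative helper, not pandas API).

    divide_by_overlap=False divides each lag's sum by n (the commented-out line above).
    divide_by_overlap=True divides by n - h, the number of overlapping pairs.
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    mean = data.mean()
    c0 = ((data - mean) ** 2).sum() / n  # lag-0 autocovariance
    lags = np.arange(n)
    sums = np.array(
        [((data[: n - h] - mean) * (data[h:] - mean)).sum() for h in lags]
    )
    denom = (n - lags) if divide_by_overlap else n
    return sums / denom / c0


# Assumption: white noise of length 200 with an arbitrary seed.
rng = np.random.default_rng(0)
noise = rng.normal(size=200)

r_n = acf(noise)
r_overlap = acf(noise, divide_by_overlap=True)

tail = slice(180, 200)  # the last 20 lags, each built from 20 pairs or fewer
print("largest |r| in last 20 lags, / n      :", np.abs(r_n[tail]).max())
print("largest |r| in last 20 lags, / (n - h):", np.abs(r_overlap[tail]).max())
```

The two results differ only by the factor n / (n - h) at each lag, so dividing by n can never make the tail larger than dividing by n - h. That is consistent with the "dampening" guess, but it does not show which choice was intended.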
I was confused by this. ;) Cheers
I too was confused by this. Specifically, why my random walks consistently showed strong negative correlation for about two thirds of the chart (from lag-length about 40% of n to the end).
I made a chart comparing it against a manually calculated correlation, run over 100 random walks of length 1,000:
Where the random walk is just x = rng.normal(size=n).cumsum()
I'm new to all this so I'm not sure if my version is right, but I'm pretty sure either the Pandas version is wrong, or the definition of 'autocorrelation' is a bit weird.
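The experiment described above can be approximated with the sketch below. It is not the commenter's code, which is not shown in the thread. It assumes the "manual" version divides by the number of overlapping pairs (n - h), and the helper name `acf_all_lags` and the seed are made up for illustration.

```python
import numpy as np


def acf_all_lags(x, divide_by_overlap):
    """Sample autocorrelation at lags 0..n-1 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    # "full" mode covers lags -(n-1)..(n-1); index n-1 is lag 0.
    sums = np.correlate(xc, xc, mode="full")[n - 1:]
    c0 = sums[0] / n
    denom = (n - np.arange(n)) if divide_by_overlap else n
    return sums / denom / c0


# 100 random walks of length 1,000, as in the comment above.
rng = np.random.default_rng(42)  # arbitrary seed, an assumption
n, n_walks = 1000, 100
walks = rng.normal(size=(n_walks, n)).cumsum(axis=1)

# Average each estimator over all walks, lag by lag.
mean_r_n = np.mean([acf_all_lags(w, False) for w in walks], axis=0)
mean_r_overlap = np.mean([acf_all_lags(w, True) for w in walks], axis=0)

for h in (0, 100, 400, 700, 900):
    print(h, mean_r_n[h], mean_r_overlap[h])
```

Plotting `mean_r_n` and `mean_r_overlap` against the lag should give a chart comparable to the one described, under the assumptions stated above.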
Problem description
I can observe that autocorrelation_plot() generates plots where the importance of the values decreases proportionally with lag. This can be verified visually in the documentation example (https://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-autocorrelation), where the auto-correlation decreases linearly with lag.
Looking into the autocorrelation_plot() implementation, the reason is the r(h) closure, where (data[h:] - mean)).sum() will decrease proportionally with h.
I would like to know the reason for this implementation choice, which is not equivalent to applying autocorr() with different lag values. I suspect the reason is to limit the importance of auto-correlation terms calculated from few values.
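As a small illustration of the non-equivalence described above, the sketch below compares a function written in the style of the `r(h)` closure with `Series.autocorr(lag=h)`. The helper name `plot_style_r` and the sine-wave test series are assumptions made for this example, and the helper is a re-implementation based on the expression quoted in this thread rather than a copy of the pandas source.

```python
import numpy as np
import pandas as pd


def plot_style_r(values, h):
    """r(h) in the style quoted in this thread: the lag-h sum is divided by n."""
    data = np.asarray(values, dtype=float)
    n = len(data)
    mean = data.mean()
    c0 = ((data - mean) ** 2).sum() / n
    return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / n / c0


# Assumption: a pure sine wave, 200 samples, period 20 (exactly 10 periods).
t = np.arange(200)
s = pd.Series(np.sin(2 * np.pi * t / 20))

rows = [
    (h, plot_style_r(s.to_numpy(), h), s.autocorr(lag=h))
    for h in (0, 20, 100, 180)
]
for h, r_plot, r_pearson in rows:
    print(h, round(r_plot, 3), round(r_pearson, 3))
```

At lags that are whole multiples of the period, the shifted series lines up with itself, so `Series.autocorr` stays at about 1, while the plot-style value shrinks like (n - h) / n. `Series.autocorr` is a Pearson correlation over the overlapping part only, so it also differs in general from simply dividing the same sum by n - h.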
Thanks for your feedback.