pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.92k stars, 18.03k forks

Why does autocorrelation_plot() generate plots where auto-correlation decreases proportionally with lag? #17098

Open dokteurwho opened 7 years ago

dokteurwho commented 7 years ago

Problem description

I observe that autocorrelation_plot() generates plots in which the magnitude of the values decreases proportionally with lag.

This can be verified visually in the documentation example (https://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-autocorrelation), where the auto-correlation decreases linearly with lag.

Looking into the autocorrelation_plot() implementation, the reason is the r(h) closure, where ((data[:n - h] - mean) * (data[h:] - mean)).sum() shrinks roughly in proportion to h, because it sums only n - h terms while still being divided by n.

I would like to know the reason for this implementation choice, which is not equivalent to applying autocorr() with different lag values.

I suspect the reason is to limit the importance of auto-correlation terms calculated from few values.
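To make the difference concrete, here is a minimal sketch that compares the r(h) formula quoted in this thread with Series.autocorr(lag) on a noisy sine wave. The helper name plot_style_r and the test signal are made up for illustration, and c0 is assumed to be the lag-0 value of the same sum (so that r(0) = 1); this is not a copy of the pandas source.

```python
import numpy as np
import pandas as pd


def plot_style_r(values, h):
    """Lag-h estimate using the divide-by-n formula quoted in this thread."""
    data = np.asarray(values, dtype=float)
    n = len(data)
    mean = data.mean()
    # Assumption: c0 is the lag-0 term, so that r(0) == 1.
    c0 = ((data - mean) ** 2).sum() / n
    return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / n / c0


rng = np.random.default_rng(0)
t = np.arange(200)
# Period-20 sine wave plus a little noise: lags that are multiples of 20
# line the signal up with itself again.
series = pd.Series(np.sin(2 * np.pi * t / 20) + 0.1 * rng.normal(size=t.size))

for lag in (20, 100, 180):
    # Series.autocorr(lag) is the Pearson correlation of the series with
    # itself shifted by `lag`, computed only on the overlapping part.
    print(lag, round(plot_style_r(series, lag), 3), round(series.autocorr(lag), 3))
```

On this signal autocorr() should stay high at every multiple of the period, while the divide-by-n value shrinks as the overlap n - h gets shorter.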

Thanks for your feedback.

gfyoung commented 7 years ago

cc @TomAugspurger

shadiakiki1986 commented 5 years ago

Hi guys. I published this Jupyter notebook, in which I suggest improvements to autocorrelation_plot along the lines of what is suggested in this issue. It would be great to hear your feedback before I go ahead with a PR.

ghost commented 2 years ago

In my view the current implementation is 1) wrong, since it doesn't meet the definition of autocorrelation, and 2) inconsistent with other implementations.

This may also cause a lot of confusion, since it is totally unexpected. My plea is to calculate the "real" mean of the products, dividing by the number of terms (n - h) instead of by n. I'm not sure whether a +-1 has to be added.

def r(h):
    # current: return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / n / c0
    return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / (n - h) / c0

Maybe it's a bug, or maybe it was implemented this way to dampen the noisy tail of the plot.
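As a sanity check on the two variants above: they use the same sum of products and differ only in the divisor, so the divide-by-n value is the divide-by-(n - h) value scaled by (n - h) / n. A small sketch, where the function name acf and the flag divide_by_overlap are hypothetical and c0 is assumed to be the lag-0 term:

```python
import numpy as np


def acf(data, h, divide_by_overlap=False):
    """Lag-h autocorrelation estimate with either divisor discussed above."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    mean = data.mean()
    # Assumption: c0 is the lag-0 term, so both variants give 1 at h = 0.
    c0 = ((data - mean) ** 2).sum() / n
    products = (data[: n - h] - mean) * (data[h:] - mean)
    divisor = (n - h) if divide_by_overlap else n
    return products.sum() / divisor / c0


rng = np.random.default_rng(1)
x = rng.normal(size=500)
n = len(x)

for h in (1, 250, 490):
    by_n = acf(x, h)
    by_overlap = acf(x, h, divide_by_overlap=True)
    # The two estimates differ exactly by the factor (n - h) / n.
    print(h, by_n, by_overlap * (n - h) / n)
```

The factor (n - h) / n is what produces the linear taper towards zero at large lags. Dividing by n - h removes the taper, but the last lags are then averages over very few products, which is presumably the "noisy tail" mentioned above.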

I was confused by this. ;) Cheers

davidgilbertson commented 2 years ago

I too was confused by this. Specifically, why my random walks consistently showed strong negative correlation for about two thirds of the chart (from lag-length about 40% of n to the end).

I made a chart comparing it against a manually calculated correlation, run over 100 random walks of length 1,000: [image: comparison chart]

Where the random walk is just x = rng.normal(size=n).cumsum()
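For anyone who wants to try the random-walk experiment, here is a rough sketch. It is not the code behind the chart above: it only applies the divide-by-n formula quoted earlier in this thread to 100 random walks of length 1,000 and averages the result at a few lags. The helper name r_divide_by_n, the seed and the choice of lags are arbitrary.

```python
import numpy as np


def r_divide_by_n(data, h):
    """Lag-h estimate using the divide-by-n formula quoted in this thread."""
    n = len(data)
    mean = data.mean()
    # Assumption: c0 is the lag-0 term, so that r(0) == 1.
    c0 = ((data - mean) ** 2).sum() / n
    return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / n / c0


rng = np.random.default_rng(42)
n, n_walks = 1000, 100
lags = np.arange(0, n, 100)

# Each row is one random walk: x = rng.normal(size=n).cumsum()
walks = rng.normal(size=(n_walks, n)).cumsum(axis=1)

# Average the estimate over all walks, separately for each lag.
mean_r = np.array([[r_divide_by_n(w, h) for h in lags] for w in walks]).mean(axis=0)

for h, value in zip(lags, mean_r):
    print(h, round(value, 3))
```

Part of what this measures is the effect of subtracting each walk's own sample mean: a walk that spends its first half below its mean tends to spend its second half above it, which pushes the products at long lags negative.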

I'm new to all this, so I'm not sure if my version is right, but I'm pretty sure either the pandas version is wrong or the definition of 'autocorrelation' is a bit weird.