pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: ewma with time gives strange results with adjust = False vs. True, difference is too large to make sense #40098

Open jasonzhang2s opened 3 years ago

jasonzhang2s commented 3 years ago


Code Sample, a copy-pastable example


import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.date_range('20000101', '20201231', periods=50000)
df = pd.DataFrame(data=np.random.normal(0, 1, 50000), index=idx)

# Exclude the first 1000 rows to avoid the un-primed warm-up period
df.ewm(halflife=pd.Timedelta('10d'), times=df.index, adjust=True).mean().iloc[1000:].plot()
df.ewm(halflife=pd.Timedelta('10d'), times=df.index, adjust=False).mean().iloc[1000:].plot()


mroeschke commented 3 years ago

@DiegoAlbertoTorres do you happen to know offhand how the ewm with times formula changes with adjust=False? (The formula is at https://pandas.pydata.org/docs/user_guide/window.html#exponentially-weighted-window)

DiegoAlbertoTorres commented 3 years ago

I am not sure. The implementation of adjust=False rests on the identity that the denominator of the EWMA when adjust=True, 1 + (1-a) + (1-a)^2 + ..., is a geometric series that converges to 1/a, so dividing by it is equivalent to simply multiplying by a (see the image below, reading a as alpha). This is expressed in the docs here: [screenshot]
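To make the identity concrete, here is a minimal sketch in plain Python (no pandas). The function names are made up for illustration, and the recursion starts from y_0 = x_0 as described in the user guide. On evenly spaced data the two forms agree once the series is long enough:

```python
def ewma_adjusted(xs, alpha):
    """adjust=True form: weighted average with weights (1 - alpha)**i."""
    n = len(xs)
    weights = [(1 - alpha) ** (n - 1 - i) for i in range(n)]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


def ewma_recursive(xs, alpha):
    """adjust=False form: y_t = (1 - alpha) * y_{t-1} + alpha * x_t, y_0 = x_0."""
    y = xs[0]
    for x in xs[1:]:
        y = (1 - alpha) * y + alpha * x
    return y


alpha = 0.1

# The adjust=True denominator is a geometric series that approaches 1 / alpha.
denominator = sum((1 - alpha) ** i for i in range(2000))

# Arbitrary deterministic test data, long enough for the start-up effect to vanish.
xs = [((i * 37) % 11) - 5.0 for i in range(2000)]
gap = abs(ewma_adjusted(xs, alpha) - ewma_recursive(xs, alpha))

print(denominator, 1 / alpha, gap)
```

The agreement depends entirely on the weights being powers of a single ratio (1 - alpha).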

However, when times is provided, the weights look as below: [image]

This is not a geometric series, so you cannot assume that the same identity holds. This is easy to show with a time vector that simply repeats the same timestamp to infinity: none of the weights decay, so their sum is infinite. I am not sure what we should do here.
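A rough sketch of that counterexample, assuming the time-based weight has the form 0.5 ** ((t_last - t_i) / halflife), which is how I read the user guide formula. The helper name and the numeric timestamps are made up for illustration:

```python
def time_weights(times, halflife):
    """Assumed weight form: 0.5 ** ((t_last - t_i) / halflife)."""
    t_last = times[-1]
    return [0.5 ** ((t_last - t) / halflife) for t in times]


halflife = 10.0

# Evenly spaced timestamps: the weights are powers of one ratio, so the
# sum converges to the geometric-series limit 1 / (1 - ratio).
regular = [float(i) for i in range(5000)]
regular_sum = sum(time_weights(regular, halflife))
ratio = 0.5 ** (1.0 / halflife)
geometric_limit = 1.0 / (1.0 - ratio)

# One timestamp repeated n times: every weight is 1, so the sum is n
# and grows without bound as n increases.
repeated_sums = [sum(time_weights([0.0] * n, halflife)) for n in (10, 100, 1000)]

print(regular_sum, geometric_limit, repeated_sums)
```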

I initially suspected that the adjustment we make to the iteration (for the adjust parameter) should be the same whether times is set or not. But the fact that the proof breaks down with my counterexample, together with Jason's discovery, suggests this might not hold at all. I have not run Jason's example; how big is the difference? I think if we double-check the code and construct large enough counterexamples, we should be able to show empirically that the math behind adjust=False does not hold when the weights do not follow a geometric series.
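One way such a counterexample could be constructed, as a sketch only. Both functions are hypothetical stand-ins: the first assumes the weight form 0.5 ** ((t_last - t_i) / halflife), and the second is a naive fixed-alpha recursion that ignores timestamps. I am not claiming either is what pandas actually computes.

```python
def adjusted_time_mean(xs, times, halflife):
    """Weighted mean with assumed weights 0.5 ** ((t_last - t_i) / halflife)."""
    t_last = times[-1]
    weights = [0.5 ** ((t_last - t) / halflife) for t in times]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


def fixed_alpha_recursion(xs, halflife):
    """Naive recursion that ignores timestamps and assumes unit spacing."""
    alpha = 1.0 - 0.5 ** (1.0 / halflife)
    y = xs[0]
    for x in xs[1:]:
        y = (1.0 - alpha) * y + alpha * x
    return y


halflife = 10.0
xs = [0.0] * 999 + [1.0]

# Evenly spaced timestamps: the geometric-series identity applies.
regular_times = [float(i) for i in range(1000)]
# Same data, but the last observation arrives after a gap of 100 half-lives.
irregular_times = regular_times[:-1] + [regular_times[-2] + 1000.0]

recursive = fixed_alpha_recursion(xs, halflife)
gap_regular = abs(adjusted_time_mean(xs, regular_times, halflife) - recursive)
gap_irregular = abs(adjusted_time_mean(xs, irregular_times, halflife) - recursive)

print(gap_regular, gap_irregular)
```

With even spacing the two agree to floating-point precision. After the long gap, the time-weighted mean is dominated by the last observation while the fixed-alpha recursion still gives it weight alpha, so the two results are far apart.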

mroeschke commented 3 years ago

Here's a plot of the diff of Jason's data (adjust=True - adjust=False) for reference:

[plot: adjust_true_minus_false]

I think the safest thing to do would be to raise a NotImplementedError for times and adjust=False for now.
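Roughly what I have in mind, as a standalone sketch. The function name, its signature, and the error message are hypothetical, not existing pandas internals:

```python
def validate_ewm_args(times=None, adjust=True):
    """Hypothetical guard: reject the times + adjust=False combination."""
    if times is not None and not adjust:
        raise NotImplementedError("times is not supported with adjust=False.")


# Combinations that would still be allowed under this sketch.
validate_ewm_args(times=None, adjust=False)
validate_ewm_args(times=[1, 2, 3], adjust=True)

# The combination in question is rejected.
try:
    validate_ewm_args(times=[1, 2, 3], adjust=False)
    raised = False
except NotImplementedError:
    raised = True

print(raised)
```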

MarcoGorelli commented 4 months ago

Relevant issue: https://github.com/pandas-dev/pandas/issues/54328

I'm no expert here, but I think the solution might be to do the opposite of https://github.com/pandas-dev/pandas/pull/40314, i.e. bring back adjust=False and raise on adjust=True?