hleumas opened this issue 1 year ago
I can confirm this on main.
cc @mroeschke as the expert on the window code.
Just to add more context. I believe that the issue is caused by the way the formula works.
Currently, the result is essentially a weighted average of all datapoints with weights exp2(-t/H).
So, for measurements:
x(t0), x(t1), ..., x(tn)
the resulting average is
1/C [x(t0) exp2(-t0/H) + x(t1) exp2(-t1/H) + ... + x(tn) exp2(-tn/H)]
where C is a normalisation constant.
However, what we should be calculating is the following integral:
1/C * ∫ x(t) exp2(-t/H) dt
What is essentially missing in the summation formula is weighting each of the summands by dt. This is not an issue as long as dt is the same for each summand, which is why this works well for evenly spaced time intervals. Unfortunately, that is not the case for uneven time intervals.
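Written out in the same notation, the discretised version of that integral would weight each summand by its own time step, roughly (a sketch, with $\Delta t_i$ standing for the spacing around $t_i$, e.g. $t_i - t_{i-1}$):

$$
\frac{1}{C}\left[x(t_0)\,2^{-t_0/H}\,\Delta t_0 + x(t_1)\,2^{-t_1/H}\,\Delta t_1 + \dots + x(t_n)\,2^{-t_n/H}\,\Delta t_n\right]
$$

with $C$ now normalising the $\Delta t$-weighted sum.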
Also note, the recursive formula doesn't have this particular issue, so a quick fix would be to allow `adjust=False` when `times` are provided. Currently, pandas refuses to accept `adjust=False` whenever the `times` parameter is present.
Thanks for raising the issue @hleumas, your explanation makes sense to me. It does look like `adjust=True` just shouldn't be supported when `times` is passed.
Supporting `adjust=False` shouldn't be too hard here.
Should `adjust=True` be deprecated or is it salvageable?
I implemented this years ago, and I don't exactly recall how `adjust` fits into EWMA with times. cc @DiegoAlbertoTorres if you have any thoughts
Hello @hleumas @MarcoGorelli @mroeschke @DiegoAlbertoTorres - thank you for raising this issue / discussion on EWMA weighted by time.
I spent a little bit of time thinking about this topic of `adjust={True, False}` for an EWMA with non-uniform times, and was hoping that we could potentially explore a small modification to `adjust=True` (for both polars and pandas).
> Should `adjust=True` be deprecated or is it salvageable?
I agree with @hleumas's statement (which I think entails the solution for making `adjust=True` and `adjust=False` more consistent):

> What is essentially missing in the summation formula is weighting each of the summands with dt
Taking a step back, the pandas docs define the time-weighted exponential moving average (under finite history) as:
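Spelled out (with halflife $h$, observation times $t_0, \dots, t_n$, and weights measured back from the latest time $t_n$, which is how `ema_convolve` below computes it), this is roughly:

$$
y_{t_n} \;=\; \frac{\sum_{i=0}^{n} x_{t_i}\, 2^{-(t_n - t_i)/h}}{\sum_{i=0}^{n} 2^{-(t_n - t_i)/h}}
$$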
This is consistent with the current pandas `adjust=True` behaviour under time weights. If we instead start from a more general continuous-time definition:
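Something along the lines of (integrating from the first observation time $t_0$ up to $t$):

$$
y(t) \;=\; \frac{\int_{t_0}^{t} x(s)\, 2^{-(t - s)/h}\, \mathrm{d}s}{\int_{t_0}^{t} 2^{-(t - s)/h}\, \mathrm{d}s}
$$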
and then discretise it for observations that are not necessarily uniformly spaced, we get:
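With each observation $x_{t_i}$ standing in for a bar of width $\Delta t_i = t_i - t_{i-1}$, this would read roughly:

$$
y_{t_n} \;=\; \frac{\sum_{i} x_{t_i}\, 2^{-(t_n - t_i)/h}\, \Delta t_i}{\sum_{i} 2^{-(t_n - t_i)/h}\, \Delta t_i}
$$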
Under uniformly spaced observations, delta_t = 1 and the two obviously become the same. But in my opinion this is a better-suited model for the non-uniformly spaced weighted average, as the area for each observation is adjusted by its bar width.
To show how these formulations differ and compare to `adjust=False` ("infinite history" EWMA), here is some sample code based on @hleumas's example:

- `ema_pd` implements `adjust=True` (without delta_t weighting)
- `ema_polars` is used as a benchmark for `adjust=False`
- `ema_convolve` corresponds to `adjust=True` and is a direct implementation of the weighted summation formula above. This is added simply to confirm that the recursive implementations match exactly the summation formula math above.
- `ema_test` is a proposed implementation which supports `adjust=True/False` and the proposed "delta_t" correction.
```python
import pandas as pd
import polars as pl
import numpy as np
from numba import njit


# Current EWM implementation in pandas
def ema_pd(df, e_time, adjust: bool = True):
    return df.ewm(
        halflife=pd.to_timedelta(e_time),
        times=df.index,
        adjust=adjust,
    ).mean()


def ema_polars(df: pd.Series, half_life: pd.Timedelta | str, by: str):
    """I use this as a sanity check for adjust=False"""
    return (
        pl.from_pandas(df.to_frame(), include_index=True)
        .with_columns(pl.col(df.name).ewm_mean_by(by=by, half_life=half_life))
    )


@njit()
def ema_test(array: np.array, deltas: np.array, half_life: float,
             adjust: bool = False, weight_by_dt: bool = False):
    num = 0.
    denom = 0.
    result = np.zeros_like(array)
    cumulative_decay = np.cumsum(deltas / half_life)
    cumulative_decay = cumulative_decay[-1] - cumulative_decay
    for i, v in enumerate(array):
        if deltas[i] == 0:
            if i > 1:
                result[i] = result[i-1]
            continue
        # implements the formula with finite sample adjustment
        if adjust:
            alpha = 0.5 ** cumulative_decay[i]
            if weight_by_dt:
                alpha *= deltas[i]
            num += alpha * v
            denom += alpha
            result[i] = num / denom
        else:
            # implements the "infinite history" simple recursion
            alpha = 0.5 ** (deltas[i] / half_life)
            num = alpha * num + (1. - alpha) * v
            result[i] = num
    return result


def ema_convolve(array: np.array, deltas: np.array, half_life: float, weight_by_dt: bool = False):
    """Implements the weighted average summation formula directly"""
    cumulative_decay = np.cumsum(deltas / half_life)
    cumulative_decay = cumulative_decay[-1] - cumulative_decay
    kernel = 0.5 ** cumulative_decay
    # implements the dt bar-width correction so that each sample in the weighted sum
    # takes into account the width of the time bar it applies to
    if weight_by_dt:
        kernel *= deltas
    numerator = (array * kernel).sum()
    denominator = kernel.sum()
    # alternatively can use convolution to get the entire path
    # kernel = 0.5 ** np.cumsum(deltas / half_life)
    # numerator = np.convolve(array, kernel, mode="full")[:len(array)]
    # denominator = np.convolve(np.ones_like(array), kernel, mode="full")[:len(array)]
    return numerator / denominator
```
Then let's first confirm that the test implementation matches pandas (and the directly computed weighted average), when we don't do the "dt" correction:
```python
zeroes = 100_000
ms = 1_000_000  # 1 ms in nanoseconds (integer inputs to pd.to_datetime are treated as ns)

# Let's create a list of 100k 0s spaced 1 ms apart with a single 1 exactly
# 1 second after the last 0.
val = zeroes * [0] + [1]
id = list(range(0, zeroes * ms, ms))
id.append(id[-1] + 1_000 * ms)
id = pd.to_datetime(id)
df = pd.DataFrame(val, columns=['val'], index=id)
half_life = pd.Timedelta("3s")

vals = np.asarray(df["val"]).astype("float64")
# convert to floats
half_life_seconds = half_life / pd.Timedelta("1s")
dt_seconds = np.asarray((df.index.diff() / pd.Timedelta("1s")).fillna(0.0)).astype("float64")

ema_adjust = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=True)[-1]
ema_adjust_pd = ema_pd(df['val'], half_life, adjust=True).to_numpy()[-1]
ema_weighted_avg = ema_convolve(vals, dt_seconds, half_life=half_life_seconds)

print(ema_adjust)
print(ema_adjust_pd)
print(ema_weighted_avg)
```
They all agree:
```
0.0002909852504376329
0.00029098525043919435
0.0002909852504376267
```
Then let's see how `adjust=False` compares:
```python
ema_no_adjust = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=False)[-1]
ema_no_adjust_pl = ema_polars(df["val"].rename_axis(index="ts").rename("val"), half_life,
                              by="ts").select("val")[-1].item()

print(ema_no_adjust)
print(ema_no_adjust_pl)
```
These numbers are now a lot larger as more weight is placed on the last non-zero observation:
```
0.2062994740159002
0.2062994740159002
```
Finally let's see what turning on the "dt" correction does:
```python
ema_corrected = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=True, weight_by_dt=True)[-1]
ema_weighted_avg_corrected = ema_convolve(vals, dt_seconds, half_life=half_life_seconds, weight_by_dt=True)

print(ema_corrected)
print(ema_weighted_avg_corrected)
```
Notice that the discrepancy with the `adjust=False` result is much smaller:
```
0.22544862736755866
0.22544862736755866
```
@hleumas if I take your example numbers verbatim (1s halflife, 1 million samples), I get:
```
# equivalent to pandas adjust=True
0.0013838961963124187
0.0013838961963452783
0.0013838961963124207

# equivalent to your adjust=False (and polars current default)
0.5
0.5

# equivalent to what a 'corrected' adjust=True would return
0.5808558454220695
0.5808558454220693
```
Sorry for the long comment. The behaviour of the corrected adjustment, to me, makes more sense and sounds like a relatively easy fix (just multiply the term used to update the numerator and denominator by `delta_t(i)` and handle the corner case where it is zero). What are your thoughts? Could we implement this new behaviour of `adjust=True` for pandas and polars?
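Concretely, the proposed `adjust=True` update would presumably look something like this (with $\alpha_i = 2^{-\Delta t_i / h}$, and $\Delta t_i = 0$ treated as a special case):

$$
\mathrm{num}_i = \alpha_i\,\mathrm{num}_{i-1} + \Delta t_i\, x_i, \qquad
\mathrm{denom}_i = \alpha_i\,\mathrm{denom}_{i-1} + \Delta t_i, \qquad
y_i = \frac{\mathrm{num}_i}{\mathrm{denom}_i}
$$

Unrolled, this reproduces the $\Delta t$-weighted summation that `ema_convolve(..., weight_by_dt=True)` computes directly.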
Well, I would say that you have to figure out how to deal with the fact that for n values you will always have only n-1 deltas.
But yeah, other than that this is sensible.
Thanks @azmyrajab for your good explanation. I think it makes sense, I'll experiment with it
> Well, I would say that you have to figure out how to deal with the fact that for n values you will always have only n-1 deltas.
> But yeah, other than that this is sensible.
I think the fix might be

```python
if deltas[i] == 0:
    if i > 1:
        result[i] = result[i-1]
    else:
        result[i] = values[i]
    continue
```

? As `y_0` should equal `x_0` (the first observation doesn't have a history to be weighted by)?
I guess I am just confused by the `deltas` array. Where is it coming from? What is the formula for `delta[i]` given times `t[i]`?
From here:

```python
dt_seconds = np.asarray((df.index.diff() / pd.Timedelta("1s")).fillna(0.0)).astype("float64")
```
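In other words, `delta[i]` is just the gap to the previous timestamp, `t[i] - t[i-1]`, expressed in seconds, with the first (undefined) gap filled with zero. A small sketch with made-up timestamps:

```python
import pandas as pd

times = pd.to_datetime([
    "2024-01-01 00:00:00.000",
    "2024-01-01 00:00:00.001",
    "2024-01-01 00:00:01.001",
])
# DatetimeIndex.diff() needs a reasonably recent pandas; the first diff is NaT
deltas = (times.diff() / pd.Timedelta("1s")).fillna(0.0)
print(list(deltas))  # [0.0, 0.001, 1.0]
```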
Ok, I get it. So, given that we have `delta[0] == 0`, we essentially completely ignore the first measured value. Now, I would say that that's slightly strange behaviour, but I guess it doesn't matter as long as we have enough values.
yeah the first result should be kept as-is, that's the modification I was suggesting in https://github.com/pandas-dev/pandas/issues/54328#issuecomment-2090277548
I get that and that is sensible. What I am getting at is the fact that the next results are completely unaffected by it. E.g. the 2nd result just equals the 2nd value. All I am saying is that this is strange. But I don't know what the ideal solution would be.
Hi, I opened a PR (linked just above) to allow `adjust=False` with `times` and to handle irregular-interval deltas. Its scope is narrow and it is still an open question what to do with `adjust=True`. But I wanted to keep moving forward with this issue :)
thanks @tserrao !
Great that `adjust=False` is now supported. `adjust=True` still needs adjusting though, so keeping this open until that happens.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
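A minimal sketch of the setup described below (1 million `0`s spaced 1 ms apart, then a single `1` after a 1 second gap, halflife of 1 second); the exact construction here is an assumption based on that description:

```python
import pandas as pd

zeroes = 1_000_000
ms = 1_000_000  # 1 ms in nanoseconds (integer inputs to pd.to_datetime are treated as ns)

val = zeroes * [0] + [1]
idx = list(range(0, zeroes * ms, ms))
idx.append(idx[-1] + 1_000 * ms)  # the final observation arrives 1 second after the last 0
idx = pd.to_datetime(idx)

df = pd.DataFrame(val, columns=["val"], index=idx)
result = df["val"].ewm(halflife=pd.Timedelta("1s"), times=df.index).mean()
print(result.iloc[-1])  # ~0.001384 rather than the expected 0.5
```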
Issue Description
Pandas seems to have an issue with time series with uneven intervals. Assume the following example: 1 million `0`s spaced 1 millisecond apart, followed by a single `1` after a 1 second gap.

One would naively expect the exponential moving average with a halflife of 1 second to equal exactly 0.5, assuming the equation:
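One equation consistent with that expectation (assumed here for illustration) is the infinite-history recursion with time-based decay:

$$
y_{t_i} = \alpha_i\, y_{t_{i-1}} + (1 - \alpha_i)\, x_{t_i}, \qquad \alpha_i = 2^{-(t_i - t_{i-1})/h}
$$

With $h = 1\,\mathrm{s}$ and a final gap of 1 second, $\alpha = 0.5$ for the last step, so the final value is $0.5 \cdot y_{\text{prev}} + 0.5 \cdot 1 \approx 0.5$.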
However, due to the way the adjustment factor is calculated, this is not true. Unfortunately, `adjust=True` works correctly only for evenly spaced time series and in this situation leads to the extremely small result of `0.001384`.

Also, unfortunately, `adjust=False` is disabled for calculations where the `times` argument is set.

Expected Behavior
A result that is independent of sampling-frequency irregularities. Thus, increasing the sampling frequency in the early times (where the 0s are measured) shouldn't increase their weight in the result.
A result of `0.5` instead of `0.001384`.

Installed Versions