pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

EWMA weighted by time with `adjust=True` is flawed, and `adjust=False` is not supported #54328

Open hleumas opened 11 months ago

hleumas commented 11 months ago

Pandas version checks

Reproducible Example

import math
import pandas as pd

# Current EWM implementation in pandas
def ema(df, e_time):
    return df.ewm(
        halflife=pd.to_timedelta(e_time),
        times=df.index,
    ).mean()

# Correct EWM calculation for reference
def ema_manual(df, e_time):
    it = df.items()
    _list = []

    lt, s = next(it)
    _list.append(s)

    for t, v in it:
        q = math.exp2((lt - t) / e_time)
        s = s * q + (1 - q) * v
        lt = t
        _list.append(s)

    return pd.DataFrame(_list, index=df.index)

zeroes = 1_000_000
ms = 1_000_000 # 1 ms in nanoseconds (the integer timestamps below are interpreted as ns)

# Let's create a list of 1,000,000 0s spaced 1 ms apart with a single 1 exactly
# 1 second after the last 0.
val = zeroes * [0] + [1]
id = list(range(0, zeroes * ms, ms))
id.append(id[-1] + 1000 * ms)
id = pd.to_datetime(id)
df = pd.DataFrame(val, columns=['val'], index=id)

# We would expect the EWM of this series with halflife=1s to be exactly 0.5,
# which is confirmed by the manual calculation
print(ema_manual(df['val'], '1s'))

# Whereas pandas EWM returns a number close to 0
print(ema(df['val'], '1s'))

Issue Description

Pandas seems to have an issue with time series with uneven intervals. Assume the following example:

1 million 0s spaced 1 millisecond apart, followed by a single 1 after a 1-second gap:

1970-01-01 00:00:00.000    0
1970-01-01 00:00:00.001    0
1970-01-01 00:00:00.002    0
1970-01-01 00:00:00.003    0
1970-01-01 00:00:00.004    0
...  999,991 more items  ...
1970-01-01 00:16:39.996    0
1970-01-01 00:16:39.997    0
1970-01-01 00:16:39.998    0
1970-01-01 00:16:39.999    0
-----   1 second gap   -----
1970-01-01 00:16:40.999    1

[1000001 rows x 1 columns]

One would naively expect the exponential moving average with a halflife of 1 second to equal to exactly 0.5, assuming equation:

w    = 0.5 ** (dt/halflife) = 0.5 **(1s/1s)
y(t) = y(t - 1) * w + x * (1 - w)
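Plugging in the example's numbers: the average just before the gap is 0 (after the long run of 0s), the single 1 arrives after dt = 1s, so w = 0.5 and y = 0 * 0.5 + 1 * (1 - 0.5) = 0.5.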

However, due to the way the adjustment factor is calculated, this is not true. Unfortunately, adjust=True works correctly only for evenly spaced time series, and in this situation it produces the extremely small result of 0.001384.

Also, unfortunately, adjust=False is disabled whenever the times argument is set.
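For reference, a minimal sketch of that restriction (illustrative only; the exact exception type and message may vary across pandas versions):

import pandas as pd

s = pd.Series([0.0, 1.0], index=pd.to_datetime(["2023-01-01", "2023-01-02"]))
try:
    s.ewm(halflife=pd.Timedelta("1d"), times=s.index, adjust=False).mean()
except Exception as exc:  # pandas currently rejects adjust=False combined with times
    print(type(exc).__name__, exc)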

Expected Behavior

A result that is independent of sampling-frequency irregularities: increasing the sampling frequency in the early part of the series (where the 0s are measured) shouldn't increase their weight in the result.

A result of 0.5 instead of 0.001384.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0f437949513225922d851e9581723d82120684a6
python : 3.11.4.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 2.0.3
numpy : 1.25.1
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
lithomas1 commented 11 months ago

I can confirm this on main.

cc @mroeschke as the expert on the window code.

hleumas commented 11 months ago

Just to add more context. I believe that the issue is caused by the way the formula works.

Currently, the result is essentially a weighted average of all datapoints with weights exp2(-t/H).

So, for measurements:

x(t0), x(t1), ..., x(tn)

the resulting average is

1/C [x(t0) exp2(-t0/H) + x(t1) exp2(-t1/H) + ... + x(tn) exp2(-tn/H)]

where C is a normalisation constant.

However, what we should be calculating is the following integral:

1/C * ∫ x(t) exp2(-t/H) dt

What is essentially missing in the summation formula is weighting each of the summands by dt. This is not an issue as long as dt is the same for every summand, which is why it works well for even time intervals. Unfortunately, that is not the case for uneven time intervals.
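To make the missing dt weighting concrete, here is a small standalone NumPy sketch (illustrative only, not the pandas implementation); it keeps only the ~10,000 most recent zeros from the example above, since the weights of older points are negligible:

import numpy as np

halflife = 1.0  # seconds
# ages = how long before the final sample each point was observed:
# the zeros at 1.000 s, 1.001 s, ..., the single 1 at age 0
ages = np.concatenate([1.0 + np.arange(10_000) * 0.001, [0.0]])
x = np.concatenate([np.zeros(10_000), [1.0]])

w = np.exp2(-ages / halflife)            # plain summation weights (what adjust=True uses today)
print((w * x).sum() / w.sum())           # ~0.0014, the tiny value reported above

dt = np.concatenate([np.full(10_000, 0.001), [1.0]])  # width of the time bar each sample covers
w_dt = w * dt                            # weight each summand by its dt, approximating the integral
print((w_dt * x).sum() / w_dt.sum())     # ~0.58, much closer to the expected 0.5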

hleumas commented 11 months ago

Also note, the recursive formula doesn't have this particular issue, so a quick fix would be to allow adjust=False when times are provided. Currently pandas refuses to accept adjust=False whenever the times parameter is present.
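For reference, the time-aware recursion in question (the one ema_manual above implements) is:

w_i    = 0.5 ** ((t_i - t_{i-1}) / halflife)
y(t_i) = y(t_{i-1}) * w_i + x(t_i) * (1 - w_i)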

MarcoGorelli commented 4 months ago

Thanks for raising the issue @hleumas, your explanation makes sense to me. It does look like adjust=True just shouldn't be supported when times is passed.

Supporting adjust=False shouldn't be too hard here

Should adjust=True be deprecated or is it salvageable?

mroeschke commented 3 months ago

I implemented this years ago, and I don't exactly recall how adjust fits into EWMA with times. cc @DiegoAlbertoTorres if you have any thoughts

azmyrajab commented 2 months ago

Hello @hleumas @MarcoGorelli @mroeschke @DiegoAlbertoTorres - thank you for raising this issue / discussion on EWMA weighted by time.

I spent a bit of time thinking about adjust={True, False} for an EWMA with non-uniform times, and was hoping we could explore a small modification to adjust=True (for both polars and pandas).

Should adjust=True be deprecated or is it salvageable?

I agree with @hleumas's statement (which I think points to the solution for making adjust=True and adjust=False more consistent):

What is essentially missing in the summation formula is weighting each of the summands with dt


Taking a step back, the pandas docs define the time-weighted exponential moving average (under finite history) as:

y(t_n) = [ Σ_{i=0..n} 0.5 ** ((t_n - t_i)/halflife) * x(t_i) ] / [ Σ_{i=0..n} 0.5 ** ((t_n - t_i)/halflife) ]

This is consistent with pandas' current adjust=True behaviour under time weights. If we instead start from a more general continuous-time definition:

y(T) = [ ∫_0^T 0.5 ** ((T - t)/halflife) * x(t) dt ] / [ ∫_0^T 0.5 ** ((T - t)/halflife) dt ]

and then adapt it to discrete, but not necessarily uniformly spaced, observations, we get:

y(t_n) = [ Σ_{i=0..n} 0.5 ** ((t_n - t_i)/halflife) * x(t_i) * Δt_i ] / [ Σ_{i=0..n} 0.5 ** ((t_n - t_i)/halflife) * Δt_i ]

Under uniformly spaced observations delta_t is constant and cancels, so the two formulas coincide. But in my opinion the second is a better-suited model for the non-uniformly spaced case, as the area contributed by each observation is adjusted by its bar width.
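A quick numerical check of that cancellation (illustrative NumPy only, not pandas code):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)
w = 0.5 ** (np.arange(10)[::-1] / 3.0)   # decay weights for a halflife of 3 samples
dt = np.full(10, 7.0)                    # any constant spacing
# the constant dt cancels in the ratio, so both formulations give the same average
assert np.isclose((w * x).sum() / w.sum(), (w * dt * x).sum() / (w * dt).sum())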


To show how these formulations differ and how they compare to adjust=False (the "infinite history" EWMA), here is some sample code based on @hleumas's example:

1. ema_pd implements adjust=True (without delta_t weighting).
2. ema_polars is used as a benchmark for adjust=False.
3. ema_convolve corresponds to adjust=True and is a direct implementation of the weighted summation formula above. It is included simply to confirm that the recursive implementations match the summation formula exactly.
4. ema_test is a proposed implementation which supports adjust=True/False plus the proposed "delta_t" correction.
import pandas as pd
import polars as pl
import numpy as np
from numba import njit

# Current EWM implementation in pandas
def ema_pd(df, e_time, adjust: bool = True):
    return df.ewm(
        halflife=pd.to_timedelta(e_time),
        times=df.index,
        adjust=adjust,
    ).mean()

def ema_polars(df: pd.Series, half_life: pd.Timedelta | str, by: str):
    """I use this as a sanity check for adjust=False"""
    return (
        pl.from_pandas(df.to_frame(), include_index=True)
        .with_columns(pl.col(df.name).ewm_mean_by(by=by, half_life=half_life))
    )

@njit()
def ema_test(array: np.array, deltas: np.array, half_life: float,
             adjust: bool = False, weight_by_dt: bool = False):
    num = 0.
    denom = 0.
    result = np.zeros_like(array)
    cumulative_decay = np.cumsum(deltas / half_life)
    cumulative_decay = cumulative_decay[-1] - cumulative_decay

    for i, v in enumerate(array):
        if deltas[i] == 0:
            if i > 1:
                result[i] = result[i-1]
            continue
        # implements the formula with finite sample adjustment 
        if adjust:
            alpha = 0.5 ** cumulative_decay[i]
            if weight_by_dt:
                alpha *= deltas[i]
            num += alpha * v
            denom += alpha
            result[i] = num / denom
        else:
            # implements "infinite history" simple recursions
            alpha = 0.5 ** (deltas[i] / half_life)
            num = alpha * num + (1. - alpha) * v
            result[i] = num
    return result

def ema_convolve(array: np.array, deltas: np.array, half_life: float, weight_by_dt: bool = False):
    """Implements the weighted average summation formula directly"""
    cumulative_decay = np.cumsum(deltas / half_life)
    cumulative_decay = cumulative_decay[-1] - cumulative_decay
    kernel = 0.5 ** cumulative_decay
    # implements the dt bar-width correction so that each sample in the weighted sum accounts for the width of the time bar it applies to
    if weight_by_dt:
        kernel *= deltas
    numerator = (array * kernel).sum()
    denominator = kernel.sum()
    # alternatively can use convolution to get entire path
    # kernel = 0.5 ** np.cumsum(deltas / half_life)
    # numerator = np.convolve(array, kernel, mode="full")[:len(array)]
    # denominator = np.convolve(np.ones_like(array), kernel, mode="full")[:len(array)]
    return numerator / denominator

Then let's first confirm that the test implementation matches pandas (and the directly computed weighted average) when we don't apply the "dt" correction:

zeroes = 100_000
ms = 1_000_000  # 1 ms in nanoseconds

# Let's create a list of 100k 0s spaced 1 ms apart with a single 1 exactly
# 1 second after the last 0.
val = zeroes * [0] + [1]
id = list(range(0, zeroes * ms, ms))
id.append(id[-1] + 1_000 * ms)
id = pd.to_datetime(id)
df = pd.DataFrame(val, columns=['val'], index=id)

half_life = pd.Timedelta("3s")
vals = np.asarray(df["val"]).astype("float64")

# convert to floats
half_life_seconds = half_life / pd.Timedelta("1s")
dt_seconds = np.asarray((df.index.diff() / pd.Timedelta("1s")).fillna(0.0)).astype("float64")

ema_adjust = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=True)[-1]
ema_adjust_pd = ema_pd(df['val'], half_life, adjust=True).to_numpy()[-1]
ema_weighted_avg = ema_convolve(vals, dt_seconds, half_life=half_life_seconds)

print(ema_adjust)
print(ema_adjust_pd)
print(ema_weighted_avg)

They all agree:

0.0002909852504376329
0.00029098525043919435
0.0002909852504376267
Then let's see how `adjust=False` compares:
ema_no_adjust = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=False)[-1]
ema_no_adjust_pl = ema_polars(df["val"].rename_axis(index="ts").rename("val"), half_life,
                              by="ts").select("val")[-1].item()
print(ema_no_adjust)
print(ema_no_adjust_pl)

These numbers are now a lot larger as more weight is placed on the last non-zero observation:

0.2062994740159002
0.2062994740159002

Finally let's see what turning on the "dt" correction does:

ema_corrected = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=True, weight_by_dt=True)[-1]
ema_weighted_avg_corrected = ema_convolve(vals, dt_seconds, half_life=half_life_seconds, weight_by_dt=True)

print(ema_corrected)
print(ema_weighted_avg_corrected)

Notice that the discrepancy with the adjust=False result is much smaller:

0.22544862736755866
0.22544862736755866

@hleumas if I take your example numbers verbatim (1s halflife, 1 million samples), I get:

# equivalent to pandas adjust=True
0.0013838961963124187
0.0013838961963452783
0.0013838961963124207
# equivalent to your adjust=False (and polars current default)
0.5
0.5
# equivalent to what a 'corrected' adjust=True would return 
0.5808558454220695
0.5808558454220693

Sorry for the long comment - the behaviour of the corrected adjustment makes more sense to me, and it sounds like a relatively easy fix (just multiply the term used to update the numerator and denominator by delta_t(i) and handle the corner case where it is zero). What are your thoughts? Could we implement this new behaviour of adjust=True for pandas and polars?

hleumas commented 2 months ago

Well, I would say that you have to figure out how to deal with the fact that for n values you will always have only n-1 deltas.

But yeah, other than that this is sensible.

MarcoGorelli commented 2 months ago

Thanks @azmyrajab for the clear explanation. I think it makes sense; I'll experiment with it.

Well, I would say that you have to figure out how to deal with the fact that for n values you will always have only n-1 deltas.

But yeah, other than that this is sensible.

I think the fix might be

        if deltas[i] == 0:
            if i > 1:
                result[i] = result[i-1]
            else:
                result[i] = values[i]
            continue

? As y_0 should equal x_0 (the first observation doesn't have a history to be weighted by)?

hleumas commented 1 month ago

I guess I am just confused by the deltas array. Where is it coming from? What is the formula for delta[i] given the times t[i]?

MarcoGorelli commented 1 month ago

from here

dt_seconds = np.asarray((df.index.diff() / pd.Timedelta("1s")).fillna(0.0)).astype("float64")
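That is, delta[i] = (t[i] - t[i-1]) expressed in seconds, with the first (undefined) difference filled in as 0.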
hleumas commented 1 month ago

Ok, I get it. So, given that we have delta[0] == 0, we essentially completely ignore the first measured value. Now, I would say that that's slightly strange behaviour, but I guess it doesn't matter as long as we have enough values.

MarcoGorelli commented 1 month ago

yeah, the first result should be kept as-is, that's the modification I was suggesting in https://github.com/pandas-dev/pandas/issues/54328#issuecomment-2090277548

hleumas commented 1 month ago

I get that, and it is sensible. What I am getting at is that the subsequent results are completely unaffected by it. E.g. the 2nd result just equals the 2nd value. All I am saying is that this is strange. But I don't know what the ideal solution would be.

tserrao commented 1 week ago

Hi, I opened a PR (linked just above) to allow adjust=False with times and to handle irregular-interval deltas. Its scope is narrow, and it is still an open question what to do with adjust=True. But I wanted to keep moving forward with this issue :)