pola-rs / polars-xdt

Polars plugin offering eXtra stuff for DateTimes
https://marcogorelli.github.io/polars-xdt-docs/
MIT License

Add `ignore_nulls` (and `ignore_nan`) option to `xdt.ewma_by_time` #71

Open wbeardall opened 5 months ago

wbeardall commented 5 months ago

Currently, if there are any NaN values in the value column passed to `xdt.ewma_by_time`, then all subsequent values in the output are NaN (see snippet). It would be great if there were an `ignore_nulls` flag, similar to the one on the built-in `ewm_mean`, allowing NaN or null values to be ignored during the calculation. In that case, the presence or absence of a row containing null or NaN would have no effect on subsequent rows; i.e., the EWMA output for the final row of each of the two tables below should be identical.

shape: (2, 2)
┌───────────┬────────────────────────────┐
│ values    ┆ time                       │
│ ---       ┆ ---                        │
│ f64       ┆ datetime[ns]               │
╞═══════════╪════════════════════════════╡
│ -0.042898 ┆ 2000-01-01 00:00:00        │
│ 0.186466  ┆ 2000-01-01 00:00:00.000002 │
└───────────┴────────────────────────────┘
shape: (3, 2)
┌───────────┬────────────────────────────┐
│ values    ┆ time                       │
│ ---       ┆ ---                        │
│ f64       ┆ datetime[ns]               │
╞═══════════╪════════════════════════════╡
│ -0.042898 ┆ 2000-01-01 00:00:00        │
│ NaN       ┆ 2000-01-01 00:00:00.000001 │
│ 0.186466  ┆ 2000-01-01 00:00:00.000002 │
└───────────┴────────────────────────────┘

Reproducible snippet

from datetime import timedelta

import numpy as np
import polars as pl
import polars_xdt as xdt

n = 100

# Noisy linear ramp sampled once per microsecond
df = pl.DataFrame({
    "values": np.linspace(0, 10, n) + 0.1 * np.random.normal(size=n),
    "time": np.datetime64("2000-01-01 00:00:00")
    + np.asarray([i * np.timedelta64(1000, "ns") for i in range(n)]),
})

new = df.with_columns(
    xdt.ewma_by_time("values", times="time", half_life=timedelta(microseconds=1)).alias("ewma")
)

# True
print(new["ewma"].is_finite().all())

# Replace values above 5 with NaN, then apply the same EWMA
new_with_nan = df.with_columns(xdt.ewma_by_time(
    pl.when(pl.col("values") > 5).then(np.nan).otherwise(pl.col("values")),
    times="time",
    half_life=timedelta(microseconds=1),
).alias("ewma"))

# False
print(new_with_nan["ewma"].is_finite().all())
MarcoGorelli commented 5 months ago

thanks @wbeardall for the request! seems reasonable, will take a look

wbeardall commented 5 months ago

I've submitted a PR showing how I'd go about implementing this. Let me know your thoughts!

MarcoGorelli commented 5 months ago

> similar to the one on the built-in `ewm_mean`, allowing NaN or null values to be ignored during the calculation

Are you sure this is what the ewm_mean one does?

In [15]: s = pl.Series([1.1, 2.5, 2.6, 2.1, float('nan'), 5.1])

In [16]: s.ewm_mean(alpha=.1, ignore_nulls=True)
Out[16]:
shape: (6,)
Series: '' [f64]
[
        1.1
        1.836842
        2.11845
        2.113085
        NaN
        NaN
]

Looks like NaN values still propagate there?

Which looks correct to me - Polars (unlike pandas) generally distinguishes NaN and null
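
For anyone coming from pandas, a quick illustration of that distinction (standard Polars behavior; fill_nan(None) is the usual bridge between the two):

import polars as pl

s = pl.Series([1.0, float("nan"), None])
print(s.is_nan())        # [false, true, null]  -- NaN is a valid float value
print(s.is_null())       # [false, false, true] -- null marks missing data
print(s.fill_nan(None))  # [1.0, null, null]    -- NaN converted into null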

wbeardall commented 5 months ago

I think you might be right here; in all honesty, I'm a very recent convert from pandas and may have misread the design of `pl.Series.ewm_mean`. The main motivation for this issue was a way to prevent NaN propagation in time-series data, similar to how pandas handles NaN elements with its ignore_na flag (below); that is not necessarily the same as the ignore_nulls feature of ewm_mean. This seems to be the same question as yours from last night about the need to distinguish between NaN and null values.

>>> import pandas as pd
>>> s = pd.Series([1.1, 2.5, 2.6, 2.1, float('nan'), 5.1])
>>> s.ewm(alpha=.1, ignore_na=True, min_periods=1).mean()
0    1.100000
1    1.836842
2    2.118450
3    2.113085
4    2.113085
5    2.842473
dtype: float64

My particular use cases, and the PR I submitted last night, are focused on the ignore_na case, which I would appreciate as a feature; is it worth adding ignore_nulls as a separate feature under this issue or another one?
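
For concreteness, here is a minimal pure-Python sketch of the pandas-style skip-NaN semantics applied to a time-based EWMA. This is illustrative only, not the kernel from the PR; the helper name and the a = 0.5 ** (dt / half_life) weighting are assumptions:

import math
from datetime import datetime, timedelta

def ewma_by_time_skip_nan(values, times, half_life):
    # Illustrative sketch: NaN rows are skipped entirely, so they have
    # no effect on later outputs (pandas-style ignore_na semantics).
    out, mean, prev_t = [], None, None
    for v, t in zip(values, times):
        if math.isnan(v):
            out.append(mean)  # carry the last mean forward
            continue
        if mean is None:
            mean = v
        else:
            # Decay is measured from the last *non-NaN* sample, so the
            # skipped row leaves no trace in the weighting.
            a = 0.5 ** ((t - prev_t) / half_life)
            mean = a * mean + (1 - a) * v
        prev_t = t
        out.append(mean)
    return out

# The final value matches the two-row table above whether or not the
# NaN row is present:
ts = [datetime(2000, 1, 1) + timedelta(microseconds=i) for i in (0, 1, 2)]
print(ewma_by_time_skip_nan([-0.042898, float("nan"), 0.186466], ts, timedelta(microseconds=1)))
print(ewma_by_time_skip_nan([-0.042898, 0.186466], [ts[0], ts[2]], timedelta(microseconds=1)))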

wbeardall commented 5 months ago

Perhaps it is better to propagate NaNs, while having null values be skipped as initially written in the PR, and to communicate to users that filling NaN with null is the mechanism by which they can carry the mean past such values? e.g.

>>> import polars as pl
>>> s = pl.Series([1.1, 2.5, 2.6, 2.1, float('nan'), 5.1])
>>> s.fill_nan(None).ewm_mean(alpha=.1, ignore_nulls=True)
shape: (6,)
Series: '' [f64]
[
        1.1
        1.836842
        2.11845
        2.113085
        2.113085
        2.842473
]
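
Applied to the reproducible snippet from the top of the thread, the workaround would then be a one-line addition (hypothetical usage, assuming ewma_by_time gains the PR's null-skipping behavior; reuses df, pl, np, xdt, and timedelta from that snippet):

new_without_nan = df.with_columns(xdt.ewma_by_time(
    pl.when(pl.col("values") > 5).then(np.nan).otherwise(pl.col("values"))
      .fill_nan(None),  # NaN -> null, so the nulls can be skipped
    times="time",
    half_life=timedelta(microseconds=1),
).alias("ewma"))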
MarcoGorelli commented 5 months ago

yeah doing s.fill_nan(None) first feels like idiomatic Polars

wbeardall commented 5 months ago

I've pushed an implementation of the above, along with a robustness improvement: in the previous version, if a series started with a null value, the kernel would panic, as it attempted to call .unwrap() on that null. Let me know your thoughts!