pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.95k stars 1.93k forks source link

Exponentially Weighted Moving Average #2148

Closed mhconradt closed 2 years ago

mhconradt commented 2 years ago

Describe your feature request

Many applications in quantitative finance use an exponentially-weighted moving average (called EWMA henceforth) as a smoothing mechanism that gives higher weight to more recent data. I propose adding a suite of functions to Polars along the lines of rolling_ functions that supports EWMA as ewm_mean and in the future could support other metrics such as ewm_std, ewm_var, ewm_corr, ewm_cov, etc. There are a few orthogonal parameters involved here:

  1. Rate of decay of weights: How quickly does the algorithm reduce the weights of older values? alpha, com (center of mass), span, halflife, etc.
  2. Boundary behavior: What does the algorithm do at the beginning of the series? How many values does it need to "see" before outputting a non-null value min_periods?
  3. Adjustment: As described in the Pandas documentation (recommended reading), the recursive definition of an exponentially-weighted moving average converges to the formal definition as the number of values approaches infinity. However, there is a significant difference at first. Take for example an EWMA with alpha=0.5 of values 2, 5, and 3 in that order.

E[0] = 2 / (1 - 0.5) ^ 0 = 2 E[1] = 5 / (1 + (1 - 0.5) ^ 1) + 2 (1 - 0.5) ^ 1 / (1 + (1 - 0.5) ^ 1) = 4 E[2] = (3 + 5 (1 - 0.5) ^ 1 + 2 * (1 - 0.5) ^ 2) / (1 + (1 - 0.5) ^ 1 + (1 - 0.5) ^ 2) = 3.428571

R[0] = 2 R[1] = 0.5 5 + 0.5 R[0] = 3.5 R[2] = 0.5 4 + 0.5 R[1] = 3.75

Note abs(R[1] - E[1])> abs(R[2] - E[2]), the two converge over time. Good news is there's a way to compute E recursively for the EWMA that's not too bad. Bad news is it probably does not work for ewm_std, ewm_corr, etc.

  1. Null behavior: (1) Does the algorithm treat the missing value as zero (reduce the weight of older values but don't add a new value) or ignore the value and use the previous non-null output as output for this position ignore_nulls.

Relevant Pandas source code (fasten your seatbelts, it's Cython). Note it does not use a recursive definition, but rather appears to trim the window before calling the function based on the rate of decay.

import polars as pl

from pyarrow import fs

trades = pl.read_parquet('crypto-exchange-pds/ftx-trades/',
                         filesystem=fs.S3FileSystem(),
                         filters=[('ds', '>=', '2021-10-24')],
                         columns=['market', 'time', 'id', 'price', 'size'])

trades['market'] = trades.market.cast(pl.datatypes.Categorical)
trades.sort(['market', 'id'], in_place=True)

candles_1h = trades.groupby_dynamic('time', every='1h', period='1h',
                                    by='market') \
    .agg([pl.col('price').first().alias('open'),
          pl.col('price').max().alias('high'),
          pl.col('price').min().alias('low'),
          pl.col('price').last().alias('close'),
          pl.col('size').sum().alias('volume'),
          (pl.col('price') * pl.col('size')).sum().alias('quote_volume'),
          ((pl.col('price') * pl.col('size')).sum() / pl.col('size').sum()).alias('vwap')
          ])

above_trend = candles_1h.select(['*', pl.col('close').ewm_mean(span=20).over('market').flatten().alias('ema20')])\
  .filter(pl.col('ema20') > pl.col('close'))

(Edited example to show usage as window function (important that calculation carried out independently for each market))

mhconradt commented 2 years ago

Would be nice to have alias ewma for ewm_mean.

mhconradt commented 2 years ago

@ritchie46 this would be huge for quants, probably the biggest the project is missing in that realm at this point.

mhconradt commented 2 years ago

Might be better to use ew_ prefix as opposed to ewm_.

ritchie46 commented 2 years ago

@ritchie46 this would be huge for quants, probably the biggest the project is missing in that realm at this point.

Then let's do it! :)

I would implement this as a kernel in https://github.com/pola-rs/polars/tree/master/polars/polars-arrow/src/kernels. And then we could dispatch an expression as:

let ca = column.f64()?.rechunk()?.downcast_iter().next().unwrap();
ca.apply_kernel(ewm_mean)?.into_series()

But given that for E[i] we must compute all values before E[i], this scales quadratically, or is there a clever trick?

ritchie46 commented 2 years ago

But given that for E[i] we must compute all values before E[i], this scales quadratically, or is there a clever trick?

I assume you only do this for the window size?

ritchie46 commented 2 years ago

R[2] = 0.5 4 + 0.5 R[1] = 3.75

Should be R[2] = 0.5 * 3 + 0.5 * R[1] = 3.25?

zundertj commented 2 years ago

Agreed that this is one of the most important missing features for quants.

One thing I do wonder, are there no Rust implementations yet you could wrap around? Because a fast (streaming, no need for quadratic scaling) implementation that deals with nulls and fixed window sizes, possibly operate on a separate time axis, can do means, variances, covariances, correlations, etc is quite some work, and not specific to a dataframe library.

ritchie46 commented 2 years ago

One thing I do wonder, are there no Rust implementations yet you could wrap around? Because a fast (streaming, no need for quadratic scaling) implementation that deals with nulls and fixed window sizes, possibly operate on a separate time axis, can do means, variances, covariances, correlations, etc is quite some work, and not specific to a dataframe library.

I already have something. :) (At least only for the means. :see_no_evil: )

mhconradt commented 2 years ago

@ritchie46 I looked at the implementation and it's pretty clever, bravo! It would be good to add min_periods and ignore_na if we can do that without ruining the secret scaling sauce. I estimate ew_mean will get used 100x as much as the other ew_ functions so it's the most important to get right.

ritchie46 commented 2 years ago

Yes, I will add those options as well. And I found that the ewm_std and ewm_var can be easily computed from the ewma, so I will also add those.