pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add `residual` method #12760

Open wukan1986 opened 11 months ago

wukan1986 commented 11 months ago

Description

In quantitative investing, it is often necessary to neutralize data against market value and industry, which requires running a multiple linear regression and taking the residuals. I tried implementing this myself, but the performance was poor, so I hope it can be implemented in Rust.

My UDF:

import time
from datetime import datetime
from typing import Sequence

import numpy as np
import polars as pl

def neutralize_residual(cols: Sequence[pl.Series]):
    cols = [c.to_numpy() for c in cols]
    y = cols[0]
    x = cols[1:]
    A = np.vstack(x).T
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    y_hat = np.sum(A * coef, axis=1)
    residual = y - y_hat
    return pl.Series(residual)

def func(df: pl.DataFrame):
    df = df.with_columns([
        pl.map_batches(['y', 'constant', 'x1', 'x2'], neutralize_residual).alias('residual'),
    ])
    return df

if __name__ == '__main__':
    date = pl.datetime_range(datetime(2000, 1, 1), datetime(2023, 1, 1), eager=True)
    date = pl.concat([date] * 10)

    df = pl.DataFrame({
        "date": date,
        'industry': pl.int_range(0, len(date), eager=True) % 30,
        'constant': 1.0,
        "y": pl.int_range(0, len(date), eager=True) * 1.0 + np.random.randn(len(date), ),
        "x1": pl.int_range(0, len(date), eager=True) * 2.0 + np.random.randn(len(date), ),
        "x2": pl.int_range(0, len(date), eager=True) * 3.0 + np.random.randn(len(date), ),
    })
    print(df)

    start_time = time.perf_counter()

    # too slow
    df = df.group_by(['date', 'industry']).map_groups(func)

    end_time = time.perf_counter()
    # executed in 32.886850799957756 seconds on my PC
    print(f"executed in {end_time - start_time} seconds")

    print(df.head())

I would expect a method like this:

pl.col('y').residual(['constant', 'x1', 'x2']).alias('residual')

FYI: https://github.com/pola-rs/polars/issues/7994 https://github.com/abstractqqq/polars_ds_extension/blob/main/src/num_ext/ols.rs#L39 https://github.com/abstractqqq/polars_ds_extension/blob/main/tests/test_ext.py#L222

orlp commented 11 months ago

I feel this would be better suited to a plugin for now. Maybe at some point in the future we'll support linear regressions, but it would definitely be something more generic than a residual method.

wukan1986 commented 11 months ago

Hi, @orlp

In linear regression, it seems that only residual and y_hat have the same shape as y, while coef has a different shape (one value per regressor rather than one per row).

For quantitative researchers, the residual is the most important output.
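
To illustrate the shape argument, here is a minimal NumPy-only sketch. The data is synthetic and the variable names are just for illustration; it follows the same `np.linalg.lstsq` approach as the UDF above rather than any Polars API.

```python
import numpy as np

# Synthetic data: 50 rows, an intercept column plus two regressors.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with columns [constant, x1, x2].
A = np.column_stack([np.ones(n), x1, x2])

# Least-squares fit: one coefficient per column of A.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ coef       # fitted values, one per row
residual = y - y_hat   # residuals, one per row

print(coef.shape, y_hat.shape, residual.shape)  # (3,) (50,) (50,)
```

Because residual and y_hat are row-aligned with y, they fit naturally into an expression that returns a column, whereas coef would need a list or struct output.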

mutecamel commented 11 months ago

I found this: https://stackoverflow.com/a/74906705/1894479

azmyrajab commented 7 months ago

Hi! I recently released a Polars extension package, polars-ols, dedicated to linear regression implemented in Rust and exposed as Polars expressions.

It will do exactly what you want with mode="residuals", and it should be very fast for your cross-sectional regressions because everything runs natively and in parallel in Rust. It also supports regularization, sample weighting, null handling, etc., which may also be relevant to your problem.

If you can, try it out with pip install polars-ols and let me know if it works for you.