pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

polars cumsum over a group is slower than pandas groupby cumsum #5801

Open ChristopherRussell opened 1 year ago

ChristopherRussell commented 1 year ago

Problem description

Hi, I'm wondering if there is room to improve this functionality?

import pandas as pd
import numpy as np
import polars as pl

df = pd.DataFrame({'val': np.random.randn(5_000_000), 'key': ['a', 'b', 'c', 'd', 'e'] * 1_000_000})
df.key = df.key.astype('category')

%%time
df.groupby('key').val.cumsum()
# takes 50ms

pdf = pl.from_pandas(df)

%%time
pdf.select(pl.col('val').cumsum().over('key'))
# takes 200ms
ritchie46 commented 1 year ago

You are doing a window function, not a groupby?
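A minimal sketch of the distinction, using a tiny example frame (the groupby method was later renamed group_by):

```python
import polars as pl

df = pl.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]})

# Window function: shape is preserved, one cumsum value per original row.
df.select(pl.col("val").cumsum().over("key"))  # -> [1, 2, 4]

# Groupby aggregation: one row per group, cumsums collected into lists.
df.groupby("key").agg(pl.col("val").cumsum())  # -> 'a': [1, 4], 'b': [2]
```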

ChristopherRussell commented 1 year ago

New here - perhaps I don't understand how to achieve the same result as the pandas groupby cumsum in polars. The snippet df.groupby('key').val.cumsum() returns a 'transformed' groupby result, so its output has the same number of rows as the dataframe.

cmdlineluser commented 1 year ago

You can use .groupby().agg().explode(), which gets closer to the pandas time.

>>> %timeit pdf.select(pl.col('val').cumsum().over('key'))
189 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pdf.groupby("key").agg(pl.col("val").cumsum()).explode("val")
117 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.groupby('key').val.cumsum()
58.2 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
ChristopherRussell commented 1 year ago

Thanks for the alternative solution @cmdlineluser - useful for a polars newbie. It is a little different in that it changes the order of the output (which could be useful sometimes). Still a bit curious about the difference in performance, though. I have found polars to outperform pandas in all other areas so far!

cmdlineluser commented 1 year ago

groupby can maintain the original order if desired: .groupby("key", maintain_order=True) - hopefully others with more knowledge than myself can comment on the performance.

desprado2 commented 10 months ago

> groupby can maintain the original order if desired: .groupby("key", maintain_order=True) - hopefully others with more knowledge than myself can comment on the performance.

It seems that .group_by("key", maintain_order=True).agg(pl.col("val").cumsum()).explode("val") is still different from .select(pl.col('val').cumsum().over('key')), since the former groups the keys together. For example, if key=['a', 'b', 'a', 'b'], the former changes the order into ['a', 'a', 'b', 'b'] while the latter preserves the original row order.

Is there a better equivalent expression?
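A minimal sketch of that ordering difference, using the key=['a', 'b', 'a', 'b'] example:

```python
import polars as pl

df = pl.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Window function: original row order is preserved,
# so the keys stay ['a', 'b', 'a', 'b'].
df.select([pl.col("key"), pl.col("val").cumsum().over("key")])

# group_by + agg + explode: rows are regrouped per key, even with
# maintain_order=True, so the keys come out as ['a', 'a', 'b', 'b'].
df.group_by("key", maintain_order=True).agg(pl.col("val").cumsum()).explode("val")
```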

cmdlineluser commented 10 months ago

@desprado2 Very true, I missed that at the time - my bad.

Are you having performance issues with .select(pl.col('val').cumsum().over('key')) ?

Trying this now I get:

PANDAS     Elapsed time: 0.2451 seconds
PANDAS CAT Elapsed time: 0.2743 seconds
POLARS     Elapsed time: 0.2847 seconds
POLARS CAT Elapsed time: 0.1968 seconds
desprado2 commented 10 months ago

> @desprado2 Very true, I missed that at the time - my bad.
>
> Are you having performance issues with .select(pl.col('val').cumsum().over('key')) ?
>
> Trying this now I get:
>
> PANDAS     Elapsed time: 0.2451 seconds
> PANDAS CAT Elapsed time: 0.2743 seconds
> POLARS     Elapsed time: 0.2847 seconds
> POLARS CAT Elapsed time: 0.1968 seconds

It seems that polars is still slower:

import numpy as np
import pandas as pd
import polars as pl

large_pdf = pd.DataFrame({"val": np.random.randn(5_000_000), "key": np.random.randint(114, 514, (5_000_000))})
large_pdf.set_index("key", inplace=True)
large_plf = pl.from_pandas(large_pdf, include_index=True)

%%time
large_plf.lazy().select(pl.col("val").cumsum().over("key")).collect()
# CPU times: user 316 ms, sys: 145 ms, total: 461 ms
# Wall time: 168 ms

%%time
large_pdf.groupby("key")["val"].cumsum()
# CPU times: user 70.7 ms, sys: 3.75 ms, total: 74.5 ms
# Wall time: 73.5 ms
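For comparison, the groupby/agg/explode form from earlier in the thread can be timed on the same frame (a sketch; note it reorders rows, as discussed above):

```python
%%time
large_plf.lazy().group_by("key").agg(pl.col("val").cumsum()).explode("val").collect()
```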
cmdlineluser commented 10 months ago

Here's an attempted benchmark that runs everything in isolation:

Code

```python
import time

import numpy as np
import pandas as pd
import polars as pl


def create_df():
    return pd.DataFrame({"val": np.random.randn(N), "key": np.random.randint(114, 514, (N))})


def pandas_with_index():
    df = create_df()
    df.set_index("key", inplace=True)
    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()
    return time.perf_counter() - start


def pandas_with_cat_index():
    df = create_df()
    df.key = df.key.astype("category")
    df.set_index("key", inplace=True)
    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()
    return time.perf_counter() - start


def pandas_without_index():
    df = create_df()
    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()
    return time.perf_counter() - start


def polars_with_index():
    df = create_df()
    df.set_index("key", inplace=True)
    df = pl.from_pandas(df, include_index=True)
    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))
    return time.perf_counter() - start


def polars_with_cat_index():
    df = create_df()
    df.key = df.key.astype("category")
    df.set_index("key", inplace=True)
    df = pl.from_pandas(df, include_index=True)
    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))
    return time.perf_counter() - start


def polars_without_index():
    df = create_df()
    df = pl.from_pandas(df, include_index=True)
    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))
    return time.perf_counter() - start


N = 20_000_000
pkgs = "pandas", "polars"
tests = ["with_index", "with_cat_index", "without_index"]

results = []
for pkg in pkgs:
    for test in tests:
        name = f"{pkg}_{test}"
        print(name)
        try:
            took = globals()[name]()
            results.append({"name": name, "time": took})
        except pl.ComputeError:
            print(f"[WARNING]: skipping test `{name}`")

print(pl.DataFrame(results))
```

Results

shape: (5, 2)
┌───────────────────────┬──────────┐
│ name                  ┆ time     │
│ ---                   ┆ ---      │
│ str                   ┆ f64      │
╞═══════════════════════╪══════════╡
│ pandas_with_index     ┆ 0.165075 │
│ pandas_with_cat_index ┆ 0.079915 │
│ pandas_without_index  ┆ 0.162487 │
│ polars_with_index     ┆ 0.363737 │
│ polars_without_index  ┆ 0.34091  │
└───────────────────────┴──────────┘

The original approach @ChristopherRussell posted, with the category index df.key.astype('category'), seems to be special-cased by pandas. (I couldn't convert that frame using pl.from_pandas, so I had to skip the polars_with_cat_index test.)
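A possible workaround (untested here, an assumption on my part): route the categorical data through a string column and re-cast it on the polars side:

```python
# Hypothetical workaround: convert the categorical column to string before
# pl.from_pandas, then cast it back to Categorical in polars.
df = create_df()
df.key = df.key.astype("category")
plf = pl.from_pandas(df.assign(key=df["key"].astype(str))).with_columns(
    pl.col("key").cast(pl.Categorical)
)
```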

Perhaps @ritchie46 could take a look at the performance side of things when he has time.