pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.44k stars 1.97k forks source link

Enable `pivot` on LazyFrame and speedup `pivot` on DataFrame #11967

Open Julian-J-S opened 1 year ago

Julian-J-S commented 1 year ago

Description

I understand that currently pivot is not supported for LazyFrame because the schema cannot be known.

However, if I know the schema/columns in advance or only want a subset this should be possible in Lazy mode and could also improve speed in Eager mode.

(In my personal experience you almost always know the columns in advance or only want a specific subset of columns.)

This would require an additional parameter like column_values which would be required for LazyFrame and optional (performance) for DataFrame

Benefits / Advantages:

Examples

df = pl.DataFrame({
    'article': ['pencil', 'pencil', 'pencil', 'pencil', 'pencil', 'teddy bear', 'teddy bear', 'teddy bear', 'teddy bear', 'teddy bear'],
    'price': [1.5, 1.3, 1.2, 0.99, 0.8, 19.95, 14.99, 12.99, 10.0, 7.5],
    'scale': [1, 50, 100, 500, 1000, 1, 20, 100, 200, 1000],
})

Current Way

df.lazy().pivot()
# AttributeError: 'LazyFrame' object has no attribute 'pivot'

df.pivot(
    values='price',
    index='article',
    columns='scale',
)
┌────────────┬───────┬──────┬───────┬──────┬──────┬───────┬──────┐
│ article    ┆ 1     ┆ 50   ┆ 100   ┆ 500  ┆ 1000 ┆ 20    ┆ 200  │
│ ---        ┆ ---   ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   ┆ ---  │
│ str        ┆ f64   ┆ f64  ┆ f64   ┆ f64  ┆ f64  ┆ f64   ┆ f64  │
╞════════════╪═══════╪══════╪═══════╪══════╪══════╪═══════╪══════╡
│ pencil     ┆ 1.5   ┆ 1.3  ┆ 1.2   ┆ 0.99 ┆ 0.8  ┆ null  ┆ null │
│ teddy bear ┆ 19.95 ┆ null ┆ 12.99 ┆ null ┆ 7.5  ┆ 14.99 ┆ 10.0 │
└────────────┴───────┴──────┴───────┴──────┴──────┴───────┴──────┘

NEW Way

df.lazy().pivot(
    values='price',
    index='article',
    columns='scale',
    column_values=[1000, 100, 1],  # <<<<< NEW parameter for LazyFrame (and DataFrame)
)
┌────────────┬──────┬───────┬───────┐
│ article    ┆ 1000 ┆ 100   ┆ 1     │
│ ---        ┆ ---  ┆ ---   ┆ ---   │
│ str        ┆ f64  ┆ f64   ┆ f64   │
╞════════════╪══════╪═══════╪═══════╡
│ pencil     ┆ 0.8  ┆ 1.2   ┆ 1.5   │
│ teddy bear ┆ 7.5  ┆ 12.99 ┆ 19.95 │
└────────────┴──────┴───────┴───────┘
deanm0000 commented 1 year ago

@alexander-beedie put this in the discord chat a while back

lf = pl.from_repr("""
┌───────────┬──────┬────────┬────────┬──────┐
│ portfolio ┆ year ┆ ticker ┆ price  ┆ size │
│ ---       ┆ ---  ┆ ---    ┆ ---    ┆ ---  │
│ str       ┆ i64  ┆ str    ┆ f64    ┆ i64  │
╞═══════════╪══════╪════════╪════════╪══════╡
│ aaa       ┆ 2019 ┆ AAPL   ┆ 73.41  ┆ 1000 │
│ aaa       ┆ 2020 ┆ AAPL   ┆ 133.72 ┆ 2500 │
│ aaa       ┆ 2021 ┆ AAPL   ┆ 177.57 ┆ 4250 │
│ aaa       ┆ 2019 ┆ IBM    ┆ 128.15 ┆ 1250 │
│ aaa       ┆ 2020 ┆ IBM    ┆ 118.87 ┆ 1800 │
│ aaa       ┆ 2021 ┆ IBM    ┆ 133.66 ┆ 2225 │
│ bbb       ┆ 2021 ┆ AAPL   ┆ 177.57 ┆ 500  │
│ bbb       ┆ 2021 ┆ IBM    ┆ 133.66 ┆ 750  │
│ bbb       ┆ 2020 ┆ AAPL   ┆ 133.72 ┆ 1025 │
└───────────┴──────┴────────┴────────┴──────┘""").lazy()

pfolio, year, ticker, price, size = [pl.col(c) for c in df.columns]

# ---------------------------------
# emulate lazy pivot with group_by
# ---------------------------------
lf.group_by(pfolio, ticker).agg(
    [
        pl.when(year == y).then(price * size).sum().alias(str(y)) 
        for y in (2019, 2020, 2021)
    ]
).sort(pfolio, ticker).collect()

# ┌───────────┬────────┬──────────┬──────────┬──────────┐
# │ portfolio ┆ ticker ┆ 2019     ┆ 2020     ┆ 2021     │
# │ ---       ┆ ---    ┆ ---      ┆ ---      ┆ ---      │
# │ str       ┆ str    ┆ f64      ┆ f64      ┆ f64      │
# ╞═══════════╪════════╪══════════╪══════════╪══════════╡
# │ aaa       ┆ AAPL   ┆ 73410.0  ┆ 334300.0 ┆ 754672.5 │
# │ aaa       ┆ IBM    ┆ 160187.5 ┆ 213966.0 ┆ 297393.5 │
# │ bbb       ┆ AAPL   ┆ 0.0      ┆ 137063.0 ┆ 88785.0  │
# │ bbb       ┆ IBM    ┆ 0.0      ┆ 0.0      ┆ 100245.0 │
# └───────────┴────────┴──────────┴──────────┴──────────┘

So you can basically do it now.

Julian-J-S commented 1 year ago

So you can basically do it now.

yes, this is also described in the documentation but this is exactly what I want to avoid ;) Code is just building blocks and you can basically build everything you want yourself.

Libraries like polars abstract a lot of the boilerplate/complexety behind a approachable, coherent and easy to use API where they do a great job! But there are some edges where this could be improved and this is one of them.

Pivoting is a common use case and a lazy implementation is possible and would benefit a lot of users :)

deanm0000 commented 1 year ago

Fair enough, I didn't realize that was in the docs. I also wasn't trying to argue against the feature request. I was just trying to share that snippet in case it would help.

ritchie46 commented 9 months ago

@MarcoGorelli is this a nice follow up now you touched the pivots? ;)

MarcoGorelli commented 9 months ago

totally

Julian-J-S commented 1 month ago

Any progress / news here? 😊

I see multiple benefits from this:

MarcoGorelli commented 1 month ago

this is nice-to-have, but not something i have capacity to prioritise right now