Enable `pivot` on LazyFrame and speedup `pivot` on DataFrame

Julian-J-S commented 1 year ago

Description

I understand that currently pivot is not supported for LazyFrame because the schema cannot be known.

However, if I know the schema/columns in advance or only want a subset this should be possible in Lazy mode and could also improve speed in Eager mode.

(In my personal experience you almost always know the columns in advance or only want a specific subset of columns.)

This would require an additional parameter like column_values which would be required for LazyFrame and optional (performance) for DataFrame

Benefits / Advantages:

enable pivot in lazy mode (without fancy groupby workarounds)
speed up pivot in eager mode if values are provided (knowing schema in advance can possibly skip many caclulations/aggreagations/columns)

Examples

df = pl.DataFrame({
    'article': ['pencil', 'pencil', 'pencil', 'pencil', 'pencil', 'teddy bear', 'teddy bear', 'teddy bear', 'teddy bear', 'teddy bear'],
    'price': [1.5, 1.3, 1.2, 0.99, 0.8, 19.95, 14.99, 12.99, 10.0, 7.5],
    'scale': [1, 50, 100, 500, 1000, 1, 20, 100, 200, 1000],
})

Current Way

df.lazy().pivot()
# AttributeError: 'LazyFrame' object has no attribute 'pivot'

df.pivot(
    values='price',
    index='article',
    columns='scale',
)
┌────────────┬───────┬──────┬───────┬──────┬──────┬───────┬──────┐
│ article    ┆ 1     ┆ 50   ┆ 100   ┆ 500  ┆ 1000 ┆ 20    ┆ 200  │
│ ---        ┆ ---   ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   ┆ ---  │
│ str        ┆ f64   ┆ f64  ┆ f64   ┆ f64  ┆ f64  ┆ f64   ┆ f64  │
╞════════════╪═══════╪══════╪═══════╪══════╪══════╪═══════╪══════╡
│ pencil     ┆ 1.5   ┆ 1.3  ┆ 1.2   ┆ 0.99 ┆ 0.8  ┆ null  ┆ null │
│ teddy bear ┆ 19.95 ┆ null ┆ 12.99 ┆ null ┆ 7.5  ┆ 14.99 ┆ 10.0 │
└────────────┴───────┴──────┴───────┴──────┴──────┴───────┴──────┘

NEW Way

df.lazy().pivot(
    values='price',
    index='article',
    columns='scale',
    column_values=[1000, 100, 1],  # <<<<< NEW parameter for LazyFrame (and DataFrame)
)
┌────────────┬──────┬───────┬───────┐
│ article    ┆ 1000 ┆ 100   ┆ 1     │
│ ---        ┆ ---  ┆ ---   ┆ ---   │
│ str        ┆ f64  ┆ f64   ┆ f64   │
╞════════════╪══════╪═══════╪═══════╡
│ pencil     ┆ 0.8  ┆ 1.2   ┆ 1.5   │
│ teddy bear ┆ 7.5  ┆ 12.99 ┆ 19.95 │
└────────────┴──────┴───────┴───────┘

pivot on LazyFrame enabled
performance boost in eager mode
get columns in specified order

deanm0000 commented 1 year ago

@alexander-beedie put this in the discord chat a while back

lf = pl.from_repr("""
┌───────────┬──────┬────────┬────────┬──────┐
│ portfolio ┆ year ┆ ticker ┆ price  ┆ size │
│ ---       ┆ ---  ┆ ---    ┆ ---    ┆ ---  │
│ str       ┆ i64  ┆ str    ┆ f64    ┆ i64  │
╞═══════════╪══════╪════════╪════════╪══════╡
│ aaa       ┆ 2019 ┆ AAPL   ┆ 73.41  ┆ 1000 │
│ aaa       ┆ 2020 ┆ AAPL   ┆ 133.72 ┆ 2500 │
│ aaa       ┆ 2021 ┆ AAPL   ┆ 177.57 ┆ 4250 │
│ aaa       ┆ 2019 ┆ IBM    ┆ 128.15 ┆ 1250 │
│ aaa       ┆ 2020 ┆ IBM    ┆ 118.87 ┆ 1800 │
│ aaa       ┆ 2021 ┆ IBM    ┆ 133.66 ┆ 2225 │
│ bbb       ┆ 2021 ┆ AAPL   ┆ 177.57 ┆ 500  │
│ bbb       ┆ 2021 ┆ IBM    ┆ 133.66 ┆ 750  │
│ bbb       ┆ 2020 ┆ AAPL   ┆ 133.72 ┆ 1025 │
└───────────┴──────┴────────┴────────┴──────┘""").lazy()

pfolio, year, ticker, price, size = [pl.col(c) for c in df.columns]

# ---------------------------------
# emulate lazy pivot with group_by
# ---------------------------------
lf.group_by(pfolio, ticker).agg(
    [
        pl.when(year == y).then(price * size).sum().alias(str(y)) 
        for y in (2019, 2020, 2021)
    ]
).sort(pfolio, ticker).collect()

# ┌───────────┬────────┬──────────┬──────────┬──────────┐
# │ portfolio ┆ ticker ┆ 2019     ┆ 2020     ┆ 2021     │
# │ ---       ┆ ---    ┆ ---      ┆ ---      ┆ ---      │
# │ str       ┆ str    ┆ f64      ┆ f64      ┆ f64      │
# ╞═══════════╪════════╪══════════╪══════════╪══════════╡
# │ aaa       ┆ AAPL   ┆ 73410.0  ┆ 334300.0 ┆ 754672.5 │
# │ aaa       ┆ IBM    ┆ 160187.5 ┆ 213966.0 ┆ 297393.5 │
# │ bbb       ┆ AAPL   ┆ 0.0      ┆ 137063.0 ┆ 88785.0  │
# │ bbb       ┆ IBM    ┆ 0.0      ┆ 0.0      ┆ 100245.0 │
# └───────────┴────────┴──────────┴──────────┴──────────┘

So you can basically do it now.

Julian-J-S commented 1 year ago

So you can basically do it now.

yes, this is also described in the documentation but this is exactly what I want to avoid ;) Code is just building blocks and you can basically build everything you want yourself.

Libraries like polars abstract a lot of the boilerplate/complexety behind a approachable, coherent and easy to use API where they do a great job! But there are some edges where this could be improved and this is one of them.

Pivoting is a common use case and a lazy implementation is possible and would benefit a lot of users :)

deanm0000 commented 1 year ago

Fair enough, I didn't realize that was in the docs. I also wasn't trying to argue against the feature request. I was just trying to share that snippet in case it would help.

ritchie46 commented 9 months ago

@MarcoGorelli is this a nice follow up now you touched the pivots? ;)

MarcoGorelli commented 9 months ago

totally

Julian-J-S commented 1 month ago

Any progress / news here? 😊

I see multiple benefits from this:

improved pivot DataFrame performance
enable pivot on LazyFrame
probably enable and also much better for streaming

MarcoGorelli commented 1 month ago

this is nice-to-have, but not something i have capacity to prioritise right now

pola-rs / polars