Open Julian-J-S opened 1 year ago
@alexander-beedie put this in the discord chat a while back
import polars as pl

lf = pl.from_repr("""
┌───────────┬──────┬────────┬────────┬──────┐
│ portfolio ┆ year ┆ ticker ┆ price ┆ size │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 ┆ i64 │
╞═══════════╪══════╪════════╪════════╪══════╡
│ aaa ┆ 2019 ┆ AAPL ┆ 73.41 ┆ 1000 │
│ aaa ┆ 2020 ┆ AAPL ┆ 133.72 ┆ 2500 │
│ aaa ┆ 2021 ┆ AAPL ┆ 177.57 ┆ 4250 │
│ aaa ┆ 2019 ┆ IBM ┆ 128.15 ┆ 1250 │
│ aaa ┆ 2020 ┆ IBM ┆ 118.87 ┆ 1800 │
│ aaa ┆ 2021 ┆ IBM ┆ 133.66 ┆ 2225 │
│ bbb ┆ 2021 ┆ AAPL ┆ 177.57 ┆ 500 │
│ bbb ┆ 2021 ┆ IBM ┆ 133.66 ┆ 750 │
│ bbb ┆ 2020 ┆ AAPL ┆ 133.72 ┆ 1025 │
└───────────┴──────┴────────┴────────┴──────┘""").lazy()
# one expression per column, in schema order
pfolio, year, ticker, price, size = [pl.col(c) for c in lf.columns]
# ---------------------------------
# emulate lazy pivot with group_by
# ---------------------------------
lf.group_by(pfolio, ticker).agg(
    [
        pl.when(year == y).then(price * size).sum().alias(str(y))
        for y in (2019, 2020, 2021)
    ]
).sort(pfolio, ticker).collect()
# ┌───────────┬────────┬──────────┬──────────┬──────────┐
# │ portfolio ┆ ticker ┆ 2019 ┆ 2020 ┆ 2021 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ f64 ┆ f64 ┆ f64 │
# ╞═══════════╪════════╪══════════╪══════════╪══════════╡
# │ aaa ┆ AAPL ┆ 73410.0 ┆ 334300.0 ┆ 754672.5 │
# │ aaa ┆ IBM ┆ 160187.5 ┆ 213966.0 ┆ 297393.5 │
# │ bbb ┆ AAPL ┆ 0.0 ┆ 137063.0 ┆ 88785.0 │
# │ bbb ┆ IBM ┆ 0.0 ┆ 0.0 ┆ 100245.0 │
# └───────────┴────────┴──────────┴──────────┴──────────┘
So you can basically do it now.
yes, this is also described in the documentation but this is exactly what I want to avoid ;) Code is just building blocks and you can basically build everything you want yourself.
Libraries like polars abstract a lot of the boilerplate/complexity behind an approachable, coherent and easy-to-use API, and they do a great job! But there are some edges where this could be improved, and this is one of them.
Pivoting is a common use case and a lazy implementation is possible and would benefit a lot of users :)
Fair enough, I didn't realize that was in the docs. I also wasn't trying to argue against the feature request. I was just trying to share that snippet in case it would help.
@MarcoGorelli is this a nice follow up now you touched the pivots? ;)
totally
Any progress / news here? 😊
I see multiple benefits from this:
this is nice-to-have, but not something i have capacity to prioritise right now
Description
I understand that currently pivot is not supported for LazyFrame because the schema cannot be known. However, if I know the schema/columns in advance or only want a subset, this should be possible in Lazy mode and could also improve speed in Eager mode.
(In my personal experience you almost always know the columns in advance or only want a specific subset of columns.)
This would require an additional parameter like column_values, which would be required for LazyFrame and optional (performance) for DataFrame.
Benefits / Advantages:
- pivot in lazy mode (without fancy group_by workarounds)
- pivot in eager mode if values are provided (knowing the schema in advance can possibly skip many calculations/aggregations/columns)
Examples
Current Way
NEW Way
pivot on LazyFrame enabled