Open ThomasMAhern opened 12 months ago
What exactly do you mean by 'making it streamable'?
Computing an exact unique count requires (in the worst case) materializing the full dataset.
I would like to add a HyperLogLog-style sketch in the future.
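For context, a HyperLogLog-style sketch estimates the number of distinct values in constant memory: each value is hashed, the first few bits of the hash pick a register, and each register remembers the maximum run of leading zeros it has seen. The sketch below is a minimal illustrative implementation in plain Python (this is not Polars' implementation, which per the comment above does not exist yet):

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch: estimates distinct count in O(2**p) memory."""

    def __init__(self, p=14):
        self.p = p              # number of index bits
        self.m = 1 << p         # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        indicator = sum(2.0 ** -r for r in self.registers)
        e = alpha * self.m * self.m / indicator
        # small-range correction: fall back to linear counting
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            e = self.m * math.log(self.m / zeros)
        return e
```

Because the sketch only ever merges a value into a fixed-size register array, it never needs the full dataset in memory, which is exactly what makes it compatible with a streaming engine (at the cost of a small, tunable relative error, roughly `1.04 / sqrt(2**p)`).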
I mean, that seems logical - and likely infeasible, but everything y'all do is magic to me, so I figured maybe there was a way. 🤷♂️
I primarily use Polars to work with larger-than-memory datasets, so I'm trying to `.sink_parquet` at every step of the way. Being able to work with smaller out-of-memory datasets has pushed me to want to try larger and larger ones. Ideally, I'd love to use this:
```python
(df
    .explode('cars')
    .with_columns(n_unique_colors=pl.n_unique('colors').over('cars'),
                  n_unique_models=pl.n_unique('models').over('cars'))
    .sink_parquet('new_file.parquet')
)
```
@ThomasMAhern What does `df` look like?
@orlp
```python
df = pl.DataFrame({'cars': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'colors': ['x', 'x', 'z', 'x', 'y', 'z'],
                   'models': [1, 2, 3, 1, 2, 3]}).lazy()
```
Here's a workaround that streams:
```python
def join_n_unique(df, n_unique, over, new_name):
    if not isinstance(n_unique, list):
        n_unique = [n_unique]
    if not isinstance(over, list):
        over = [over]
    return (
        df.join(
            df
            .unique(over + n_unique)
            .group_by(over)
            .agg(pl.count().alias(new_name)),
            on=over,
        )
    )

pl.LazyFrame.join_n_unique = join_n_unique

(
    df
    .join_n_unique('colors', 'cars', 'n_unique_colors')
    .join_n_unique('models', 'cars', 'n_unique_models')
    .sink_parquet('testblah.parquet')
)
```
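As a sanity check, the workaround's core logic, dropping duplicate (key, value) pairs and then counting rows per key, can be replayed in plain Python on the toy frame above (plain dict/set code, not Polars):

```python
from collections import defaultdict

# toy data matching the example frame above
cars   = ['a', 'a', 'b', 'b', 'c', 'c']
colors = ['x', 'x', 'z', 'x', 'y', 'z']

# equivalent of .unique(['cars', 'colors']).group_by('cars').agg(pl.count())
distinct = defaultdict(set)
for car, color in zip(cars, colors):
    distinct[car].add(color)
n_unique_colors = {car: len(s) for car, s in distinct.items()}
print(n_unique_colors)  # {'a': 1, 'b': 2, 'c': 2}
```

This matches what `pl.n_unique('colors').over('cars')` would produce per group, which is why joining the aggregated counts back on `cars` gives the same column.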
Another `n_unique` request: https://github.com/pola-rs/polars/issues/11249
Description
I use `.n_unique()` quite a bit with window functions and I'd love to request that it be made streamable if at all possible! Thank you