ThomasMAhern opened 1 year ago
What exactly do you mean by 'making it streamable'?
Computing an exact unique count requires (in the worst case) materializing the full dataset.
I would like to add a HyperLogLog-style sketch in the future.
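For context, a HyperLogLog sketch trades exactness for fixed memory: each item is hashed, the first bits pick a register, and the register remembers the longest run of leading zeros seen. This is not Polars' implementation (which doesn't exist yet, per the comment above), just a minimal illustrative sketch in plain Python:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch: estimates the number of distinct items
    in a stream using fixed memory (2**p single-byte registers)."""

    def __init__(self, p=14):
        self.p = p                  # precision: more registers = lower error
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # rank = 1-based position of the leftmost set bit in `rest`
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias-correction constant
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # small-range correction: fall back to linear counting
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw
```

What makes this streamable: `add` touches one register per item, and merging two sketches is just a per-register `max`, so the work parallelizes across chunks of a larger-than-memory dataset.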
I mean, that seems logical - and likely infeasible, but everything y'all do is magic to me, so I figured maybe there was a way. 🤷‍♂️
I primarily use polars to work with larger-than-memory datasets, so I'm trying to `.sink_parquet` at every step of the way. Being able to work with smaller out-of-memory datasets has pushed me to want to try larger and larger ones. Ideally, I'd love to use this:

```python
(df
 .explode('cars')
 .with_columns(n_unique_colors=pl.n_unique('colors').over('cars'),
               n_unique_models=pl.n_unique('models').over('cars'))
 .sink_parquet('new_file.parquet')
)
```
@ThomasMAhern What does `df` look like?
@orlp

```python
df = pl.DataFrame({'cars': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'colors': ['x', 'x', 'z', 'x', 'y', 'z'],
                   'models': [1, 2, 3, 1, 2, 3]}).lazy()
```
Here's a workaround that streams:
```python
def join_n_unique(df, n_unique, over, new_name):
    if not isinstance(n_unique, list):
        n_unique = [n_unique]
    if not isinstance(over, list):
        over = [over]
    return (
        df.join(
            df
            .unique(over + n_unique)
            .group_by(over)
            .agg(pl.count().alias(new_name)),
            on=over,
        )
    )

pl.LazyFrame.join_n_unique = join_n_unique

(
    df
    .join_n_unique('colors', 'cars', 'n_unique_colors')
    .join_n_unique('models', 'cars', 'n_unique_models')
    .sink_parquet('testblah.parquet')
)
```
Another `n_unique` request: https://github.com/pola-rs/polars/issues/11249
Description

I use `.n_unique()` quite a bit with window functions and I'd love to request that it be made streamable if at all possible! Thank you