pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Allow streaming for .n_unique() #11512

Open ThomasMAhern opened 12 months ago

ThomasMAhern commented 12 months ago

Description

I use .n_unique() quite a bit with window functions, and I'd love to request that it be made streamable if at all possible. Thank you!

orlp commented 12 months ago

What exactly do you mean by 'making it streamable'?

Computing the exact unique count requires, in the worst case, the full dataset to be materialized.

I would like to add a HyperLogLog-style sketch in the future.
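
To illustrate the trade-off described above, here is a toy HyperLogLog-style sketch in plain Python. It is not Polars code and not a statement of how Polars would implement it: the class name `HyperLogLogSketch`, the 4096-register size, and the use of `blake2b` as the hash are all assumptions made for illustration. The point is that the sketch uses a fixed amount of memory no matter how many rows pass through it, and that sketches built on separate chunks can be merged, at the cost of returning an estimate rather than an exact count.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Toy HyperLogLog-style distinct counter (illustrative, not Polars code)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                      # number of index bits (assumed default)
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m   # fixed memory, independent of input size

    def add(self, value) -> None:
        digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")             # 64-bit hash of the value
        idx = h >> (64 - self.p)                      # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # 1-based position of first 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HyperLogLogSketch") -> None:
        # Register-wise max: each chunk can be sketched separately, then combined.
        # Both sketches must have been built with the same p.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting) while registers are still empty.
            return self.m * math.log(self.m / zeros)
        return raw


# Feed the values in two "chunks", as a streaming engine would see them.
chunk_a, chunk_b = HyperLogLogSketch(), HyperLogLogSketch()
for v in range(0, 60_000):
    chunk_a.add(v)
for v in range(40_000, 100_000):    # overlaps chunk_a on 20,000 values
    chunk_b.add(v)
chunk_a.merge(chunk_b)
approx = chunk_a.estimate()         # an estimate of the 100,000 distinct values
```

With 4096 registers the typical relative error is on the order of a couple of percent, so this only helps when an approximate count is acceptable. An exact count per window, as requested in this issue, still has to remember every distinct value it has seen.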

ThomasMAhern commented 12 months ago

I mean, that seems logical - and likely infeasible, but everything y'all do is magic to me, so I figured maybe there was a way. 🤷‍♂️ I primarily use polars to work with larger-than-memory datasets, so I'm trying to .sink_parquet at every step of the way. Being able to work with smaller out-of-memory datasets has pushed me to want to try larger and larger ones. Ideally, I'd love to use this:

(
    df
    .explode('cars')
    .with_columns(
        n_unique_colors=pl.n_unique('colors').over('cars'),
        n_unique_models=pl.n_unique('models').over('cars'),
    )
    .sink_parquet('new_file.parquet')
)

orlp commented 12 months ago

@ThomasMAhern What does df look like?

deanm0000 commented 12 months ago

@orlp

import polars as pl

df = pl.DataFrame({'cars': ['a','a','b','b','c','c'],
                   'colors': ['x','x','z','x','y','z'],
                   'models': [1,2,3,1,2,3]}).lazy()

Here's a workaround that streams:

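def join_n_unique(df, n_unique, over, new_name):
    # Accept a single column name or a list of names for both arguments.
    if not isinstance(n_unique, list):
        n_unique = [n_unique]
    if not isinstance(over, list):
        over = [over]
    return (
        df.join(
            # Keep one row per distinct (over, n_unique) combination, then
            # count the rows per group: that count is the number of unique values.
            df
                .unique(over + n_unique)
                .group_by(over)
                # pl.count() is deprecated in newer Polars releases; use pl.len() there.
                .agg(pl.count().alias(new_name)),
            on=over
        )
    )

# Attach the helper to LazyFrame so it can be chained like a built-in method.
pl.LazyFrame.join_n_unique = join_n_unique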
(
    df
    .join_n_unique('colors','cars','n_unique_colors')
    .join_n_unique('models','cars','n_unique_models')
    .sink_parquet('testblah.parquet')
)
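
To make the logic of the workaround explicit, here is a plain-Python sketch of the same three steps (dedupe, count per group, join back) on the sample rows from the example frame. It does not use Polars, and the helper name `n_unique_per_group` and the tuple layout are assumptions made for illustration only.

```python
from collections import defaultdict

# Same sample rows as the example frame: (cars, colors, models).
rows = [
    ("a", "x", 1), ("a", "x", 2),
    ("b", "z", 3), ("b", "x", 1),
    ("c", "y", 2), ("c", "z", 3),
]


def n_unique_per_group(rows, key_idx, value_idx):
    """Dedupe (key, value) pairs, then count the pairs left per key."""
    # Step 1: one entry per distinct pair, like .unique(over + n_unique).
    distinct_pairs = {(r[key_idx], r[value_idx]) for r in rows}
    # Step 2: count entries per key, like .group_by(over).agg(count).
    counts = defaultdict(int)
    for key, _ in distinct_pairs:
        counts[key] += 1
    return dict(counts)


color_counts = n_unique_per_group(rows, 0, 1)   # {'a': 1, 'b': 2, 'c': 2}
model_counts = n_unique_per_group(rows, 0, 2)   # {'a': 2, 'b': 2, 'c': 2}

# Step 3: join the per-group counts back onto every row, like the join on `over`.
joined = [(*r, color_counts[r[0]], model_counts[r[0]]) for r in rows]
```

Each row ends up carrying the same per-group counts that `pl.n_unique(...).over('cars')` would attach, but the work is expressed as a dedupe, an aggregation, and a join rather than as a window function.
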
deanm0000 commented 12 months ago

Another n_unique request: https://github.com/pola-rs/polars/issues/11249