Allow streaming for .n_unique()

pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

https://docs.pola.rs

Allow streaming for .n_unique() #11512

Open ThomasMAhern opened 1 year ago

ThomasMAhern commented 1 year ago

Description

I use .n_unique() quite a bit with window functions and I'd love to request that it be made streamable if at all possible! Thank you

orlp commented 1 year ago

What exactly do you mean by 'making it streamable'?

Computing the exact unique count requires (in the worst case) materializing the full dataset.

I would like to add a HyperLogLog-style sketch in the future.
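To illustrate what a HyperLogLog-style sketch buys in a streaming setting, here is a minimal pure-Python version. It is only a sketch under stated assumptions, not anything Polars ships: the class name `HyperLogLog`, the precision `p=12`, and the choice of SHA-1 as the hash are all illustrative. It keeps a fixed array of 4096 small registers regardless of how many rows pass through, and returns an approximate distinct count rather than an exact one.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical helper, not a Polars API)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Hash the value to 64 bits; SHA-1 is an arbitrary choice made here.
        digest = hashlib.sha1(repr(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        z = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Stream 300,000 rows containing 100,000 distinct values.
hll = HyperLogLog()
for i in range(300_000):
    hll.add(i % 100_000)
approx = hll.estimate()
print(round(approx))
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%, so the trade-off is bounded memory in exchange for an approximate answer.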

ThomasMAhern commented 1 year ago

I mean, that seems logical - and likely infeasible, but everything y'all do is magic to me, so I figured maybe there was a way. 🤷‍♂️ I primarily use polars to work with larger-than-memory datasets, so I'm trying to .sink_parquet at every step of the way. Being able to work with smaller out-of-memory datasets has pushed me to want to try larger and larger ones. Ideally, I'd love to use this:

(df
  .explode('cars')
  .with_columns(n_unique_colors = pl.n_unique('colors').over('cars'), 
                n_unique_models = pl.n_unique('models').over('cars'))
  .sink_parquet('new_file.parquet')
)

orlp commented 1 year ago

@ThomasMAhern What does df look like?

deanm0000 commented 1 year ago

@orlp

import polars as pl

df = pl.DataFrame({'cars': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'colors': ['x', 'x', 'z', 'x', 'y', 'z'],
                   'models': [1, 2, 3, 1, 2, 3]}).lazy()

Here's a workaround that streams:

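def join_n_unique(df, n_unique, over, new_name):
    # Accept either a single column name or a list of names.
    if not isinstance(n_unique, list):
        n_unique = [n_unique]
    if not isinstance(over, list):
        over = [over]
    return df.join(
        # Deduplicate on (over, n_unique), then count the surviving rows per group.
        df.unique(over + n_unique)
        .group_by(over)
        .agg(pl.count().alias(new_name)),
        on=over,
    )


# Attach the helper as a LazyFrame method so it can be chained.
pl.LazyFrame.join_n_unique = join_n_unique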
(
    df
    .join_n_unique('colors','cars','n_unique_colors')
    .join_n_unique('models','cars','n_unique_models')
    .sink_parquet('testblah.parquet')
)
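
For readers who want to see why the workaround gives the same numbers as n_unique over a window, here is the same logic spelled out in plain Python on the sample data. This is an illustration only, not Polars code: `add_n_unique` is a hypothetical helper and the rows are held as a list of dicts.

```python
from collections import defaultdict

# Same sample data as the LazyFrame above, one dict per row.
rows = [
    {"cars": "a", "colors": "x", "models": 1},
    {"cars": "a", "colors": "x", "models": 2},
    {"cars": "b", "colors": "z", "models": 3},
    {"cars": "b", "colors": "x", "models": 1},
    {"cars": "c", "colors": "y", "models": 2},
    {"cars": "c", "colors": "z", "models": 3},
]


def add_n_unique(rows, value_col, over_col, new_name):
    # Step 1: deduplicate (group, value) pairs, like .unique(over + n_unique).
    pairs = {(r[over_col], r[value_col]) for r in rows}
    # Step 2: count the remaining pairs per group, like .group_by(over).agg(count).
    counts = defaultdict(int)
    for group, _ in pairs:
        counts[group] += 1
    # Step 3: attach the per-group count to every original row, like the join.
    return [{**r, new_name: counts[r[over_col]]} for r in rows]


result = add_n_unique(rows, "colors", "cars", "n_unique_colors")
result = add_n_unique(result, "models", "cars", "n_unique_models")
for r in result:
    print(r)
```

After deduplication each (group, value) pair appears once, so a plain row count per group equals the number of distinct values in that group.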

deanm0000 commented 1 year ago

Another n_unique request: https://github.com/pola-rs/polars/issues/11249