pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Sampling with groupby #16725

Open stevenlis opened 5 months ago

stevenlis commented 5 months ago

Description

Polars has no built-in way to sample rows within each group after a group_by. pandas offers this as DataFrameGroupBy.sample:

https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html

A Polars equivalent could look like this (proposed API, shown with the expected output):

import polars as pl

df = pl.DataFrame(
    {
        'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
        'num': [6, 7, 8, 1, 4, 5, 2]
    }
).group_by('id').sample(n=1)
shape: (3, 2)
┌─────┬─────┐
│ id  ┆ num │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ A   ┆ 6   │
│ B   ┆ 1   │
│ C   ┆ 2   │
└─────┴─────┘
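
For comparison, here is a minimal sketch of the pandas behaviour referenced above, using the same data. The variable names `pdf` and `sampled` and the `random_state` value are only illustrative; which row is drawn per group depends on the seed.

```python
import pandas as pd

# Same data as in the Polars example above.
pdf = pd.DataFrame(
    {
        "id": ["A", "A", "B", "B", "B", "C", "C"],
        "num": [6, 7, 8, 1, 4, 5, 2],
    }
)

# Draw one random row from each group; random_state makes the draw repeatable.
sampled = pdf.groupby("id").sample(n=1, random_state=0)
print(sampled)
```

The result has exactly one row per distinct `id`, which is the behaviour requested for Polars.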
connor-elliott commented 5 months ago

df.group_by('id').agg(pl.col('num').shuffle().head(n)) ?

Leo-Lee15 commented 2 weeks ago

Polars can already do this, for example with GroupBy.map_groups. For reference: https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.map_groups.html

import polars as pl

df = pl.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "color": ["red", "green", "green", "red", "red"],
        "shape": ["square", "triangle", "square", "triangle", "square"],
    }
)
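# Not recommended: map_groups calls a Python function once per group, which is slow.
df.group_by("color").map_groups(
    lambda group_df: group_df.sample(2)
)

# Recommended: a window expression that keeps up to 2 random rows per group.
df.filter(
    pl.int_range(pl.len()).shuffle(seed=42).over("color") < 2
)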