pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Sample by Group #16726

Open stevenlis opened 2 months ago

stevenlis commented 2 months ago

Description

Sometimes you need to sample a DataFrame by group. For example, if I have three IDs, I may want to randomly select two of them and keep all rows belonging to those two. Users have done this in pandas and dplyr; the idea is to get all unique ID values, sample from them, and then filter the DataFrame down to the sampled IDs.

import polars as pl

df = pl.DataFrame(
    {
        'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
        'num': [6, 7, 8, 1, 4, 5, 2]
    }
)

# Sample two unique IDs, then keep every row that belongs to them.
sampled_ids = df.get_column('id').unique().sample(n=2)
df = df.filter(pl.col('id').is_in(sampled_ids.to_list()))
shape: (4, 2)
┌─────┬─────┐
│ id  ┆ num │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ A   ┆ 6   │
│ A   ┆ 7   │
│ C   ┆ 5   │
│ C   ┆ 2   │
└─────┴─────┘

Consider adding a parameter to the .sample() method to simplify this process.

avimallu commented 2 months ago

For now, you can do:

df.with_columns(pl.int_range(0, pl.len()).shuffle().over("group") == 2)

stevenlis commented 2 months ago

@avimallu This approach doesn't work because the IDs have different numbers of rows, and it doesn't subset the dataset at all.

avimallu commented 2 months ago

Sorry, you should use filter, not with_columns.

stevenlis commented 2 months ago

@avimallu I think there's been a misinterpretation. This feature request asks for selecting n random IDs together with all rows associated with those IDs. I'm struggling to understand the purpose of your code. I already provided an example of the expected result above.