pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.21k stars 1.95k forks source link

Create missing combinations of variables (like `tidyr::complete()` in R) #9722

Open etiennebacher opened 1 year ago

etiennebacher commented 1 year ago

Problem description

The R package tidyr has a very useful function called complete() that creates all missing combinations of variables. Here's an example:

test <- data.frame(
  orig = c("France", "France", "UK", "UK", "Spain"),
  dest = c("Japan", "Vietnam", "Japan", "China", "China"),
  year = c(2020, 2021, 2019, 2020, 2022),
  value = c(1, 2, 3, 4, 5)
)
test
#>     orig    dest year value
#> 1 France   Japan 2020     1
#> 2 France Vietnam 2021     2
#> 3     UK   Japan 2019     3
#> 4     UK   China 2020     4
#> 5  Spain   China 2022     5

tidyr::complete(test, orig, dest, year)
#> # A tibble: 36 × 4
#>    orig   dest     year value
#>    <chr>  <chr>   <dbl> <dbl>
#>  1 France China    2019    NA
#>  2 France China    2020    NA
#>  3 France China    2021    NA
#>  4 France China    2022    NA
#>  5 France Japan    2019    NA
#>  6 France Japan    2020     1
#>  7 France Japan    2021    NA
#>  8 France Japan    2022    NA
#>  9 France Vietnam  2019    NA
#> 10 France Vietnam  2020    NA
#> # ℹ 26 more rows

We had 3 unique values in orig, 3 in dest and 4 in year so we end up with 3x3x4 = 36 combinations. Combinations that didn't exist are filled with NA (even though tidyr::complete() has some args to fill them).


I don't think there's an equivalent function in polars for now but I think it would be useful to have something like this. I asked in on StackOverflow and the only answer so far is to use repeated crossjoins before joining with the original data, which seems to work and to be reasonably fast. However, one neat feature of tidyr::complete() is that it automatically sorts the output by the variables specified. When I tried to sort() after the repeated crossjoins, the timing went 2x as high as tidyr::complete().

It was suggested to me to propose this as a feature request, so here it is. Thanks for all your work!


Python code to remake the data + what I have so far:

import polars as pl

df = pl.DataFrame({
        "orig": ["France", "France", "UK", "UK", "Spain"],
        "dest": ["Japan", "Vietnam", "Japan", "China", "China"],
        "year": [2020, 2021, 2019, 2020, 2022],
        "value": [1, 2, 3, 4, 5],
})
(
    df
    .select("orig")
    .unique()
    .join(df.select("dest").unique(), how="cross")
    .join(df.select("year").unique(), how="cross")
    .join(df, how="left", on=["orig", "dest", "year"])
    .sort(["orig", "dest", "year"])
)

shape: (36, 4)
┌────────┬─────────┬──────┬───────┐
│ orig   ┆ dest    ┆ year ┆ value │
│ ---    ┆ ---     ┆ ---  ┆ ---   │
│ str    ┆ str     ┆ i64  ┆ i64   │
╞════════╪═════════╪══════╪═══════╡
│ France ┆ China   ┆ 2019 ┆ null  │
│ France ┆ China   ┆ 2020 ┆ null  │
│ France ┆ China   ┆ 2021 ┆ null  │
│ France ┆ China   ┆ 2022 ┆ null  │
│ …      ┆ …       ┆ …    ┆ …     │
│ UK     ┆ Vietnam ┆ 2019 ┆ null  │
│ UK     ┆ Vietnam ┆ 2020 ┆ null  │
│ UK     ┆ Vietnam ┆ 2021 ┆ null  │
│ UK     ┆ Vietnam ┆ 2022 ┆ null  │
└────────┴─────────┴──────┴───────┘
etiennebacher commented 1 year ago

FYI this is the best I have so far:

import polars as pl
import time

df = pl.DataFrame(
    {
        "orig": ["France", "France", "UK", "UK", "Spain"],
        "dest": ["Japan", "Vietnam", "Japan", "China", "China"],
        "year": [2020, 2021, 2019, 2020, 2022],
        "value": [1, 2, 3, 4, 5],
    }
)

(
    df.select(pl.col(["orig", "dest", "year"]).unique().sort().implode())
    .explode("orig")
    .explode("dest")
    .explode("year")
    .join(df, how="left", on=["orig", "dest", "year"])
)