Open etiennebacher opened 1 year ago
FYI this is the best I have so far:
import polars as pl
import time

df = pl.DataFrame(
    {
        "orig": ["France", "France", "UK", "UK", "Spain"],
        "dest": ["Japan", "Vietnam", "Japan", "China", "China"],
        "year": [2020, 2021, 2019, 2020, 2022],
        "value": [1, 2, 3, 4, 5],
    }
)

(
    df.select(pl.col(["orig", "dest", "year"]).unique().sort().implode())
    .explode("orig")
    .explode("dest")
    .explode("year")
    .join(df, how="left", on=["orig", "dest", "year"])
)
Problem description
The R package tidyr has a very useful function called complete() that creates all missing combinations of variables. Here's an example:

We had 3 unique values in orig, 3 in dest, and 4 in year, so we end up with 3x3x4 = 36 combinations. Combinations that didn't exist are filled with NA (even though tidyr::complete() has some args to fill them).

I don't think there's an equivalent function in polars for now, but I think it would be useful to have something like this. I asked on StackOverflow, and the only answer so far is to use repeated cross joins before joining with the original data, which seems to work and to be reasonably fast. However, one neat feature of tidyr::complete() is that it automatically sorts the output by the variables specified. When I tried to sort() after the repeated cross joins, the timing was 2x as high as that of tidyr::complete().

It was suggested to me to propose this as a feature request, so here it is. Thanks for all your work!
Python code to remake the data + what I have so far: