pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.28k stars 1.85k forks source link

Upsample fills time_column but not groups #15530

Open david-waterworth opened 5 months ago

david-waterworth commented 5 months ago

Checks

Reproducible example

df = pl.DataFrame(
    {
        "time": [
            datetime(2021, 2, 1),
            datetime(2021, 2, 3),
            datetime(2021, 2, 4),
            datetime(2021, 2, 5),
        ],
        "groups": ["A", "A", "A", "A"],
        "values": [0, 1, 2, 3],
    }
).set_sorted("time")

df.upsample(
    time_column="time", every="1d", group_by="groups", maintain_order=True
)
shape: (5, 3)
┌─────────────────────┬────────┬────────┐
│ time                ┆ groups ┆ values │
│ ---                 ┆ ---    ┆ ---    │
│ datetime[μs]        ┆ str    ┆ i64    │
╞═════════════════════╪════════╪════════╡
│ 2021-02-01 00:00:00 ┆ A      ┆ 0      │
│ 2021-02-02 00:00:00 ┆ null   ┆ null   │
│ 2021-02-03 00:00:00 ┆ A      ┆ 1      │
│ 2021-02-04 00:00:00 ┆ A      ┆ 2      │
│ 2021-02-05 00:00:00 ┆ A      ┆ 3      │
└─────────────────────┴────────┴────────┘

Log output

No response

Issue description

The time columns has been filled, but the groups column is null. I expected that the group_by column would be set to the value of the group, i.e. the following does what I want (i.e. I had to manually interpolate the group - I'm not 100% sure this works for edge cases though).

df = pl.DataFrame(
    {
        "time": [
            datetime(2021, 2, 1),
            datetime(2021, 2, 3),
            datetime(2021, 2, 4),
            datetime(2021, 2, 5),
        ],
        "groups": ["A", "A", "A", "A"],
        "values": [0, 1, 2, 3],
    }
).set_sorted("time")

df.upsample(
    time_column="time", every="1d", group_by="groups", maintain_order=True
).with_columns(pl.col("groups").forward_fill())
shape: (5, 3)
┌─────────────────────┬────────┬────────┐
│ time                ┆ groups ┆ values │
│ ---                 ┆ ---    ┆ ---    │
│ datetime[μs]        ┆ str    ┆ i64    │
╞═════════════════════╪════════╪════════╡
│ 2021-02-01 00:00:00 ┆ A      ┆ 0      │
│ 2021-02-02 00:00:00 ┆ A      ┆ null   │
│ 2021-02-03 00:00:00 ┆ A      ┆ 1      │
│ 2021-02-04 00:00:00 ┆ A      ┆ 2      │
│ 2021-02-05 00:00:00 ┆ A      ┆ 3      │
└─────────────────────┴────────┴────────┘

Expected behavior

when you upsample over a group, the new records should include non-null values for both the time_column and group_by column.

Installed versions

``` Replace this line with the output of pl.show_versions(). Leave the backticks in place. ```
matt-netiz commented 3 months ago

Hi,

I encountered the same problem. I noticed this issue has been open for over two months without any updates. Could you provide an update on the current status or any potential timeline for addressing this issue?

Thank you!

cmdlineluser commented 3 months ago

@matt-netiz Yes, it looks like this particular issue slipped through unfortunately.

If it helps, I believe upsample is essentially syntax sugar for a range explosion and a join:

(df.group_by("groups", maintain_order=True)
   .agg(pl.datetime_range(pl.col("time").min(), pl.col("time").max(), interval="1d"))
   .explode("time")
   .join(df, on=["groups", "time"], how="left")
)

# shape: (5, 3)
# ┌────────┬─────────────────────┬────────┐
# │ groups ┆ time                ┆ values │
# │ ---    ┆ ---                 ┆ ---    │
# │ str    ┆ datetime[μs]        ┆ i64    │
# ╞════════╪═════════════════════╪════════╡
# │ A      ┆ 2021-02-01 00:00:00 ┆ 0      │
# │ A      ┆ 2021-02-02 00:00:00 ┆ null   │
# │ A      ┆ 2021-02-03 00:00:00 ┆ 1      │
# │ A      ┆ 2021-02-04 00:00:00 ┆ 2      │
# │ A      ┆ 2021-02-05 00:00:00 ┆ 3      │
# └────────┴─────────────────────┴────────┘

I think the problem may be that the group_by column isn't included in the join:

https://github.com/pola-rs/polars/blob/30a5534087a543ed5a501e977c27e59a6894ddc2/crates/polars-time/src/upsample.rs#L227

matt-netiz commented 3 months ago

That solves my issue, thanks for the quick response!

cmdlineluser commented 3 months ago

Perhaps @MarcoGorelli can take a look and confirm when they are free.

MarcoGorelli commented 3 months ago

thanks for the report - I think I agree, the groups should also be filled in

lanzalibre commented 1 month ago

Is this a duplicate of https://github.com/pola-rs/polars/issues/14131 ?

MarcoGorelli commented 1 month ago

Is this a duplicate of #14131 ?

i don't think so