Upsample fills time_column but not groups

david-waterworth commented 5 months ago

Checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df = pl.DataFrame(
    {
        "time": [
            datetime(2021, 2, 1),
            datetime(2021, 2, 3),
            datetime(2021, 2, 4),
            datetime(2021, 2, 5),
        ],
        "groups": ["A", "A", "A", "A"],
        "values": [0, 1, 2, 3],
    }
).set_sorted("time")

df.upsample(
    time_column="time", every="1d", group_by="groups", maintain_order=True
)

shape: (5, 3)
┌─────────────────────┬────────┬────────┐
│ time                ┆ groups ┆ values │
│ ---                 ┆ ---    ┆ ---    │
│ datetime[μs]        ┆ str    ┆ i64    │
╞═════════════════════╪════════╪════════╡
│ 2021-02-01 00:00:00 ┆ A      ┆ 0      │
│ 2021-02-02 00:00:00 ┆ null   ┆ null   │
│ 2021-02-03 00:00:00 ┆ A      ┆ 1      │
│ 2021-02-04 00:00:00 ┆ A      ┆ 2      │
│ 2021-02-05 00:00:00 ┆ A      ┆ 3      │
└─────────────────────┴────────┴────────┘

Log output

No response

Issue description

The time columns has been filled, but the groups column is null. I expected that the group_by column would be set to the value of the group, i.e. the following does what I want (i.e. I had to manually interpolate the group - I'm not 100% sure this works for edge cases though).

df = pl.DataFrame(
    {
        "time": [
            datetime(2021, 2, 1),
            datetime(2021, 2, 3),
            datetime(2021, 2, 4),
            datetime(2021, 2, 5),
        ],
        "groups": ["A", "A", "A", "A"],
        "values": [0, 1, 2, 3],
    }
).set_sorted("time")

df.upsample(
    time_column="time", every="1d", group_by="groups", maintain_order=True
).with_columns(pl.col("groups").forward_fill())

shape: (5, 3)
┌─────────────────────┬────────┬────────┐
│ time                ┆ groups ┆ values │
│ ---                 ┆ ---    ┆ ---    │
│ datetime[μs]        ┆ str    ┆ i64    │
╞═════════════════════╪════════╪════════╡
│ 2021-02-01 00:00:00 ┆ A      ┆ 0      │
│ 2021-02-02 00:00:00 ┆ A      ┆ null   │
│ 2021-02-03 00:00:00 ┆ A      ┆ 1      │
│ 2021-02-04 00:00:00 ┆ A      ┆ 2      │
│ 2021-02-05 00:00:00 ┆ A      ┆ 3      │
└─────────────────────┴────────┴────────┘

Expected behavior

when you upsample over a group, the new records should include non-null values for both the time_column and group_by column.

Installed versions

``` Replace this line with the output of pl.show_versions(). Leave the backticks in place. ```

matt-netiz commented 3 months ago

Hi,

I encountered the same problem. I noticed this issue has been open for over two months without any updates. Could you provide an update on the current status or any potential timeline for addressing this issue?

Thank you!

cmdlineluser commented 3 months ago

@matt-netiz Yes, it looks like this particular issue slipped through unfortunately.

If it helps, I believe upsample is essentially syntax sugar for a range explosion and a join:

(df.group_by("groups", maintain_order=True)
   .agg(pl.datetime_range(pl.col("time").min(), pl.col("time").max(), interval="1d"))
   .explode("time")
   .join(df, on=["groups", "time"], how="left")
)

# shape: (5, 3)
# ┌────────┬─────────────────────┬────────┐
# │ groups ┆ time                ┆ values │
# │ ---    ┆ ---                 ┆ ---    │
# │ str    ┆ datetime[μs]        ┆ i64    │
# ╞════════╪═════════════════════╪════════╡
# │ A      ┆ 2021-02-01 00:00:00 ┆ 0      │
# │ A      ┆ 2021-02-02 00:00:00 ┆ null   │
# │ A      ┆ 2021-02-03 00:00:00 ┆ 1      │
# │ A      ┆ 2021-02-04 00:00:00 ┆ 2      │
# │ A      ┆ 2021-02-05 00:00:00 ┆ 3      │
# └────────┴─────────────────────┴────────┘

I think the problem may be that the group_by column isn't included in the join:

https://github.com/pola-rs/polars/blob/30a5534087a543ed5a501e977c27e59a6894ddc2/crates/polars-time/src/upsample.rs#L227

matt-netiz commented 3 months ago

That solves my issue, thanks for the quick response!

cmdlineluser commented 3 months ago

Perhaps @MarcoGorelli can take a look and confirm when they are free.