pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

group_by_dynamic with offset computes wrong window starts when a DST time change happens just before the 1st window #15966

Open michelbl opened 5 months ago

michelbl commented 5 months ago

Reproducible example

from datetime import datetime, timedelta, UTC

import polars as pl

print(
    pl.DataFrame(
        data={
            "t": pl.Series(
                [
                    datetime(2023, 3, 26, 14, 56, tzinfo=UTC),
                    datetime(2023, 3, 27, 14, 56, tzinfo=UTC),
                ]
            )
            .dt.cast_time_unit("ms")
            .dt.convert_time_zone("Europe/Paris"),
            "q": [4, 5],
        }
    )
    .set_sorted("t")
    .group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
    .agg([pl.sum("q").alias("q")])
)

print(
    pl.DataFrame(
        data={
            "t": pl.Series(
                [
                    datetime(2023, 10, 29, 14, 56, tzinfo=UTC),
                    datetime(2023, 10, 30, 14, 56, tzinfo=UTC),
                ]
            )
            .dt.cast_time_unit("ms")
            .dt.convert_time_zone("Europe/Paris"),
            "q": [4, 5],
        }
    )
    .set_sorted("t")
    .group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
    .agg([pl.sum("q").alias("q")])
)

Log output

No output

Issue description

When a DST transition ("spring forward" or "fall back") happens between midnight and the first window start, all of the window starts are offset incorrectly.

shape: (2, 2)
┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2023-03-26 07:00:00 CEST   ┆ 4   │
│ 2023-03-27 07:00:00 CEST   ┆ 5   │
└────────────────────────────┴─────┘
shape: (2, 2)
┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2023-10-29 05:00:00 CET    ┆ 4   │
│ 2023-10-30 05:00:00 CET    ┆ 5   │
└────────────────────────────┴─────┘

Expected behavior

shape: (2, 2)
┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2023-03-26 06:00:00 CEST   ┆ 4   │
│ 2023-03-27 06:00:00 CEST   ┆ 5   │
└────────────────────────────┴─────┘
shape: (2, 2)
┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2023-10-29 06:00:00 CET    ┆ 4   │
│ 2023-10-30 06:00:00 CET    ┆ 5   │
└────────────────────────────┴─────┘

Installed versions

```
--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Python:               3.11.4 (main, Jun 26 2023, 15:13:33) [GCC 11.3.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2023.10.0
gevent:
hvplot:
matplotlib:           3.8.2
nest_asyncio:         1.5.8
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.3
pyarrow:              11.0.0
pydantic:             2.5.1
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.23
xlsx2csv:
xlsxwriter:
```
MarcoGorelli commented 5 months ago

thanks for the report, will take a look

MarcoGorelli commented 5 months ago

Looking at this more closely, I think the current behaviour is correct

Or at least, it's behaving as documented. The docs say

‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.

And indeed:

In [16]: pl.Series([datetime(2023, 3, 26, 16, 56)], dtype=pl.Datetime('us', 'Europe/Paris')).dt.truncate('1d').dt.offset_by('6h')
Out[16]:
shape: (1,)
Series: '' [datetime[μs, Europe/Paris]]
[
        2023-03-26 07:00:00 CEST
]
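The 07:00 result can be reproduced with the standard library alone (a sketch, independent of Polars): truncating 2023-03-26 16:56 CEST to midnight lands before the spring-forward transition, and adding the 6-hour offset as *elapsed* time then crosses it.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")
utc = ZoneInfo("UTC")

# Earliest timestamp: 2023-03-26 16:56 CEST (after the spring-forward).
t = datetime(2023, 3, 26, 16, 56, tzinfo=paris)

# Truncate to the start of the day: 2023-03-26 00:00, which is still CET (+01:00).
start_of_day = t.replace(hour=0, minute=0, second=0, microsecond=0)

# Add the 6-hour offset in elapsed (absolute) time. The clocks jump from
# 02:00 to 03:00 along the way, so the wall clock ends up reading 07:00 CEST.
window_start = (start_of_day.astimezone(utc) + timedelta(hours=6)).astimezone(paris)

print(window_start)  # 2023-03-26 07:00:00+02:00
```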
MarcoGorelli commented 4 months ago

closing then as this looks expected, but thanks for the report! please do reach out if anything else trips you up

michelbl commented 4 months ago

Hi @MarcoGorelli , thanks for your analysis. I was on holiday so I couldn't reply sooner. However, here are a few arguments you might want to consider.

Let's look at that example:

from datetime import datetime, timedelta, UTC

import polars as pl

print(
    pl.DataFrame(
        data={
            "t": pl.Series(
                [
                    datetime(2025, 1, 1, 5, 56, tzinfo=UTC),
                    datetime(2025, 1, 1, 10, 6, tzinfo=UTC),
                    datetime(2025, 1, 2, 5, 45, tzinfo=UTC),
                ]
            )
            .dt.cast_time_unit("ms")
            .dt.convert_time_zone("Europe/Paris"),
            "q": [10, 11, 12],
        }
    )
    .set_sorted("t")
    .group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
    .agg([pl.sum("q").alias("q")])
)

that produces the following result:

┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2025-01-01 06:00:00 CET    ┆ 21  │
│ 2025-01-02 06:00:00 CET    ┆ 12  │
└────────────────────────────┴─────┘

Now, I mischievously add a single data point 28 years back in time with the value 0:

from datetime import datetime, timedelta, UTC

import polars as pl

print(
    pl.DataFrame(
        data={
            "t": pl.Series(
                [
                    datetime(1997, 3, 30, 14, 56, tzinfo=UTC),
                    datetime(2025, 1, 1, 5, 56, tzinfo=UTC),
                    datetime(2025, 1, 1, 10, 6, tzinfo=UTC),
                    datetime(2025, 1, 2, 5, 45, tzinfo=UTC),
                ]
            )
            .dt.cast_time_unit("ms")
            .dt.convert_time_zone("Europe/Paris"),
            "q": [0, 10, 11, 12],
        }
    )
    .set_sorted("t")
    .group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
    .agg([pl.sum("q").alias("q")])
)

and now the results are all messed up:

┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 1997-03-30 07:00:00 CEST   ┆ 0   │
│ 2024-12-31 07:00:00 CET    ┆ 10  │
│ 2025-01-01 07:00:00 CET    ┆ 23  │
└────────────────────────────┴─────┘

As we can see, how the data is aggregated depends not only on the parameters but also on the data itself: a single extra data point changes the entire result. I cannot think of a use case where this is desirable. Moreover, it conflicts with the natural expectation that the windows are fully determined by the offset and every arguments and do not vary with the data being aggregated.

Furthermore, I don't fully agree that this is a documented behavior. Let's follow the documented recipe:

‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.

  • Our first timestamp is 1997-03-30 16:56 Europe/Paris (which happens to be in summer time)
  • Truncated by day: 1997-03-30 00:00 Europe/Paris (now we are in winter time)
  • Adding an offset of 6 hours: do you add 6 "wall time" hours or 6 "absolute time" hours? In the former case the result is 1997-03-30 06:00 Europe/Paris (summer time); in the latter case it is 1997-03-30 07:00 Europe/Paris. Since both semantics are perfectly valid on their own, there is no way to tell which one is used.
  • Since the computation of all subsequent window starts uses the "wall time" semantics (the DST hour gets added/removed), it is reasonable to assume that the first window start does too. But this is not the case.
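The two offset semantics above can be contrasted with the standard library alone (a sketch; Polars is not needed), using midnight just before the 1997-03-30 spring-forward:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")

# Midnight just before the 1997-03-30 "spring forward" in Europe/Paris (CET, +01:00).
midnight = datetime(1997, 3, 30, 0, 0, tzinfo=paris)

# "Absolute time" semantics: add 6 elapsed hours. The clocks jump from
# 02:00 to 03:00 along the way, so the wall clock reads 07:00 CEST.
absolute = (midnight.astimezone(ZoneInfo("UTC")) + timedelta(hours=6)).astimezone(paris)

# "Wall time" semantics: move the clock reading forward by 6 hours,
# ignoring the DST jump, giving 06:00 CEST.
wall = midnight.replace(hour=6)

print(absolute)  # 1997-03-30 07:00:00+02:00
print(wall)      # 1997-03-30 06:00:00+02:00
```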

I also think that the consequences are hard to foresee (I deployed that code in production for months, fully unaware of this edge case.)

If you still think that this behavior is suitable in some circumstances, then adding a warning in the documentation would at least prevent users from making wrong assumptions.

MarcoGorelli commented 4 months ago

thanks for providing more context

.dt.offset_by mentions which durations are "calendar" ones and which are not: https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.dt.offset_by.html#polars.Expr.dt.offset_by . That should probably be linked
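The calendar-vs-fixed distinction can be illustrated with the standard library (a sketch; the behavior described, not the Polars implementation itself): adding one calendar day preserves the wall-clock time across a DST jump, while adding 24 elapsed hours does not.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")

# Noon on the day before the 2023-03-26 spring-forward in Europe/Paris.
t = datetime(2023, 3, 25, 12, 0, tzinfo=paris)

# Calendar day: same wall-clock time on the next date
# (analogous to a calendar-aware offset like "1d").
calendar = t.replace(day=26)

# Elapsed 24 hours: crosses the DST jump, so the wall clock reads 13:00
# (analogous to a fixed-duration offset like "24h").
elapsed = (t.astimezone(ZoneInfo("UTC")) + timedelta(hours=24)).astimezone(paris)

print(calendar)  # 2023-03-26 12:00:00+02:00
print(elapsed)   # 2023-03-26 13:00:00+02:00
```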

MarcoGorelli commented 4 months ago

To be honest, the default of -every for offset has been annoying me for some time

It may be a good idea to redesign this a bit - it may need to be a breaking change as part of the 1.0 release, but if it's done right, it'll be for the better