Open michelbl opened 5 months ago
thanks for the report, will take a look
Looking at this more closely, I think the current behaviour is correct
Or at least, it's behaving as documented. The docs say
‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
And indeed:
In [16]: pl.Series([datetime(2023, 3, 26, 16, 56)], dtype=pl.Datetime('us', 'Europe/Paris')).dt.truncate('1d').dt.offset_by('6h')
Out[16]:
shape: (1,)
Series: '' [datetime[μs, Europe/Paris]]
[
2023-03-26 07:00:00 CEST
]
closing then as this looks expected, but thanks for the report! please do reach out if anything else trips you up
Hi @MarcoGorelli, thanks for your analysis. I was on holiday, so I didn't answer in time. However, here are a few arguments you might want to consider.
Let's look at that example:
from datetime import datetime, timedelta, UTC
import polars as pl
print(
    pl.DataFrame(
        data={
            "t": pl.Series(
                [
                    datetime(2025, 1, 1, 5, 56, tzinfo=UTC),
                    datetime(2025, 1, 1, 10, 6, tzinfo=UTC),
                    datetime(2025, 1, 2, 5, 45, tzinfo=UTC),
                ]
            )
            .dt.cast_time_unit("ms")
            .dt.convert_time_zone("Europe/Paris"),
            "q": [10, 11, 12],
        }
    )
    .set_sorted("t")
    .group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
    .agg([pl.sum("q").alias("q")])
)
that produces the following result:
┌────────────────────────────┬─────┐
│ t ┆ q │
│ --- ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2025-01-01 06:00:00 CET ┆ 21 │
│ 2025-01-02 06:00:00 CET ┆ 12 │
└────────────────────────────┴─────┘
Now, I mischievously add a single data point 28 years back in time with the value 0:
from datetime import datetime, timedelta, UTC
import polars as pl
print(
    pl.DataFrame(
        data={
            "t": pl.Series(
                [
                    datetime(1997, 3, 30, 14, 56, tzinfo=UTC),
                    datetime(2025, 1, 1, 5, 56, tzinfo=UTC),
                    datetime(2025, 1, 1, 10, 6, tzinfo=UTC),
                    datetime(2025, 1, 2, 5, 45, tzinfo=UTC),
                ]
            )
            .dt.cast_time_unit("ms")
            .dt.convert_time_zone("Europe/Paris"),
            "q": [0, 10, 11, 12],
        }
    )
    .set_sorted("t")
    .group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
    .agg([pl.sum("q").alias("q")])
)
and now the results are all messed up:
┌────────────────────────────┬─────┐
│ t ┆ q │
│ --- ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 1997-03-30 07:00:00 CEST ┆ 0 │
│ 2024-12-31 07:00:00 CET ┆ 10 │
│ 2025-01-01 07:00:00 CET ┆ 23 │
└────────────────────────────┴─────┘
As we can see, how the data is aggregated depends not only on the parameters, but also on the data itself. A single data point affects the whole result. I cannot think of any use case where that is a desired behavior. Moreover, I believe it conflicts with the expectation that the windows are fully defined by the arguments offset and every, and do not vary depending on the data to aggregate.
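To make the data dependence concrete, here is a minimal standard-library sketch. It is not polars' implementation: `first_window_start` and `window_start_for` are hypothetical helpers, and the model assumes (as the outputs above suggest) that the first window start is the local midnight of the earliest timestamp plus the offset in absolute time, and that later windows keep the same wall-clock time of day.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")
UTC = timezone.utc


def first_window_start(earliest, offset):
    """Hypothetical model: truncate to local midnight, add offset in absolute time."""
    local = earliest.astimezone(PARIS)
    midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
    # Going through UTC makes the addition count elapsed (absolute) hours.
    return (midnight.astimezone(UTC) + offset).astimezone(PARIS)


def window_start_for(ts, first_start):
    """Hypothetical model: daily windows reuse the wall-clock time of first_start."""
    local = ts.astimezone(PARIS)
    candidate = local.replace(
        hour=first_start.hour, minute=first_start.minute, second=0, microsecond=0
    )
    if local < candidate:
        # ts is earlier in the day than the window boundary: previous day's window.
        candidate -= timedelta(days=1)
    return candidate


offset = timedelta(hours=6)
point = datetime(2025, 1, 1, 5, 56, tzinfo=UTC)  # 06:56 CET

# Earliest timestamp in 2025: windows start at 06:00 wall time.
start_2025 = first_window_start(point, offset)
print(window_start_for(point, start_2025).isoformat())

# Earliest timestamp on the 1997 DST day: windows start at 07:00 wall time.
start_1997 = first_window_start(datetime(1997, 3, 30, 14, 56, tzinfo=UTC), offset)
print(window_start_for(point, start_1997).isoformat())
```

Under these assumptions the same 2025 data point lands in the 2025-01-01 06:00 window in the first case and in the 2024-12-31 07:00 window in the second, which matches the two tables above.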
Furthermore, I don't fully agree that this is a documented behavior. Let's follow the documented recipe:
‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
- Our first timestamp is 1997-03-30 16:56 Europe/Paris (which happens to be in summer time)
- Truncated by day: 1997-03-30 00:00 Europe/Paris (now we are in winter time)
- Adding an offset of 6 hours. Do you add 6 "wall time" hours or 6 "absolute time" hours? In the former case, the result is 1997-03-30 06:00 Europe/Paris (summer time); in the latter case, the result is 1997-03-30 07:00 Europe/Paris. Since the two semantics are completely valid on their own, there is no way to tell which one is used.
- Since the computations of all the subsequent time windows use the "wall time" semantics (DST hour gets added/removed), it is reasonable to assume that the first window start is also using the "wall time" semantics. But this is not the case.
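The two readings of "adding 6 hours" in the steps above can be sketched with the standard library alone. This only illustrates the two semantics, not how polars computes them; the variable names are made up for the example. It relies on Python's aware-datetime arithmetic being wall-clock arithmetic within a single time zone.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")

# Truncated timestamp: local midnight on the 1997 spring-forward day (still CET).
midnight = datetime(1997, 3, 30, 0, 0, tzinfo=PARIS)
six_hours = timedelta(hours=6)

# "Wall time" semantics: adding a timedelta to an aware datetime moves the
# local clock fields, so the skipped hour between 02:00 and 03:00 is ignored.
wall = midnight + six_hours

# "Absolute time" semantics: convert to UTC, add six elapsed hours, convert back.
absolute = (midnight.astimezone(timezone.utc) + six_hours).astimezone(PARIS)

print(wall.isoformat())      # 1997-03-30T06:00:00+02:00
print(absolute.isoformat())  # 1997-03-30T07:00:00+02:00
```

The two results differ by exactly the hour skipped at the DST transition, which is the gap between the expected 06:00 window start and the observed 07:00 one.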
I also think that the consequences are hard to foresee (I deployed that code in production for months, fully unaware of this edge case.)
If you still think that this behavior is suitable in some circumstances, then adding a warning in the documentation would at least prevent users from making wrong assumptions.
thanks for providing more context
.dt.offset_by mentions which durations are "calendar" ones, and which are not: https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.dt.offset_by.html#polars.Expr.dt.offset_by. That should probably be linked
To be honest, the default of -every for offset has been annoying me for some time
It may be a good idea to redesign this a bit - it may need to be a breaking change as part of the 1.0 release, but if it's done right, it'll be for the better
Issue description
When a DST time change ("spring forward" or "fall back") happens between midnight and the first window start, none of the window starts are offset correctly.
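As a rough way to check whether a given date is affected, the sketch below compares the UTC offset at local midnight with the UTC offset at midnight plus the window offset in wall time. `dst_shift_before_offset` is a hypothetical helper written for this illustration, not part of polars, and it assumes the window offset does not itself land inside the skipped or repeated hour.

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo


def dst_shift_before_offset(day, offset, tz):
    """Return the UTC-offset change between local midnight and midnight + offset.

    A non-zero result means a DST transition falls between midnight and the
    intended wall-clock window start on that day.
    """
    midnight = datetime(day.year, day.month, day.day, tzinfo=tz)
    wall_start = midnight + offset  # wall-clock arithmetic on an aware datetime
    return wall_start.utcoffset() - midnight.utcoffset()


paris = ZoneInfo("Europe/Paris")
six_hours = timedelta(hours=6)

spring = dst_shift_before_offset(date(1997, 3, 30), six_hours, paris)
winter = dst_shift_before_offset(date(2025, 1, 1), six_hours, paris)
autumn = dst_shift_before_offset(date(2024, 10, 27), six_hours, paris)

print(spring, winter, autumn)
```

A result of one hour (spring forward) or minus one hour (fall back) for the date of the earliest timestamp flags the situation described here; zero means the two ways of adding the offset agree on that date.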