pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Respect time zone in pl.lit - follow-up to #6991 #6992

Closed MarcoGorelli closed 1 year ago

MarcoGorelli commented 1 year ago

Problem description

This follows up on a discussion started here: https://github.com/pola-rs/polars/pull/6991#discussion_r1110963272 (cc @alexander-beedie)

In short, what should pl.select(pl.lit(datetime(2020, 1, 1), dtype=pl.Datetime('us', 'Asia/Kathmandu'))) do?

I think the following two should return the same timestamp:

Given

In [7]: pl.Series(['2020-01-01']).str.strptime(pl.Datetime('us', 'Asia/Kathmandu'))
Out[7]: 
shape: (1,)
Series: '' [datetime[μs, Asia/Kathmandu]]
[
        2020-01-01 00:00:00 +0545
]

I'd expect pl.select(pl.lit(datetime(2020, 1, 1), dtype=pl.Datetime('us', 'Asia/Kathmandu'))) to just set the 'Asia/Kathmandu' time zone (as opposed to doing any implicit conversion from UTC)
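To make the distinction concrete, here is a minimal standard-library sketch of "set the time zone" versus "implicitly convert from UTC". It does not use polars. The fixed +05:45 offset is an assumption standing in for 'Asia/Kathmandu', so the sketch runs without a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed +05:45 offset standing in for 'Asia/Kathmandu' (assumption: avoids
# needing an IANA time zone database for this sketch).
kathmandu = timezone(timedelta(hours=5, minutes=45))
naive = datetime(2020, 1, 1)

# "Set" semantics: keep the wall-clock time and attach the zone.
set_tz = naive.replace(tzinfo=kathmandu)

# "Convert" semantics: read the naive value as UTC, then shift to the zone.
converted = naive.replace(tzinfo=timezone.utc).astimezone(kathmandu)

print(set_tz.isoformat())     # 2020-01-01T00:00:00+05:45
print(converted.isoformat())  # 2020-01-01T05:45:00+05:45
print(converted - set_tz)     # 5:45:00
```

The two results are different instants, 5 hours 45 minutes apart. The first matches the strptime output above.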

Furthermore, given that the following raises an error

>>> pl.Series(['2020-01-01 Z']).str.strptime(pl.Datetime('us', 'Asia/Kathmandu'), '%Y-%m-%d %#z')
ComputeError: Cannot use strptime with both 'tz_aware=True' and tz-aware Datetime.

I'd expect

pl.select(pl.lit(datetime(2020, 1, 1, tzinfo=dt.timezone.utc), dtype=pl.Datetime('us', 'Asia/Kathmandu')))

to also error
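The rule proposed here can be sketched in plain Python. The helper name `attach_literal_time_zone` and its error message are hypothetical and are not part of polars. The sketch only illustrates the expectation: a naive value takes the dtype's time zone as-is, and a tz-aware value combined with a tz-aware dtype is rejected.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional


def attach_literal_time_zone(value: datetime, dtype_tz: Optional[tzinfo]) -> datetime:
    """Hypothetical helper (not a polars function) mirroring the proposed rule."""
    if value.tzinfo is not None and dtype_tz is not None:
        # Same spirit as the strptime error quoted above: two sources of
        # time zone information are treated as ambiguous and rejected.
        raise ValueError("cannot combine a tz-aware datetime with a tz-aware dtype")
    if dtype_tz is not None:
        # Naive value: attach the zone without shifting the wall-clock time.
        return value.replace(tzinfo=dtype_tz)
    return value


# Fixed +05:45 offset standing in for 'Asia/Kathmandu' (assumption).
kathmandu = timezone(timedelta(hours=5, minutes=45))

print(attach_literal_time_zone(datetime(2020, 1, 1), kathmandu).isoformat())
try:
    attach_literal_time_zone(datetime(2020, 1, 1, tzinfo=timezone.utc), kathmandu)
except ValueError as exc:
    print(f"rejected: {exc}")
```

A more lenient variant could accept a tz-aware value whose time zone matches the dtype's. Which of the two is preferable is a design choice that this sketch does not settle.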

alexander-beedie commented 1 year ago

It's a good question :)

This update inside def lit(...) could work as suggested...

if isinstance(value, datetime):
    tu = "us" if dtype is None else getattr(dtype, "tu", "us")
    e = lit(_datetime_to_pl_timestamp(value, tu)).cast(Datetime(tu))

    dtype_tz = dtype and getattr(dtype, "tz", None)
    if value.tzinfo is not None or dtype_tz:
        return e.dt.replace_time_zone(dtype_tz or str(value.tzinfo))

    return e

...though we should survey other uses of _datetime_to_pl_timestamp to ensure we're being consistent 🤔

With the update in place we'd get the following:

from datetime import datetime
from zoneinfo import ZoneInfo
import polars as pl

d = datetime(2023, 1, 1, tzinfo=ZoneInfo("Asia/Tokyo"))
pl.DataFrame({
    "d1": [d],
    "d2": pl.select( pl.lit(d, dtype=pl.Datetime("ms")) ).to_series(),
    "d3": pl.select( pl.lit(d, dtype=pl.Datetime("ns","Europe/Berlin")) ).to_series(),
})
# ┌──────────────────────────┬──────────────────────────┬─────────────────────────────┐
# │ d1                       ┆ d2                       ┆ d3                          │
# │ ---                      ┆ ---                      ┆ ---                         │
# │ datetime[μs, Asia/Tokyo] ┆ datetime[ms, Asia/Tokyo] ┆ datetime[ns, Europe/Berlin] │
# ╞══════════════════════════╪══════════════════════════╪═════════════════════════════╡
# │ 2023-01-01 00:00:00 JST  ┆ 2023-01-01 00:00:00 JST  ┆ 2023-01-01 00:00:00 CET     │
# └──────────────────────────┴──────────────────────────┴─────────────────────────────┘

Definitely better than the current behaviour, where the given dtype timezone info gets ignored:

# ┌──────────────────────────┬──────────────────────────┬──────────────────────────┐
# │ d1                       ┆ d2                       ┆ d3                       │
# │ ---                      ┆ ---                      ┆ ---                      │
# │ datetime[μs, Asia/Tokyo] ┆ datetime[μs, Asia/Tokyo] ┆ datetime[ns, Asia/Tokyo] │
# ╞══════════════════════════╪══════════════════════════╪══════════════════════════╡
# │ 2023-01-01 00:00:00 JST  ┆ 2023-01-01 00:00:00 JST  ┆ 2023-01-01 00:00:00 JST  │
# └──────────────────────────┴──────────────────────────┴──────────────────────────┘

I think I'm in favour of your suggested/expected result. If providing a timezone in the lit dtype, it does look like "take this value and give it this dtype" - I'm not sure I'd expect implicit conversions either...

MarcoGorelli commented 1 year ago

Thanks for looking into this

I think this'd be consistent - the other public-facing place I see this being used is in date_range, where it also casts (rather than doing implicit UTC conversions):

In [3]: pl.date_range(datetime(2022, 1, 1, tzinfo=ZoneInfo('Asia/Kathmandu')), datetime(2022, 1, 2, tzinfo=ZoneInfo('Asia/Kathmandu')))
Out[3]:
shape: (2,)
Series: '' [datetime[μs, Asia/Kathmandu]]
[
        2022-01-01 00:00:00 +0545
        2022-01-02 00:00:00 +0545
]

In [6]: pl.date_range(datetime(2022, 1, 1, tzinfo=ZoneInfo('Asia/Kathmandu')), datetime(2022, 1, 2, tzinfo=ZoneInfo('Asia/Kathmandu')), time_zone='Europe/London')
---------------------------------------------------------------------------

ValueError: Given time_zone is different from that timezone aware datetimes. Given: 'Europe/London', got: 'Asia/Kathmandu'.

MarcoGorelli commented 1 year ago

@alexander-beedie here you go https://github.com/pola-rs/polars/pull/6999