pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.41k stars 1.97k forks source link

Support parsing timezone-aware datetimes in constructors when data type is also timezone-aware #16297

Closed stinodego closed 5 months ago

stinodego commented 6 months ago

Description

I ran into this today, and I think we can improve behavior here.

Consider this code:

from datetime import datetime
from zoneinfo import ZoneInfo
import polars as pl

data = [datetime(1970, 1, 1, tzinfo=ZoneInfo("Iran"))]
s = pl.Series(data, dtype=pl.Datetime(time_zone="Iran"), strict=False)
ValueError: time-zone-aware datetimes are converted to UTC

Please either drop the time zone from the dtype, or set it to 'UTC'. To convert to a different time zone, please use `.dt.convert_time_zone`.

This is odd. If we're casting timezone-aware data anyway, might as well cast it to desired time zone, right?

One of the benefits of doing this is that timezone-aware data can then be roundtripped, like in one of our tests:

@given(s=series())
def test_to_list(s: pl.Series) -> None:
    values = s.to_list()
    result = pl.Series(values, dtype=s.dtype)
    assert_series_equal(s, result, categorical_as_str=True)

For reference, PyArrow seems to handle this a bit differently from us and they do support this:

import pyarrow as pa

data = [datetime(1970, 1, 1, tzinfo=ZoneInfo("Iran"))]
arr = pa.array(data, type=pa.timestamp("us", tz="Iran"))

print(arr)  # 1969-12-31 20:30:00.000000Z
print(arr.type)  # timestamp[us, tz=Iran]

I may be missing something here, but I thought I'd throw this out there. Let's see what @MarcoGorelli thinks 😄

MarcoGorelli commented 6 months ago

thanks for the ping

so, for .str.to_datetime, the current rules are:

The constructor should probably not be too different. Let's see:

For the last one, I think you're suggesting to convert to the given time zone. So long as it's clearly documented, and it's done for both the Series constructor and .str.to_datetime, I think it makes sense

MarcoGorelli commented 5 months ago

I've taken another look at PyArrow, and there is something else probably worth mirroring

For .str.to_datetime, if the strings are offset-aware, then they convert to UTC. Just like Polars does 👍:

In [8]: pc.strptime(pa.array(['2020-01-01T01:02:03+01:00']), unit='us', format='%Y-%m-%dT%H:%M:%S%z').type
Out[8]: TimestampType(timestamp[us, tz=UTC])

In [11]: pl.Series(['2020-01-01T01:02:03+01:00']).str.to_datetime(time_unit='us').dtype
Out[11]: Datetime(time_unit='us', time_zone='UTC')

But, for in the constuctor, when starting from a tz-aware stdlib datetime object, they take the time zone of the first such object:

In [14]: pa.array([datetime(2020, 1, 1, tzinfo=timezone(timedelta(hours=1))), datetime(2020, 1, 2)]).type
Out[14]: TimestampType(timestamp[us, tz=+01:00])

In [15]: pl.Series([datetime(2020, 1, 1, tzinfo=timezone(timedelta(hours=1))), datetime(2020, 1, 2)]).dtype
<ipython-input-15-9332dc369f5e>:1: TimeZoneAwareConstructorWarning: Constructing a Series with time-zone-aware datetimes results in a Series with UTC time zone. To silence this warning, you can filter warnings of class TimeZoneAwareConstructorWarning, or set 'UTC' as the time zone of your datatype.
  pl.Series([datetime(2020, 1, 1, tzinfo=timezone(timedelta(hours=1))), datetime(2020, 1, 2)]).dtype
Out[15]: Datetime(time_unit='us', time_zone='UTC')

whereas Polars still converts to UTC

One suggestion could be:


There's a further difference though. If the user specifies the time zone as part of the dtype, then Polars sets that as the dtype, whereas PyArrow converts as if starting from UTC:

In [25]: pa.array([datetime(2020, 1, 1), datetime(2020, 1, 2)], type=pa.timestamp('us', 'Iran'))
Out[25]:
<pyarrow.lib.TimestampArray object at 0x7f2d3b4e5c60>
[
  2020-01-01 00:00:00.000000Z,
  2020-01-02 00:00:00.000000Z
]

In [26]: pl.Series([datetime(2020, 1, 1), datetime(2020, 1, 2)], dtype=pl.Datetime('us', 'Iran'))
Out[26]:
shape: (2,)
Series: '' [datetime[μs, Iran]]
[
        2020-01-01 00:00:00 +0330
        2020-01-02 00:00:00 +0330
]

It looks like their rule is:


Where does this leave Polars? Not totally sure, just wanted to leave these findings here for now


Something which currently doesn't look great (and is unintuitive?) is this:

In [22]: pl.Series([datetime(2020, 1, 1), datetime(2020, 1, 1, tzinfo=ZoneInfo('Asia/Kathmandu'))], dtype=pl.Datetime('us', 'Europe/Amsterdam'))
Out[22]:
shape: (2,)
Series: '' [datetime[μs, Europe/Amsterdam]]
[
        2020-01-01 00:00:00 CET
        2019-12-31 18:15:00 CET
]

The second element gets converted to 'UTC', but then its time zone is replaced with 'Europe/Amsterdam'


OK, got a concrete proposal in https://github.com/pola-rs/polars/pull/16828. It addresses several inconsistencies, but in doing so is unfortunately breaking for some people. In those cases, however, a clear warning is issued, advising the user about what to do instead