Closed: stinodego closed this 5 months ago
thanks for the ping
so, for .str.to_datetime, the current rules are:
In [17]: pl.Series(['2020-01-01T01:23:45']).str.to_datetime(time_zone='Asia/Kathmandu')
Out[17]:
shape: (1,)
Series: '' [datetime[μs, Asia/Kathmandu]]
[
2020-01-01 01:23:45 +0545
]
In [18]: pl.Series(['2020-01-01T01:23:45+05:45']).str.to_datetime()
Out[18]:
shape: (1,)
Series: '' [datetime[μs, UTC]]
[
2019-12-31 19:38:45 UTC
]
ComputeError: offset-aware datetimes are converted to UTC. Please either drop the time zone from the function call, or set it to UTC. To convert to a different time zone, please use `convert_time_zone`.
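The UTC value in Out[18] can be cross-checked with the standard library alone. This is only a sketch of the arithmetic using stdlib `datetime`; it does not call Polars and says nothing about how Polars implements the conversion:

```python
from datetime import datetime, timezone

# An offset-aware ISO 8601 string, the same value as in In [18].
aware = datetime.fromisoformat("2020-01-01T01:23:45+05:45")

# Converting to UTC keeps the instant and changes the wall-clock reading.
as_utc = aware.astimezone(timezone.utc)
print(as_utc.isoformat())  # 2019-12-31T19:38:45+00:00
```

Subtracting the +05:45 offset from 01:23:45 lands on 19:38:45 the previous day, which matches the Polars output.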
The constructor should probably not be too different. Let's see:
In [20]: pl.Series([datetime(2020, 1, 1)], dtype=pl.Datetime('us', 'Asia/Kathmandu'))
Out[20]:
shape: (1,)
Series: '' [datetime[μs, Asia/Kathmandu]]
[
2020-01-01 00:00:00 +0545
]
In [21]: pl.Series([datetime(2020, 1, 1, tzinfo=ZoneInfo('Asia/Kathmandu'))], dtype=pl.Datetime)
<ipython-input-21-3f19ea488109>:1: TimeZoneAwareConstructorWarning: Constructing a Series with time-zone-aware datetimes results in a Series with UTC time zone. To silence this warning, you can filter warnings of class TimeZoneAwareConstructorWarning, or set 'UTC' as the time zone of your datatype.
pl.Series([datetime(2020, 1, 1, tzinfo=ZoneInfo('Asia/Kathmandu'))], dtype=pl.Datetime)
Out[21]:
shape: (1,)
Series: '' [datetime[μs, UTC]]
[
2019-12-31 18:15:00 UTC
]
For the last one, I think you're suggesting to convert to the given time zone. So long as it's clearly documented, and it's done for both the Series constructor and .str.to_datetime, I think it makes sense
I've taken another look at PyArrow, and there is something else probably worth mirroring
For .str.to_datetime, if the strings are offset-aware, then they convert to UTC. Just like Polars does 👍:
In [8]: pc.strptime(pa.array(['2020-01-01T01:02:03+01:00']), unit='us', format='%Y-%m-%dT%H:%M:%S%z').type
Out[8]: TimestampType(timestamp[us, tz=UTC])
In [11]: pl.Series(['2020-01-01T01:02:03+01:00']).str.to_datetime(time_unit='us').dtype
Out[11]: Datetime(time_unit='us', time_zone='UTC')
But in the constructor, when starting from a tz-aware stdlib datetime object, they take the time zone of the first such object:
In [14]: pa.array([datetime(2020, 1, 1, tzinfo=timezone(timedelta(hours=1))), datetime(2020, 1, 2)]).type
Out[14]: TimestampType(timestamp[us, tz=+01:00])
In [15]: pl.Series([datetime(2020, 1, 1, tzinfo=timezone(timedelta(hours=1))), datetime(2020, 1, 2)]).dtype
<ipython-input-15-9332dc369f5e>:1: TimeZoneAwareConstructorWarning: Constructing a Series with time-zone-aware datetimes results in a Series with UTC time zone. To silence this warning, you can filter warnings of class TimeZoneAwareConstructorWarning, or set 'UTC' as the time zone of your datatype.
pl.Series([datetime(2020, 1, 1, tzinfo=timezone(timedelta(hours=1))), datetime(2020, 1, 2)]).dtype
Out[15]: Datetime(time_unit='us', time_zone='UTC')
whereas Polars still converts to UTC
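The "time zone of the first tz-aware object" behaviour observed for PyArrow can be sketched in plain Python. `first_time_zone` is a hypothetical helper written for illustration; it is not part of PyArrow or Polars and only mimics the rule as observed in Out[14]:

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Iterable, Optional


def first_time_zone(values: Iterable[Optional[datetime]]) -> Optional[tzinfo]:
    """Hypothetical helper: return the tzinfo of the first tz-aware datetime."""
    for value in values:
        # A datetime is tz-aware when utcoffset() returns something other than None.
        if value is not None and value.utcoffset() is not None:
            return value.tzinfo
    return None


values = [
    datetime(2020, 1, 1, tzinfo=timezone(timedelta(hours=1))),
    datetime(2020, 1, 2),
]
tz = first_time_zone(values)
print(tz)  # UTC+01:00
```

With only naive inputs the helper returns `None`, which would correspond to a result without a time zone.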
One suggestion could be:
- [...] datetime objects, then take the timezone of the first one
- [...] timezone(timedelta(minutes=47))), then raise
There's a further difference though. If the user specifies the time zone as part of the dtype, then Polars sets that as the dtype, whereas PyArrow converts as if starting from UTC:
In [25]: pa.array([datetime(2020, 1, 1), datetime(2020, 1, 2)], type=pa.timestamp('us', 'Iran'))
Out[25]:
<pyarrow.lib.TimestampArray object at 0x7f2d3b4e5c60>
[
2020-01-01 00:00:00.000000Z,
2020-01-02 00:00:00.000000Z
]
In [26]: pl.Series([datetime(2020, 1, 1), datetime(2020, 1, 2)], dtype=pl.Datetime('us', 'Iran'))
Out[26]:
shape: (2,)
Series: '' [datetime[μs, Iran]]
[
2020-01-01 00:00:00 +0330
2020-01-02 00:00:00 +0330
]
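The difference between the two outputs above is "convert" versus "replace". A minimal stdlib sketch, assuming a fixed +03:30 offset stands in for 'Iran' on this date (the offset shown in Out[26]); it does not call PyArrow or Polars:

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed +03:30 offset stands in for 'Iran' in January 2020.
IRAN = timezone(timedelta(hours=3, minutes=30))
naive = datetime(2020, 1, 1)

# "Convert": treat the naive value as UTC, then express it in the target zone.
converted = naive.replace(tzinfo=timezone.utc).astimezone(IRAN)

# "Replace": keep the wall-clock reading and attach the target zone.
replaced = naive.replace(tzinfo=IRAN)

print(converted.isoformat())  # 2020-01-01T03:30:00+03:30
print(replaced.isoformat())   # 2020-01-01T00:00:00+03:30
```

The two results are different instants, 3 hours 30 minutes apart, even though both carry the same time zone.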
It looks like their rule is:
- [...] pa.timestamp, then convert (don't replace) to that
- [...] type, then convert to the time zone of the first non-null element
Where does this leave Polars? Not totally sure, just wanted to leave these findings here for now
Something which currently doesn't look great (and is unintuitive?) is this:
In [22]: pl.Series([datetime(2020, 1, 1), datetime(2020, 1, 1, tzinfo=ZoneInfo('Asia/Kathmandu'))], dtype=pl.Datetime('us', 'Europe/Amsterdam'))
Out[22]:
shape: (2,)
Series: '' [datetime[μs, Europe/Amsterdam]]
[
2020-01-01 00:00:00 CET
2019-12-31 18:15:00 CET
]
The second element gets converted to 'UTC', but then its time zone is replaced with 'Europe/Amsterdam'
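To see why this is surprising, the second element can be traced by hand with the standard library. This is a sketch under the assumption that fixed offsets (+05:45 and +01:00) stand in for 'Asia/Kathmandu' and 'Europe/Amsterdam' on 2020-01-01; it does not call Polars:

```python
from datetime import datetime, timedelta, timezone

# Assumptions: fixed offsets stand in for the named zones on 2020-01-01.
KATHMANDU = timezone(timedelta(hours=5, minutes=45))
AMSTERDAM = timezone(timedelta(hours=1))  # CET in winter

aware = datetime(2020, 1, 1, tzinfo=KATHMANDU)

# What Out[22] shows: convert to UTC, then relabel the wall-clock time.
relabelled = aware.astimezone(timezone.utc).replace(tzinfo=AMSTERDAM)

# Converting instead keeps the original instant.
converted = aware.astimezone(AMSTERDAM)

print(relabelled.isoformat())  # 2019-12-31T18:15:00+01:00
print(converted.isoformat())   # 2019-12-31T19:15:00+01:00
print(relabelled == aware, converted == aware)  # False True
```

The relabelled value reproduces the 18:15 CET shown in Out[22], but it no longer refers to the same instant as the input, whereas the converted value does.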
OK, got a concrete proposal in https://github.com/pola-rs/polars/pull/16828. It addresses several inconsistencies, but in doing so is unfortunately breaking for some people. In those cases, however, a clear warning is issued, advising the user about what to do instead
Description
I ran into this today, and I think we can improve behavior here.
Consider this code:
This is odd. If we're casting timezone-aware data anyway, we might as well cast it to the desired time zone, right?
One of the benefits of doing this is that timezone-aware data can then be roundtripped, like in one of our tests:
For reference, PyArrow seems to handle this a bit differently from us and they do support this:
I may be missing something here, but I thought I'd throw this out there. Let's see what @MarcoGorelli thinks 😄