Open thehomebrewnerd opened 2 years ago
This would require a dedicated extension type, so it's possible.
Certainly a well-tested, community-provided PR would be reviewable by core.
I don't see a huge clamor for this in any event.
This could indeed be done through an ExtensionArray, but I would say that for that reason it could perfectly be done in an external package that provides this, instead of having it in pandas itself?
@jorisvandenbossche Wouldn't pandas benefit from having native support for this? I imagine 3 scenarios with datetimes:
Some examples of realistic datasets where multiple timezones in 1 column could show up:
1. A dataset with taxi trips in a city that includes a column of `pickup_datetimes`.
   a. A taxi rider is picked up in PDT, dropped off, and another taxi rider is picked up in EDT
2. A dataset where you record business opening times for 1 year:
   a. You could have 1 column with EDT and EST datetimes
I also mentioned this in https://issues.apache.org/jira/browse/ARROW-16540, and the same applies for pandas: a mixture of datetimes with or without DST is considered as the same timezone. (although in pandas we don't have a method like "is_dst" to know which values in the column are using DST, but that could be a feature request).
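To illustrate the DST point, here is a minimal sketch that is not from the thread. The sample dates and the `is_dst` variable name are assumptions chosen for illustration. It shows that winter (EST) and summer (EDT) values localized to the same zone share one timezone-aware dtype, and one possible way to derive a per-value DST flag using `Timestamp.dst()`, which is inherited from `datetime.datetime`.

```python
import pandas as pd

# Two wall-clock times in the same zone: one in winter (EST), one in summer (EDT).
naive = pd.Series(pd.to_datetime(["2018-01-15 09:00", "2018-07-15 09:00"]))
eastern = naive.dt.tz_localize("US/Eastern")

# Both values share a single timezone-aware dtype, so the .dt accessor works.
print(eastern.dtype)
print(eastern.dt.hour.tolist())

# Element-wise DST flag (illustrative workaround, not a pandas feature):
# Timestamp.dst() returns the DST offset, which is zero outside DST.
is_dst = eastern.map(lambda ts: bool(ts.dst()))
print(is_dst.tolist())
```

The element-wise `map` is slow on large columns; it is only meant to show that the information is recoverable from a single-timezone column.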
> Wouldn't pandas benefit from having native support for this?
To be clear, I am not saying that there are no use cases for this, or that it can't be useful for users of pandas. But not everything that is useful needs to be included in pandas itself. It is always a trade-off between including something in pandas vs having a third party package that provides additional functionality on top of pandas.
@jreback @jorisvandenbossche I was running the above example with Python 3.8.12 and pandas 1.2.2
import pandas as pd
mixed_tz_series = pd.Series([
    pd.to_datetime("2018-03-01 09:25:00").tz_localize(tz="US/Eastern"),
    pd.to_datetime("2018-03-01 09:25:00").tz_localize(tz="US/Pacific"),
    pd.to_datetime("2018-03-01 09:25:00").tz_localize(tz="US/Central"),
    pd.to_datetime("2018-03-01 09:25:00").tz_localize(tz="Europe/Vienna"),
], dtype="datetime64[ns]")
print(mixed_tz_series._data)
And it returns
SingleBlockManager
Items: RangeIndex(start=0, stop=4, step=1)
DatetimeBlock: 4 dtype: datetime64[ns]
In pandas 1.4.3 it is now stored as `object`.
SingleBlockManager
Items: RangeIndex(start=0, stop=4, step=1)
ObjectBlock: 4 dtype: object
The change appears to have occurred between 1.2.5 (still `datetime64[ns]`) and 1.3.0 (now `object`).
Would you know why the `dtype` was changed to `object`, even though the `dtype` itself is specified?
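For readers who hit the same behavior, the following is a hedged sketch of one common workaround, not an answer to the question above. It assumes that losing the original zones is acceptable: the mixed-timezone values are normalized to UTC with `pd.to_datetime(..., utc=True)`, which yields a timezone-aware datetime dtype on which `.dt` works. The variable names `mixed` and `as_utc` are illustrative.

```python
import pandas as pd

# Same instants as in the example above, built directly as tz-aware Timestamps.
mixed = pd.Series([
    pd.Timestamp("2018-03-01 09:25:00", tz="US/Eastern"),
    pd.Timestamp("2018-03-01 09:25:00", tz="US/Pacific"),
    pd.Timestamp("2018-03-01 09:25:00", tz="US/Central"),
    pd.Timestamp("2018-03-01 09:25:00", tz="Europe/Vienna"),
])
print(mixed.dtype)  # mixed timezones fall back to object

# Convert every value to UTC; the result has a single tz-aware dtype.
as_utc = pd.to_datetime(mixed, utc=True)
print(as_utc.dtype)
print(as_utc.dt.hour.tolist())
```

The trade-off is that the per-row timezone is discarded; if it matters, it would have to be kept in a separate column.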
Is your feature request related to a problem?
I wish I could use pandas to store a column of timezone-aware datetime values with different timezones in a series with a `datetime64` dtype. In certain applications it is desirable to perform operations on all columns of a certain type, and currently a column with mixed types gets stored as `object`, which makes it difficult to programmatically identify the column as containing datetime values based on the dtype, and the `object` dtype prevents doing things like accessing the day of the datetime with the `.dt` accessor.
Describe the solution you'd like
I would like to have the ability to store a series of timezone-aware values with mixed timezones and use the `.dt` accessor to access the underlying datetime components.
API breaking implications
None that I'm aware of.
Describe alternatives you've considered
Instead of using the `.dt` accessor on the series, one could use `apply` with a lambda function (or other function) to get at the underlying date components, but this does not address the fact that the series is not stored with a `datetime` dtype, making it more difficult to determine that the datetime operations could/should be applied to the column.