pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Allow storing timezone-aware datetimes in a series with a datetime64 dtype #46998

Open thehomebrewnerd opened 2 years ago

thehomebrewnerd commented 2 years ago

Is your feature request related to a problem?

I wish I could use pandas to store a column of timezone-aware datetime values with different timezones in a series with a datetime64 dtype. In certain applications it is desirable to perform operations on all columns of a certain type. Currently a column with mixed timezones is stored with the object dtype, which makes it difficult to programmatically identify the column as containing datetime values based on its dtype. The object dtype also prevents things like accessing the day of each datetime with the .dt accessor.
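To illustrate the dtype-based detection problem, here is a minimal sketch. The frame and its column names are made up for illustration, and it assumes a pandas version in which mixed-timezone values are inferred as object.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical frame: the column names are illustrative only.
df = pd.DataFrame({
    "naive": pd.to_datetime(["2018-03-01", "2018-03-02"]),
    "single_tz": pd.to_datetime(["2018-03-01", "2018-03-02"]).tz_localize("US/Pacific"),
    "mixed_tz": [
        pd.Timestamp("2018-03-01", tz="US/Pacific"),
        pd.Timestamp("2018-03-01", tz="Europe/Vienna"),
    ],
})

# Select datetime columns purely by looking at their dtype.
datetime_cols = [col for col in df.columns if is_datetime64_any_dtype(df[col])]
print(datetime_cols)
print(df["mixed_tz"].dtype)
```

Under that assumption the mixed-timezone column is not picked up by the dtype check, even though every value in it is a datetime.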

Describe the solution you'd like

I would like to have the ability to store a series of timezone-aware values with mixed timezones and use the .dt accessor to access the underlying datetime components:

mixed_tz_series = pd.Series([
    pd.to_datetime("2018-03-01").tz_localize(tz="US/Pacific"),
    pd.to_datetime("2018-03-01").tz_localize(tz="US/Central"),
    pd.to_datetime("2018-03-01").tz_localize(tz="Europe/Vienna"),
], dtype="datetime64[ns]")

mixed_tz_series.dt.day

API breaking implications

None that I'm aware of.

Describe alternatives you've considered

Instead of using the .dt accessor on the series, one could use apply with a lambda function (or other function) to get at the underlying date components, but this does not address the fact that the series is not stored with a datetime dtype, making it more difficult to determine that the datetime operations could/should be applied to the column.
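A minimal sketch of that apply-based workaround, assuming the series is built without an explicit dtype so that it ends up as object:

```python
import pandas as pd

# Mixed-timezone values; with no dtype given, pandas stores them as object.
mixed_tz_series = pd.Series([
    pd.Timestamp("2018-03-01", tz="US/Pacific"),
    pd.Timestamp("2018-03-01", tz="US/Central"),
    pd.Timestamp("2018-03-01", tz="Europe/Vienna"),
])

# Each element is still a pd.Timestamp, so its attributes can be read one by one.
days = mixed_tz_series.apply(lambda ts: ts.day)
print(days.tolist())
```

This gets at the components, but it is element-wise Python rather than a vectorized .dt call, and the dtype still does not reveal that the column holds datetimes.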

jreback commented 2 years ago

this would require a dedicated extension type - so it's possible

certainly a well tested community provided PR would be reviewable by core

i don't see a huge clamor for this in any event

jorisvandenbossche commented 2 years ago

This could indeed be done through an ExtensionArray, but for that very reason I would say it could just as well live in an external package that provides this, instead of in pandas itself?

gsheni commented 2 years ago

@jorisvandenbossche Wouldn't pandas benefit from having native support for this? I imagine 3 scenarios with datetimes:

  1. datetime values with no timezone info (timezone naive)
  2. datetime values with timezone info (timezone aware)
     a. all the same timezone
     b. different timezones

Some examples of realistic datasets where multiple timezones in 1 column could show up:

  1. A dataset with taxi trips in a city that includes a column of pickup_datetimes.
     a. A taxi rider is picked up in PDT, dropped off, and another taxi rider is picked up in EDT

  2. A dataset where you record business opening times for 1 year:
     a. You could have 1 column with EDT and EST datetimes

jorisvandenbossche commented 2 years ago

> a. You could have 1 column with EDT and EST datetimes

I also mentioned this in https://issues.apache.org/jira/browse/ARROW-16540, and the same applies to pandas: a mixture of datetimes with and without DST is considered the same timezone (although pandas doesn't have a method like "is_dst" to tell which values in the column are using DST, but that could be a feature request).
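A short sketch of the EST/EDT case, using made-up opening times. The `is_dst` name below is only a hypothetical stand-in built on `Timestamp.dst()`, not an existing pandas method:

```python
import pandas as pd

# One column of business opening times covering winter (EST) and summer (EDT).
opening_times = pd.Series(
    pd.to_datetime(["2022-01-15 09:00", "2022-07-15 09:00"]).tz_localize("US/Eastern")
)

# Both values share one timezone, so the column gets a single tz-aware dtype
# and the .dt accessor works.
print(opening_times.dtype)
print(opening_times.dt.day.tolist())

# Hypothetical stand-in for an "is_dst" method: a nonzero DST offset means
# the timestamp falls in daylight saving time.
is_dst = opening_times.apply(lambda ts: bool(ts.dst()))
print(is_dst.tolist())
```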

> Wouldn't pandas benefit from having native support for this?

To be clear, I am not saying that there are no use cases for this, or that it can't be useful for users of pandas. But not everything that is useful needs to be included in pandas itself. It is always a trade-off between including something in pandas vs having a third party package that provides additional functionality on top of pandas.

cp2boston commented 2 years ago

@jreback @jorisvandenbossche I was running the above example with Python 3.8.12 and pandas 1.2.2

import pandas as pd

mixed_tz_series = pd.Series([
    pd.to_datetime("2018-03-01 09:25:00").tz_localize(tz="US/Eastern"),
    pd.to_datetime("2018-03-01 09:25:00").tz_localize(tz="US/Pacific"),
    pd.to_datetime("2018-03-01 09:25:00").tz_localize(tz="US/Central"),
    pd.to_datetime("2018-03-01 09:25:00").tz_localize(tz="Europe/Vienna"),
], dtype="datetime64[ns]")

print(mixed_tz_series._data)

And it returns

SingleBlockManager
Items: RangeIndex(start=0, stop=4, step=1)
DatetimeBlock: 4 dtype: datetime64[ns]

In pandas 1.4.3 it is instead stored with the object dtype:

SingleBlockManager
Items: RangeIndex(start=0, stop=4, step=1)
ObjectBlock: 4 dtype: object

The change appears to have occurred between 1.2.5 (still datetime64[ns]) and 1.3.0 (now object).
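For anyone who wants to check their own version, here is a rough sketch that uses the public .dtype attribute rather than the private _data. How versions other than the ones named above behave is an assumption, so the construction is wrapped in case a version rejects it outright:

```python
import pandas as pd

values = [
    pd.Timestamp("2018-03-01 09:25:00", tz="US/Eastern"),
    pd.Timestamp("2018-03-01 09:25:00", tz="Europe/Vienna"),
]

try:
    # Ask explicitly for datetime64[ns] and record what pandas actually stores.
    requested = pd.Series(values, dtype="datetime64[ns]")
    outcome = str(requested.dtype)
except Exception as exc:  # broad on purpose: the exception type may vary by version
    outcome = f"raised {type(exc).__name__}"

print(pd.__version__, outcome)
```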

Do you know why the dtype changed to object, even though datetime64[ns] is explicitly specified?