pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.32k stars 17.8k forks source link

Pandas date_range error when the date falls in a DST starting day. #30378

Open dfonnegra opened 4 years ago

dfonnegra commented 4 years ago

Code Sample

import pandas as pd
pd.date_range(start=pd.Timestamp('2016-10-05 00:00:00'), end=pd.Timestamp('2016-10-30 00:00:00'), freq='1D', tz='America/Sao_Paulo')

Problem description

[When I try to generate a range of daily data with America/Sao_Paulo timezone, it breaks with NonExistentTimeError: 2016-10-16 00:00:00. My pandas version is 0.25.3]

dfonnegra commented 4 years ago

I iterated through all timezones and got that this list of timezones fail too:

America/Asuncion America/Campo_Grande America/Cuiaba America/Havana America/Punta_Arenas America/Santiago America/Sao_Paulo America/Scoresbysund Antarctica/Casey Antarctica/Palmer Asia/Amman Asia/Beirut Asia/Damascus Asia/Gaza Asia/Hebron Asia/Tehran Atlantic/Azores Brazil/East Chile/Continental Cuba Iran

TomAugspurger commented 4 years ago

What's the expected behavior here? If the user is trying to create an invalid datetime, shouldn't we raise?

dfonnegra commented 4 years ago

The same behaviour used when this exact same bug happened with the resample method in the 0.23.x version. Check this in the 0.24.0 changelong

TomAugspurger commented 4 years ago

What was the resolution there? Can you post the actual expected output in the original post?

mroeschke commented 4 years ago

In the resample case, an assumption was made on the behavior of ambiguous and nonexistent timezones since the resulting index is a transformation from the original index (i.e the binning).

In the date_range case, I agree with Tom's assessment https://github.com/pandas-dev/pandas/issues/30378#issuecomment-568010280 in which the idiomatic alternative is:

pd.date_range(start=pd.Timestamp('2016-10-05 00:00:00'), end=pd.Timestamp('2016-10-30 00:00:00'), freq='1D').tz_localize('America/Sao_Paulo', nonexistent=...)

I think ultimately the tz parameter should be deprecated from date_range

kdebrab commented 4 months ago

In the resample case, an assumption was made on the behavior of ambiguous and nonexistent timezones since the resulting index is a transformation from the original index (i.e the binning).

In the date_range case, I agree with Tom's assessment #30378 (comment) in which the idiomatic alternative is:

pd.date_range(start=pd.Timestamp('2016-10-05 00:00:00'), end=pd.Timestamp('2016-10-30 00:00:00'), freq='1D').tz_localize('America/Sao_Paulo', nonexistent=...)

I think ultimately the tz parameter should be deprecated from date_range

I don't think that deprecating the tz parameter is the right solution, as one gets exactly the same error without using the tz parameter, but when start and end themselves are timezone-aware:

start = pd.Timestamp("2016-10-05").tz_localize("America/Sao_Paulo")
end = pd.Timestamp("2016-10-30").tz_localize("America/Sao_Paulo")
pd.date_range(start=start, end=end, freq="D")

What's the expected behavior here? If the user is trying to create an invalid datetime, shouldn't we raise?

I think date_range should make the same assumptions as were made in resample, such that the above returns the same as:

pd.Series(1, index=pd.date_range(start, end, freq="h")).resample("D").sum().index

which returns:

DatetimeIndex(['2016-10-05 00:00:00-03:00', '2016-10-06 00:00:00-03:00',
               '2016-10-07 00:00:00-03:00', '2016-10-08 00:00:00-03:00',
               '2016-10-09 00:00:00-03:00', '2016-10-10 00:00:00-03:00',
               '2016-10-11 00:00:00-03:00', '2016-10-12 00:00:00-03:00',
               '2016-10-13 00:00:00-03:00', '2016-10-14 00:00:00-03:00',
               '2016-10-15 00:00:00-03:00', '2016-10-16 01:00:00-02:00',
               '2016-10-17 00:00:00-02:00', '2016-10-18 00:00:00-02:00',
               '2016-10-19 00:00:00-02:00', '2016-10-20 00:00:00-02:00',
               '2016-10-21 00:00:00-02:00', '2016-10-22 00:00:00-02:00',
               '2016-10-23 00:00:00-02:00', '2016-10-24 00:00:00-02:00',
               '2016-10-25 00:00:00-02:00', '2016-10-26 00:00:00-02:00',
               '2016-10-27 00:00:00-02:00', '2016-10-28 00:00:00-02:00',
               '2016-10-29 00:00:00-02:00', '2016-10-30 00:00:00-02:00'],
              dtype='datetime64[ns, America/Sao_Paulo]', freq='D')

Notice that the index for 2016-10-16 is shifted by 1 hour.