pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.77k stars 17.96k forks source link

bug in tz_localize when summer/winter time change #8531

Closed gbakalian closed 6 years ago

gbakalian commented 10 years ago

Works with datetime.datetime, does not work with pandas.tslib.Timestamp. Encounter the issue when trying to resample hourly a series.

import pandas as pd
from datetime import datetime as dt
import pytz

tz = 'Europe/Berlin'
d = dt(2014,10,26,02,00)
tz = pytz.timezone(tz)
tz.localize(d)

x = pd.Series(0., [d]).index[0]
x.tz_localize(tz)
jreback commented 10 years ago

try this using 0.15.0rc1 http://pandas.pydata.org/

their is now an ambiguous option to do this (which has several options for it)

In [16]: x.tz_localize(tz,ambiguous=True)
Out[16]: Timestamp('2014-10-26 02:00:00+0200', tz='Europe/Berlin')

In [17]: x.tz_localize(tz,ambiguous='NaT')
Out[17]: NaT

In [18]: x.tz_localize(tz,ambiguous=False)
Out[18]: Timestamp('2014-10-26 02:00:00+0100', tz='Europe/Berlin')

cc @rockg cc @ischwabacher

gbakalian commented 10 years ago

I have updated the version - your example work fine. But i still have an issue when trying to resample:

import pandas as pd
from datetime import datetime as dt
import pytz

tz = 'Europe/Berlin'
d = dt(2014,10,25,03,00)
tz = pytz.timezone(tz)
dates = pd.date_range(tz.localize(d), periods=24, freq='H')
pd.Series(5,dates).resample('H')
gbakalian commented 10 years ago

How can I 'set' ambiguous equal True in my previous example ?

gbakalian commented 10 years ago

Also:

dates = pd.date_range(dt(2014,10,25,03,00), periods=24, freq='H')
ts = pd.Series(5,dates)
ts.tz_localize(tz, ambiguous=True)
jreback commented 10 years ago

@gbakalian out of curiosity why are you resampling over a DST transition with hourly data?

gbakalian commented 10 years ago

LOL indeed can sounds weird !

In my database I received several curves with different frequencies and different timezones from different weather providers. I need to get them at the same time zone and resample them hourly.

but it can happen that among all the curves I retrieve some which are already hourly.

if you prefer the following example also crash :

import pandas as pd
from datetime import datetime as dt
import pytz

tz = 'Europe/Berlin'
d = dt(2014,10,25,04,00)
tz = pytz.timezone(tz)
dates = pd.date_range(tz.localize(d), periods=12, freq='2H')
pd.Series(5,dates).resample('H')
jreback commented 10 years ago

thanks! that sounds like a good use case.

if you'd like to dig-in would welcome a PR to fix! It prob just not passing the argument thru

gbakalian commented 10 years ago

May I ask you a question:

The ambiguous parameters is set in which element ? - and is it kept ? (sorry if I am asking dumb questions)

because in the previous example I do not tell him anywhere that I want ambiguous=True, as I made the tz localisation using pytz and not tz_localize on pd.Series. So maybe we should be able to pass ambiguous through resample ?

dates = pd.date_range(d, periods=12, freq='2H')
ts = pd.Series(5, dates)
ts.tz_localize(tz, ambiguous=True)

this does not work neither so I guess in tz_localize_to_utc in the tslib.pyx file the ambiguous parameters is not passed or read correctly ?

Also I guess the 4 tz_localize between lines 418-428 in the index.py file should have ambiguous=ambiguous.

jreback commented 10 years ago

that sounds about right. here's some help for doing a pull-request: https://github.com/pydata/pandas/wiki

this prob has several aspects which are not handling this: date_range (manifesting in _generate) as you have seen, and resample so create appropriate tests (you can use the examples above).

have it fail, then fix.

rockg commented 10 years ago

True/False are not valid arguments to ambiguous when dealing with a DatetimeIndex. Use infer instead.

rockg commented 10 years ago

If you want to explicitly tell what times are ambiguous with multiple DST switches you need to pass in an array of True/False values.

Now, as to passing in ambiguous into the start/end times I think the right way to handle this is to pass in a non-ambiguous time into the constructor of the DatetimeIndex and then when it's localized further below everything should work properly. Otherwise you will maybe have multiple arguments that need to be passed with ambiguous.

rockg commented 10 years ago

Passing True/False into ambiguous is ambiguous in itself. I agree that the naming in the Timestamp case is a bit misleading but in the DatetimeIndex case it should not be. We tried to be as explicit as possible by forcing the user to fully specify DST/non-DST flags or using the infer/raise/NaT arguments to specify how those ambiguous times should be handled. In the case of the Timestamp, True/False are analogous to the DST/non-DST flags of the DatetimeIndex (and you will see that they are cast to an array when passed into the underlying localize function).

ischwabacher commented 10 years ago

You should not need an ambiguous= argument to resample here, since you're starting with a time zone aware Timeseries. The ambiguous= keyword is supposed to be used any time a naive time is being converted to an aware one; the fact that this is happening in resample indicates that aware times are being silently converted to naive ones, which is always a bug. (Grarrr!)

Passing True/False into ambiguous is ambiguous in itself.

Yeah, I've been kicking myself for fighting so hard for that name; I was forgetting that practicality beats purity. ambiguous=True is supposed to mean that ambiguous times are DST, except in circumstances involving political upheaval. But those exceptions are so rare that we should probably just use is_dst as the name, with a warning in the docs about the exceptions. Is rc1 too late to deprecate one kwarg and replace it with another?

rockg commented 10 years ago

I like is_dst for Timestamps but ambiguous for everything else makes a lot of sense.

jreback commented 10 years ago

I am ok with de-ambiguitizing (is that a word?) the arguments, and bringing back is_dst and leaving ambiguous. (that was the point, we needed both?) If one of you wants to do a pr real quick should be ok.

The resample issue (this issue) can wait for 0.15.1. its prob just doing something dumb somewhere.

rockg commented 10 years ago

Well there really are two uses. When you have one timestamp we should just try and replicate what tz.localize does in pytz. So that can be done in Timestamp's constructor/tz_localize. Ambiguous should remain for DatetimeIndex and is_dst should not be used.

I'm out of dodge for awhile so somebody else would have to do it.

ischwabacher commented 10 years ago

I really like having a consistent interface, though. Maybe the best answer is to use a categorical? Then you would have ambiguous='dst' or ambiguous='std' or ambiguous=Categorical(..., labels=['dst', 'std']). That would be less ambiguous without having to write is_dst='raise'.

jreback commented 10 years ago

currently

ambiguous : ‘infer’, bool-ndarray, ‘NaT’, default ‘raise’
- ‘infer’ will attempt to infer fall dst-transition hours based on order
- bool-ndarray where True signifies a DST time, False signifies a non-DST time (note that this flag is only applicable for ambiguous times)
- ‘NaT’ will return NaT where there are ambiguous times
- ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times

you are refering to the bool-array which has a certain 'convention' yes? I suppose you could also accept a Categorical, but maybe I am not understanding the case here

jreback commented 10 years ago

@ischwabacher you want to accept 'std'|'dst' as well which are basically infer_dst=False|True (respecitvely)

ischwabacher commented 10 years ago

Yes. It would just be for clarity; IIUC a Categorical with 2 levels is just a bool array with a little extra overhead. But maybe I'm trying to be Ruby here with my alternate spellings.

jreback commented 10 years ago

haha, no you can accept a Categorical as well!

ok, sounds like we can do this for 0.15.1 then (as you don't need to bring back infer_dst)?

@ischwabacher why don't you create a new issue for this change (if you want to add in dst|std for 0.15.0) ok as well, create an issue in any event

rockg commented 10 years ago

I still don't understand what the point of dst and std are. infer_dst=False is the default behavior where nothing is done which will raise when times are ambiguous and infer_dst=True is done with ambiguous='infer'. I think that Timestamps are fundamentally different than DatetimeIndexes as they have only one behavior and the infer|raise|NaT is invalid. We did kind of push the is_dst flags through the backend with ambiguous but the argument in the other thread was that this is really refined manipulation and didn't deserve its own keyword.

I completely agree that Categorical is too involved.

rockg commented 10 years ago

Okay, I think I understand. For Timestamps, ambiguous=dst|std would determine if the hour is dst or not. I think that's fine, but for DatetimeIndex I don't think these should be allowed, but maybe I'm missing how these would be use.

ischwabacher commented 10 years ago

I think that Timestamps are fundamentally different than DatetimeIndexes as they have only one behavior and the infer|raise|NaT is invalid.

Are there no vectorized operations on Timestamps that aren't mediated by a DatetimeIndex? I think raise and to a lesser extent NaT make sense with a single Timestamp, and what's important is that we have exactly one interface for converting naive times into aware ones. So actually maybe infer_dst does belong as a separate kwarg to DatetimeIndex.tz_localize. Doubts! Why must they show up now and not earlier?

I completely agree that Categorical is too involved.

I posted that and was immediately confronted with the horrible vision of Series.map coexisting with Series.collect... :/

rockg commented 10 years ago

I was thinking when constructing a single Timestamp, one should always know whether it's a dst or non-dst Timestamp and so it would never make sense to have those other keywords. However, when doing vectorized operations then I guess it does make sense. I still am not sold on having more than argument. I don't know of a case that we can't handle with the current one. I like your new naming for Timestamp but I don't see how that would apply to a DatetimeIndex (unless you have an array of 'dst', 'std' but that seems overkill). We could do something like ambiguous={'is_dst': [True, False, ...]} to make it more explicit.

ischwabacher commented 10 years ago

Solution: named constants pd.tslib.DST = True, pd.tslib.STD = False. Maybe that's a BoolEnum when enum is available. Then it becomes ambiguous=DST and everybody's happy.

rockg commented 10 years ago

I'm happy with that.

jorisvandenbossche commented 10 years ago

@ischwabacher I don't understand your last point on named constants. Can you give an example of how that would look like? What is the advantage of using ambiguous=pd.tslib.DST over just using ambiguous=True (and documenting clearly that True means dst) or ambiguous='dst'?

ischwabacher commented 10 years ago

I want to make the parameter grokkable. If you read is_dst=True|False|'infer' in the docstring, you know what that kwarg does without reading any further. If 'NaT' and 'raise' are also possibilities, you might need to read the next sentence, but it's still pretty clear. The problem is that the corner cases are unintuitive; either is_dst means what it says and some ambiguities can't be resolved (à la pytz), or is_dst actually means something slightly different and the corner cases are actively counterintuitive.

On the other hand, ambiguous=True isn't grokkable, and it still doesn't say what it means ("use the first aware time that has this naive representation). There's no way the user will guess the correct incantation without reading the docs. The only real advantage of ambiguous=DST over ambiguous='dst' is that the former can share code with the ndarray of bools case via broadcasting, but both of these suffer from the same problem as is_dst=True.

I'm also reconsidering the use case for an ndarray of bools in the case where is_dst does actually go the wrong way (i.e. the spring transition is a "fall back" for geopolitical reasons), since here it's awkward to expect the user to flip the ambiguous flag; you get an ambiguous array that looks like this:

                          actually STD            actually DST
                    +---------------------+  +--------------------+
ambiguous = Series([..., False, False, True, False, True, True, ...])
                                         ^     ^
                                  only these two matter

If we're dealing with vectorized operations in such cases, we probably want either 'infer' or an ndarray that actually specifies which times are DST and which aren't.

On the other hand, my original argument for ambiguous over is_dst still applies-- we don't want to be stuck with unresolvable ambiguities when there's a time zone discontinuity that isn't a DST boundary. But in that case, maybe it's just best to force the user to do this:

timestr = '2015-03-08 02:00:00'
naive = pd.Timestamp(timestr)
fixed = naive.tz_localize(pytz.FixedOffset(-7*60))
aware = fixed.tz_convert('Left_Coastia/Chicago')
worthy7 commented 1 year ago

Is this related to this bug? image

I am getting some data from the populate MetaTrader 5 platform, the data is in Cyprus time, which has DST. The clocks went forward on March 12th this year, I expected the data from there to be +3 it but remains +2.

Passing in GMT is the same, It all sticks in +0 instead of moving forward.