Closed gbakalian closed 6 years ago
try this using 0.15.0rc1 http://pandas.pydata.org/
their is now an ambiguous
option to do this (which has several options for it)
In [16]: x.tz_localize(tz,ambiguous=True)
Out[16]: Timestamp('2014-10-26 02:00:00+0200', tz='Europe/Berlin')
In [17]: x.tz_localize(tz,ambiguous='NaT')
Out[17]: NaT
In [18]: x.tz_localize(tz,ambiguous=False)
Out[18]: Timestamp('2014-10-26 02:00:00+0100', tz='Europe/Berlin')
cc @rockg cc @ischwabacher
I have updated the version - your example work fine. But i still have an issue when trying to resample:
import pandas as pd
from datetime import datetime as dt
import pytz
tz = 'Europe/Berlin'
d = dt(2014,10,25,03,00)
tz = pytz.timezone(tz)
dates = pd.date_range(tz.localize(d), periods=24, freq='H')
pd.Series(5,dates).resample('H')
How can I 'set' ambiguous equal True in my previous example ?
Also:
dates = pd.date_range(dt(2014,10,25,03,00), periods=24, freq='H')
ts = pd.Series(5,dates)
ts.tz_localize(tz, ambiguous=True)
@gbakalian out of curiosity why are you resampling over a DST transition with hourly data?
LOL indeed can sounds weird !
In my database I received several curves with different frequencies and different timezones from different weather providers. I need to get them at the same time zone and resample them hourly.
but it can happen that among all the curves I retrieve some which are already hourly.
if you prefer the following example also crash :
import pandas as pd
from datetime import datetime as dt
import pytz
tz = 'Europe/Berlin'
d = dt(2014,10,25,04,00)
tz = pytz.timezone(tz)
dates = pd.date_range(tz.localize(d), periods=12, freq='2H')
pd.Series(5,dates).resample('H')
thanks! that sounds like a good use case.
if you'd like to dig-in would welcome a PR to fix! It prob just not passing the argument thru
May I ask you a question:
The ambiguous parameters is set in which element ? - and is it kept ? (sorry if I am asking dumb questions)
because in the previous example I do not tell him anywhere that I want ambiguous=True, as I made the tz localisation using pytz and not tz_localize on pd.Series. So maybe we should be able to pass ambiguous through resample ?
dates = pd.date_range(d, periods=12, freq='2H')
ts = pd.Series(5, dates)
ts.tz_localize(tz, ambiguous=True)
this does not work neither so I guess in tz_localize_to_utc in the tslib.pyx file the ambiguous parameters is not passed or read correctly ?
Also I guess the 4 tz_localize between lines 418-428 in the index.py file should have ambiguous=ambiguous.
that sounds about right. here's some help for doing a pull-request: https://github.com/pydata/pandas/wiki
this prob has several aspects which are not handling this: date_range
(manifesting in _generate
) as you have seen, and resample
so create appropriate tests (you can use the examples above).
have it fail, then fix.
True/False are not valid arguments to ambiguous when dealing with a DatetimeIndex. Use infer
instead.
If you want to explicitly tell what times are ambiguous with multiple DST switches you need to pass in an array of True/False values.
Now, as to passing in ambiguous into the start/end times I think the right way to handle this is to pass in a non-ambiguous time into the constructor of the DatetimeIndex and then when it's localized further below everything should work properly. Otherwise you will maybe have multiple arguments that need to be passed with ambiguous.
Passing True/False into ambiguous is ambiguous in itself. I agree that the naming in the Timestamp case is a bit misleading but in the DatetimeIndex case it should not be. We tried to be as explicit as possible by forcing the user to fully specify DST/non-DST flags or using the infer/raise/NaT arguments to specify how those ambiguous times should be handled. In the case of the Timestamp, True/False are analogous to the DST/non-DST flags of the DatetimeIndex (and you will see that they are cast to an array when passed into the underlying localize function).
You should not need an ambiguous=
argument to resample here, since you're starting with a time zone aware Timeseries
. The ambiguous=
keyword is supposed to be used any time a naive time is being converted to an aware one; the fact that this is happening in resample
indicates that aware times are being silently converted to naive ones, which is always a bug. (Grarrr!)
Passing True/False into ambiguous is ambiguous in itself.
Yeah, I've been kicking myself for fighting so hard for that name; I was forgetting that practicality beats purity. ambiguous=True
is supposed to mean that ambiguous times are DST, except in circumstances involving political upheaval. But those exceptions are so rare that we should probably just use is_dst
as the name, with a warning in the docs about the exceptions. Is rc1 too late to deprecate one kwarg and replace it with another?
I like is_dst
for Timestamps but ambiguous
for everything else makes a lot of sense.
I am ok with de-ambiguitizing (is that a word?) the arguments, and bringing back is_dst
and leaving ambiguous
. (that was the point, we needed both?) If one of you wants to do a pr real quick should be ok.
The resample issue (this issue) can wait for 0.15.1. its prob just doing something dumb somewhere.
Well there really are two uses. When you have one timestamp we should just try and replicate what tz.localize does in pytz. So that can be done in Timestamp's constructor/tz_localize. Ambiguous should remain for DatetimeIndex and is_dst should not be used.
I'm out of dodge for awhile so somebody else would have to do it.
I really like having a consistent interface, though. Maybe the best answer is to use a categorical? Then you would have ambiguous='dst'
or ambiguous='std'
or ambiguous=Categorical(..., labels=['dst', 'std'])
. That would be less ambiguous without having to write is_dst='raise'
.
currently
ambiguous : ‘infer’, bool-ndarray, ‘NaT’, default ‘raise’
- ‘infer’ will attempt to infer fall dst-transition hours based on order
- bool-ndarray where True signifies a DST time, False signifies a non-DST time (note that this flag is only applicable for ambiguous times)
- ‘NaT’ will return NaT where there are ambiguous times
- ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times
you are refering to the bool-array
which has a certain 'convention' yes?
I suppose you could also accept a Categorical
, but maybe I am not understanding the case here
@ischwabacher you want to accept 'std'|'dst'
as well which are basically infer_dst=False|True
(respecitvely)
Yes. It would just be for clarity; IIUC a Categorical
with 2 levels is just a bool array with a little extra overhead. But maybe I'm trying to be Ruby here with my alternate spellings.
haha, no you can accept a Categorical
as well!
ok, sounds like we can do this for 0.15.1 then (as you don't need to bring back infer_dst
)?
@ischwabacher why don't you create a new issue for this change (if you want to add in dst|std
for 0.15.0) ok as well, create an issue in any event
I still don't understand what the point of dst
and std
are. infer_dst=False
is the default behavior where nothing is done which will raise when times are ambiguous and infer_dst=True
is done with ambiguous='infer'
. I think that Timestamps are fundamentally different than DatetimeIndexes as they have only one behavior and the infer|raise|NaT
is invalid. We did kind of push the is_dst flags through the backend with ambiguous but the argument in the other thread was that this is really refined manipulation and didn't deserve its own keyword.
I completely agree that Categorical is too involved.
Okay, I think I understand. For Timestamps, ambiguous=dst|std
would determine if the hour is dst or not. I think that's fine, but for DatetimeIndex I don't think these should be allowed, but maybe I'm missing how these would be use.
I think that Timestamps are fundamentally different than DatetimeIndexes as they have only one behavior and the
infer|raise|NaT
is invalid.
Are there no vectorized operations on Timestamp
s that aren't mediated by a DatetimeIndex
? I think raise
and to a lesser extent NaT
make sense with a single Timestamp
, and what's important is that we have exactly one interface for converting naive times into aware ones. So actually maybe infer_dst
does belong as a separate kwarg to DatetimeIndex.tz_localize
. Doubts! Why must they show up now and not earlier?
I completely agree that Categorical is too involved.
I posted that and was immediately confronted with the horrible vision of Series.map
coexisting with Series.collect
... :/
I was thinking when constructing a single Timestamp, one should always know whether it's a dst or non-dst Timestamp and so it would never make sense to have those other keywords. However, when doing vectorized operations then I guess it does make sense. I still am not sold on having more than argument. I don't know of a case that we can't handle with the current one. I like your new naming for Timestamp but I don't see how that would apply to a DatetimeIndex (unless you have an array of 'dst', 'std' but that seems overkill). We could do something like ambiguous={'is_dst': [True, False, ...]}
to make it more explicit.
Solution: named constants pd.tslib.DST
= True
, pd.tslib.STD
= False
. Maybe that's a BoolEnum
when enum
is available. Then it becomes ambiguous=DST
and everybody's happy.
I'm happy with that.
@ischwabacher I don't understand your last point on named constants. Can you give an example of how that would look like?
What is the advantage of using ambiguous=pd.tslib.DST
over just using ambiguous=True
(and documenting clearly that True means dst) or ambiguous='dst'
?
I want to make the parameter grokkable. If you read is_dst=True|False|'infer'
in the docstring, you know what that kwarg does without reading any further. If 'NaT'
and 'raise'
are also possibilities, you might need to read the next sentence, but it's still pretty clear. The problem is that the corner cases are unintuitive; either is_dst
means what it says and some ambiguities can't be resolved (à la pytz), or is_dst
actually means something slightly different and the corner cases are actively counterintuitive.
On the other hand, ambiguous=True
isn't grokkable, and it still doesn't say what it means ("use the first aware time that has this naive representation). There's no way the user will guess the correct incantation without reading the docs. The only real advantage of ambiguous=DST
over ambiguous='dst'
is that the former can share code with the ndarray
of bools
case via broadcasting, but both of these suffer from the same problem as is_dst=True
.
I'm also reconsidering the use case for an ndarray
of bool
s in the case where is_dst
does actually go the wrong way (i.e. the spring transition is a "fall back" for geopolitical reasons), since here it's awkward to expect the user to flip the ambiguous
flag; you get an ambiguous
array that looks like this:
actually STD actually DST
+---------------------+ +--------------------+
ambiguous = Series([..., False, False, True, False, True, True, ...])
^ ^
only these two matter
If we're dealing with vectorized operations in such cases, we probably want either 'infer'
or an ndarray
that actually specifies which times are DST and which aren't.
On the other hand, my original argument for ambiguous
over is_dst
still applies-- we don't want to be stuck with unresolvable ambiguities when there's a time zone discontinuity that isn't a DST boundary. But in that case, maybe it's just best to force the user to do this:
timestr = '2015-03-08 02:00:00'
naive = pd.Timestamp(timestr)
fixed = naive.tz_localize(pytz.FixedOffset(-7*60))
aware = fixed.tz_convert('Left_Coastia/Chicago')
Is this related to this bug?
I am getting some data from the populate MetaTrader 5 platform, the data is in Cyprus time, which has DST. The clocks went forward on March 12th this year, I expected the data from there to be +3
it but remains +2.
Passing in GMT is the same, It all sticks in +0 instead of moving forward.
Works with datetime.datetime, does not work with pandas.tslib.Timestamp. Encounter the issue when trying to resample hourly a series.