opengridcc / opengrid-dev

Open source building monitoring, analysis and control

Lost timezone info when reading from cache #135

Open JrtPec opened 8 years ago

JrtPec commented 8 years ago

TL;DR: there is a bug in iso8601. It has been fixed upstream, but the fix hasn't been published on PyPI yet.

Here I go again with the long story of a wild bug chase:

By default, pandas.read_csv() does not "really" support timezone-aware timestamps. I'm using quotes because pandas does parse the offsets, but it converts everything to UTC and returns the index timezone-naive. In caching.py, @saroele has added the line df.index = df.index.tz_localize('UTC'), so technically you are returned the correct time, albeit in UTC instead of the timezone you used when you did the caching.

I'm caching weather data per day, my timestamps look something like 2016-04-20 00:00:00+02:00, and after caching they look like 2016-04-19 22:00:00+00:00. This gives me really stupid errors and a lot of headaches when I try to compare dates.
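For reference, a minimal script along these lines reproduces it (file name and column are just placeholders, not the actual caching.py code):

```python
import pandas as pd

# A tz-aware daily index, like the Forecast.io weather data.
index = pd.date_range('2016-04-20', periods=3, freq='D', tz='Europe/Brussels')
df = pd.DataFrame({'temperature': [10.1, 11.3, 9.8]}, index=index)
df.to_csv('cache.csv')

# Default parsing: the offsets are read, but everything is shifted to UTC
# and returned timezone-naive, hence the tz_localize('UTC') in caching.py.
cached = pd.read_csv('cache.csv', index_col=0, parse_dates=True)
cached.index = cached.index.tz_localize('UTC')

print(df.index[0])      # 2016-04-20 00:00:00+02:00
print(cached.index[0])  # 2016-04-19 22:00:00+00:00 on the pandas we're on
```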

The solution is easy: instead of using the default pandas date parser, you can set the date_parser argument in pandas.read_csv() to use the iso8601 library, like so: pandas.read_csv(date_parser=iso8601.parse_date). This works great: it parses the timezone-aware timestamp and uses a FixedOffset to represent the timezone.
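So something like this (same placeholder cache file as above):

```python
import iso8601
import pandas as pd

# Let iso8601 do the parsing instead of pandas' default date parser.
cached = pd.read_csv('cache.csv', index_col=0, parse_dates=True,
                     date_parser=iso8601.parse_date)

print(cached.index[0])  # 2016-04-20 00:00:00+02:00, tzinfo is a FixedOffset
```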

However, when I try .truncate() or .loc on the resulting frame, iso8601 gets stuck in an infinite loop... a bug that was fixed upstream on 2015-11-18 and will be included in the NEXT RELEASE!

So I'm writing all this down so that I don't forget to check the iso8601 PyPI page someday in the future to see whether they have released version 0.1.12... In the meantime I'll figure out some workaround... bleeurg I hate timezones

JrtPec commented 8 years ago

Or, instead of using the iso8601 package I could just use dateutil.parser.parse.
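i.e. the same read_csv call, just with a different parser:

```python
import dateutil.parser
import pandas as pd

# dateutil also returns tz-aware datetimes with a fixed offset.
cached = pd.read_csv('cache.csv', index_col=0, parse_dates=True,
                     date_parser=dateutil.parser.parse)
```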

I need a drink.

saroele commented 8 years ago

And you deserve one. +1 for "bleeurg I hate timezones"

The question is: what is the desired behaviour if you ask for the cached consumption of 20/04/2016? I agree that you expect to get the data of that day, in local time. Timezones and dates get messed up, as you have illustrated in your example above.

An often-used principle is to store everything in UTC. This works for e.g. hourly time series. However, caching daily data in UTC may have been a wrong idea, and the datestamps (and data) should be taken according to local time... I guess we can discuss this at grid:camp :-)

JrtPec commented 8 years ago

You are correct about the desired behaviour: this is exactly what Forecast.io does. It returns its 'daily' report with a timestamp at midnight, local time. If you store this in UTC you can get a date mismatch, so you'd need to localise it again, but from what? The timezone information has been thrown away.

So there are two solutions: do tz-aware caching, or store the desired timezone for each site in the Houseprint (which will bring along a whole different set of issues, I'm sure).
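Roughly what the second option would look like (site_timezone is a hypothetical field here; in practice it would have to come from the Houseprint):

```python
import pandas as pd

site_timezone = 'Europe/Brussels'  # would be stored per site in the Houseprint

# The cache stays in UTC; the stored timezone is used to localise on read.
cached = pd.read_csv('cache.csv', index_col=0, parse_dates=True)
cached.index = cached.index.tz_localize('UTC').tz_convert(site_timezone)

print(cached.index[0])  # back to 2016-04-20 00:00:00+02:00
```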

saroele commented 8 years ago

We need to combine both solutions. Timezone is a sensor characteristic. Thank god we work with buildings and not container tracking sensors :-) So we need to have it at sensor level, and then for redundancy reasons, we can include it in the timestamps in the cache.


JrtPec commented 8 years ago

The saga continues:

A pandas DatetimeIndex only allows either a fully localised index (e.g. 'UTC' or 'Europe/Brussels') or an index with a single fixed offset (e.g. +02:00) for all timestamps.

So when we cache a localised index that spans a change from/to DST, the parser cannot convert the timestamps back into a tz-aware index, because the offset changes from +01:00 to +02:00 halfway through.
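For the record, going through UTC does work, but only if you already know which timezone to convert back to (a quick sketch, not what caching.py currently does):

```python
import pandas as pd

# Two cached daily timestamps straddling the switch to DST in 2016.
stamps = ['2016-03-27 00:00:00+01:00', '2016-03-28 00:00:00+02:00']

# Parsing these as-is cannot give a single fixed-offset DatetimeIndex,
# because the offsets differ. Forcing UTC first and then converting to a
# proper timezone does give a valid tz-aware index again:
index = pd.to_datetime(stamps, utc=True).tz_convert('Europe/Brussels')
print(index)  # both days at local midnight, each with its own DST offset
```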

So this might be the final straw for me; I'm inclined to say that we do have to save everything in UTC and add a timezone field to the Houseprint...

JrtPec commented 8 years ago

I have a solution: instead of saving to CSV and having to go through parsing everything, why don't we just save to pickle? That way we're sure the data is read from the cache in exactly the same format as we wrote it in the first place.
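A quick check of what I mean (placeholder file name):

```python
import pandas as pd

index = pd.date_range('2016-04-20', periods=3, freq='D', tz='Europe/Brussels')
df = pd.DataFrame({'temperature': [10.1, 11.3, 9.8]}, index=index)

# Pickle round trip: no parsing involved, so the tz-aware index comes back
# exactly as it was written.
df.to_pickle('cache.pkl')
cached = pd.read_pickle('cache.pkl')

assert cached.index.equals(df.index)
```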

saroele commented 8 years ago

The advantage of CSV is that it also happens to be useful for Excel champions and for import into other tools or platforms. With pickle, we lose this again. Maybe the pandas-json-pandas route works? And JSON is somewhat more universal than pickle.


JrtPec commented 8 years ago

Caching to JSON exhibits the same behaviour as caching to CSV. I expect any format where pandas has to do some parsing will behave the same way.
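This is the kind of round trip I tried (placeholder names; the exact output depends on the pandas version, but the local offsets don't survive):

```python
import pandas as pd

index = pd.date_range('2016-04-20', periods=3, freq='D', tz='Europe/Brussels')
df = pd.DataFrame({'temperature': [10.1, 11.3, 9.8]}, index=index)

# date_format='iso' keeps the offsets in the JSON file, but read_json still
# hands the index back converted to UTC, just like read_csv does.
df.to_json('cache.json', date_format='iso')
cached = pd.read_json('cache.json')

print(df.index[0])      # 2016-04-20 00:00:00+02:00
print(cached.index[0])  # shifted to UTC, no longer local midnight
```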