lpsmith / postgresql-simple

Mid-level client library for accessing PostgreSQL from Haskell
Other
206 stars 71 forks source link

Timezone information is lost with ZonedTimes #69

Closed lukehoersten closed 11 years ago

lukehoersten commented 11 years ago

The timezones CDT and EST are both -5 offsets from UTC. When serializing ZonedTimes timezones as an offset, the original timezone is lost. It looks like this is what's being done in postgresql-simple:

timeZoneToBuilder :: TimeZone -> Builder

It looks like deserialization has the same problem.

This has some practical problems. For example: I'm currently in CDT (daylight savings) so when I store a ZonedTime of the current time and then pull it back out of postgresql-simple, it's been shifted by an hour.

I'm also making the assumption that PGSQL actually stores the full timezone info (it looks like it does in the docs): http://www.postgresql.org/docs/9.2/static/datatype-datetime.html

lukehoersten commented 11 years ago

I'm using a timestamp with time zone to store a Haskell ZonedTime which I think is the same as a timestamptz.

Anyway, it shouldn't matter what local time zone PGSQL is using to internally store the timestamp as long as the original timezone (and DST) is preserved correctly. The docs say that when using a timezone abbreviation or offset, the daylight savings state needs to be specified. I can't find where that's happening in the postgresql-simple code:

In addition to the timezone names and abbreviations, PostgreSQL will accept POSIX-style time zone specifications of the form STDoffset or STDoffsetDST, where STD is a zone abbreviation, offset is a numeric offset in hours west from UTC, and DST is an optional daylight-savings zone abbreviation, assumed to stand for one hour ahead of the given offset. For example, if EST5EDT were not already a recognized zone name, it would be accepted and would be functionally equivalent to United States East Coast time. When a daylight-savings zone name is present, it is assumed to be used according to the same daylight-savings transition rules used in the zoneinfo time zone database's posixrules entry. In a standard PostgreSQL installation, posixrules is the same as US/Eastern, so that POSIX-style time zone specifications follow USA daylight-savings rules. If needed, you can adjust this behavior by replacing the posixrules file.

lpsmith commented 11 years ago

Yes, timestamptz is the same as timestamp with time zone, but no time zone information is stored on a per-value or per-column basis. All that the with time zone gets you is sane conversions to and from strings. (Honestly, there is rarely any legitimate reason to use without time zone.) When you provide an offset in the timestamp, the backend uses it to convert to UTC time, and when you retrieve a timestamptz from the database the backend will convert the timestamp to whatever the timezone connection variable specifies.

Here, I'll show you what I mean: (Note that if the offset is not specified, it is computed from the timezone connection variable, based upon the timezone's offset at that time. In most cases, you should provide the offset; postgresql-simple automatically includes an offset when sending UTCTime and ZonedTime parameters.)

$ psql
psql (9.2.4)
Type "help" for help.

lpsmith=> create temporary table timetest (x timestamp with time zone);
CREATE TABLE
lpsmith=> set timezone to 'America/New_York';
SET
lpsmith=> insert into timetest values ('2013-01-01 12:00'),('2013-06-01 12:00'); 
INSERT 0 2
lpsmith=> select * from timetest;
           x            
------------------------
 2013-01-01 12:00:00-05
 2013-06-01 12:00:00-04
(2 rows)

lpsmith=> set timezone to 'utc';
SET
lpsmith=> select * from timetest;
           x            
------------------------
 2013-01-01 17:00:00+00
 2013-06-01 16:00:00+00
(2 rows)

lpsmith=> set timezone to 'America/Chicago';
SET
lpsmith=> select * from timetest;
           x            
------------------------
 2013-01-01 11:00:00-06
 2013-06-01 11:00:00-05
(2 rows)

As for ZonedTime and timestamp with time zone, they are only equivalent in the sense that they both specify an unambiguous point in time. That's why you can convert timestamp with time zone to ZonedTime and UTCTime.

However, the meaning of LocalTime and timestamp without time zone can only be fully understood in some larger context. That's why postgresql-simple-0.2 removed the conversion from timestamp without time zone to UTCTime.

lukehoersten commented 11 years ago

Wow thanks a lot. I did not realize PGSQL worked that way. That's unfortunate. I actually need to store the timezone myself then. Thanks for explaining this!

lpsmith commented 11 years ago

Well, if you want some time trivia for future reference, here's a difference between choosing different cities in the same time zone:

$ psql
psql (9.2.4)
Type "help" for help.

lpsmith=> set timezone to 'America/New_York';
SET
lpsmith=> select '1883-11-18 11:59'::timestamptz;
         timestamptz          
------------------------------
 1883-11-18 11:59:00-04:56:02
(1 row)

lpsmith=> select '1883-11-18 12:00'::timestamptz;
      timestamptz       
------------------------
 1883-11-18 12:00:00-05
(1 row)

lpsmith=> set timezone to 'America/Indiana/Indianapolis';
SET
lpsmith=> select '1883-11-18 11:59'::timestamptz;
         timestamptz          
------------------------------
 1883-11-18 11:59:00-05:44:38
(1 row)

lpsmith=> select '1883-11-18 12:00'::timestamptz;
      timestamptz       
------------------------
 1883-11-18 12:00:00-06
(1 row)

1883-11-18 at noon is when Standard Railway Time went live, which is significant in the US at least. Very old timestamps from around the world exhibit similar behaviors based on local time standards; e.g. 1941-10-01 at midnight is when Kolkata seems to have adopted Indian Standard Time.

Postgresql-simple does not (yet?) support retrieving these timestamps, though. That will be tricky, because ZonedTime only supports offsets in an integral number of minutes.

Here's another example dealing with historical timestamps. In 1942 parts of Indiana, including Indianapolis, moved from central to eastern time, and in 2006 Indiana adopted daylight savings time:

lpsmith=> set timezone to 'America/Indiana/Indianapolis';
SET
lpsmith=> select '1942-02-09 01:59'::timestamptz;
      timestamptz       
------------------------
 1942-02-09 01:59:00-06
(1 row)

lpsmith=> select '1942-02-09 02:00'::timestamptz;
      timestamptz       
------------------------
 1942-02-09 03:00:00-05
(1 row)

lpsmith=> select '2005-06-01 12:00'::timestamptz;
      timestamptz       
------------------------
 2005-06-01 12:00:00-05
(1 row)

lpsmith=> select '2006-06-01 12:00'::timestamptz;
      timestamptz       
------------------------
 2006-06-01 12:00:00-04
chrisdone commented 11 years ago

Idea for people with time on their hands: define a zonedtime composite type: http://www.postgresql.org/docs/8.1/static/rowtypes.html and define a FromField instance for ZonedTime like that.

lukehoersten commented 11 years ago

That's essentially what I'm doing by hand now. Good suggestion. I'd be nice if we could get the PG devs to make pg_timezone_abbrev into a table instead of just a view, we could reference the timezone by ID instead of fully expanded. Also note that pg_timezone_name changes depending on the current time so stick with the _abbrev view.

ghost commented 11 years ago

What do you mean "reference the timezone by ID"? The pg_timezone_abbrev view is just a convenience wrapper over the internal pg_timezone_abbrev() function.

lukehoersten commented 11 years ago

Since the tz information isn't stored in a table with a pkey id, it can't be referenced by a forgein relation. Instead a string of the tz name, offset, and dst status need to be stored everytime the original tz is necessary.

ghost commented 11 years ago

If it were a table, the primary key would be the abbrev column, so depending on your use case you may be ok to just store that. I agree it is annoying to not be able to use a foreign key, but storing the row doesn't really solve that. pg_timezone_abbrevs is implemented as a function because you can configure it at runtime to restrict which timezones are available, if it were a table you wouldn't be able to remove timezones any more. It sounds like what you want is a snapshot of a particular state of the pg_timezone_abbrevs function stored as a table. You can already do that:

create table my_timezones as select * from pg_timezone_abbrevs;
alter table my_timezones add primary key (abbrev);
lukehoersten commented 11 years ago

Timezone abbreviations are not enough to uniquely identify timezones so using it as a primary key wouldn't make sense: http://en.wikipedia.org/wiki/List_of_time_zone_abbreviations CST has heavy overlap for example.

Taking a snapshot of the function in the manor you suggestion also wouldn't make sense because the table wouldn't reflect timezone changes over time.

If you want to discuss this further, I suggest email or Luke on IRC.

ghost commented 11 years ago

Postgresql loads timezone data from text files. It refuses to load it if there are duplicate definitions for an abbreviation. So the abbreviation is unique in the function/view in question. If it was not unique, then the primary key would be all 3 columns, so you'd be storing them all anyways.

Taking a snapshot is precisely what you would have to do to be able to have a foreign key, that's what I mean. The output of that function can change, entirely external to the database. There is no way to safely represent that as a table which you could have a foreign key to. What would happen to your references to CST when someone deletes CST from the text file? If you want to be able to reference it safely, then you have to reference a snapshot of the data as it was at a particular time. If you want to reference data that changes external to the database, then a foreign key isn't an option, you'd have to copy the offset anyways to preserve it in case the timezone is deleted. I don't think there is a way for the pg devs to make it work as a table without it being a snapshot that doesn't reflect timezone changes over time.

lpsmith commented 11 years ago

I think this is an ok place to discuss it further, either here or the database-devel mailing list. This way the conversation is easier to find later. =)

lukehoersten commented 11 years ago

Regarding using timezone abbreviations as unique keys: It doesn't matter how PG handles timezone abbreviations. They aren't unique. If I try to reference a timezone with CST, it could mean multiple timezones. Using a timezone abbreviation will never work as a unique key.

Regarding the table to facilitate a foreign key reference: Imagine a function f(time, timezone_name) which returns a timezone unique ID. Since it's a pure function, you can make it into a table. That's the table I want.

lpsmith commented 9 years ago

Oh, I recently-ish ran across timestampandtz, which is a third-party extension that implements what Luke wanted in this case. It's pretty new, and I don't know how good it is, or not.

As for reading the documentation, I strongly recommend the github repo.

Arguably, there would also be use cases for timestampandoffset as well, which would store a local timestamp and offset from UTC.

lukehoersten commented 9 years ago

I ended up using this:

CREATE TYPE timezone AS (
    utc_offset_minutes integer,
    is_dst boolean,
    name text
);

CREATE TYPE zonedtime AS (
    "timestamp" timestamp with time zone,
    timezone timezone
);

Which you're right, @lpsmith that does look like what I was looking for.

lpsmith commented 9 years ago

Well, your type does not correspond to the timestampandtz extension, rather it corresponds to Haskell's ZonedTime. Admittedly, ZonedTime is a bit misnamed, and would be more accurately called TimeWithOffset or something, although something shorter and more pithy would be nice.

The difference is that timestampandtz stores a timezone, while ZonedTime stores an offset. A timezone maps UTCTime to a particular offset, which isn't constant over time, so there's ultimately more information wrapped up in a timezone. PostgreSQL uses the IANA timezone database.

There are a couple of tradeoffs between storing a timezone versus storing an offset. I can definitely see use cases for both variants, though.