openmeteo / enhydris

A database with a web interface for the storage and management of hydro/meteorological measurements and time series
GNU Affero General Public License v3.0
16 stars 11 forks source link

Time zone mess #480

Closed aptiko closed 1 year ago

aptiko commented 1 year ago

Since we migrated to timescale, the way we treat time zones is incorrect and can result in corruption to the data.

Introduction to PostgreSQL's TIMESTAMP WITH TIME ZONE

There is no difference in how TIMESTAMP WITH TIME ZONE and TIMESTAMP WITHOUT TIME ZONE are stored. Both are 8 bytes, an int64 with the number of microseconds since 2000-01-01 00:00:00. The time zone is not stored anywhere.

The only difference is in what the timestamp represents. In TIMESTAMP WITHOUT TIME ZONE, the stored timestamp represents a date and time in an unknown time zone. In TIMESTAMP WITH TIME ZONE, the stored timestamp represents a date and time in UTC. When retrieving a TIMESTAMP WITH TIME ZONE (e.g. with psql), we can use AT TIME ZONE to convert it to some other time zone, otherwise it is converted to a default time zone, such as that specified by the TZ environment variable. Likewise, when saving a TIMESTAMP WITH TIME ZONE, it is converted to UTC; the time zone from which it is converted, if not specified in the input (e.g. in a literal), is, again, some default time zone.

When a TIMESTAMP WITH TIME ZONE is retrieved into a Django DateTimeField, it is not clear which time zone it's using (maybe always UTC, or it may be affected by PostgreSQL settings and/or Django settings), but it doesn't matter anyway because, for the Enhydris use case, it has to be converted to another time zone anyway, which can be done with the .astimezone() method.

What Enhydris does

TimeseriesGroup has a time_zone attribute. When a time series is being saved from a htimeseries object (whose timestamps are naive), the timestamps are considered to be in in the timeseries group's time zone. Conversely, when a time series is being retrieved into a htimeseries object, the timestamps are converted to the timeseries group's time zone.

While this behaviour looks reasonable, it is not what was intended. TimeseriesGroup.time_zone was intended to be informational. The intention was that naive timestamps would be stored naively, and that if you stored 2022-10-10 19:30, then if you retrieved that it would be 2022-10-10 19:30. But this is not how it works. If the group's time zone is EET and you store that timestamp and then you change the group's time zone to CET and you retrieve the timestamp again, you'll get 18:30 instead. If you have an automatically updated time series with time zone MSK and one day you see it and say "hey, that's wrong!" and you change it to CET, it will continue to update the data but now the dates will mean something different.

How to fix these problems

Fixing these problems will also simplify the system much: