glamod / glamod-ingest

Database preparation and ingestion for GLAMOD
BSD 2-Clause "Simplified" License
2 stars 1 forks source link

Timezone in the `date_time` column - is not always "+00:00" (UTC) - but querying does not work #73

Closed agstephens closed 3 years ago

agstephens commented 3 years ago

We have found, particularly in lots of old data, that the date_time column includes timezones that are not UTC.

For example:


cdm=> select station_name, date_time from lite_2_0.observations_1755_land_0 WHERE date_time BETWEEN '1755-01-01 06:00:00'::timestamptz AND '1755-01-03 07:00:00+00:00'::timestamptz;
 station_name |          date_time
--------------+------------------------------
 BASEL        | 1755-01-01 06:27:45-00:01:15
 BASEL        | 1755-01-01 11:27:45-00:01:15
 BASEL        | 1755-01-02 02:27:45-00:01:15
 BASEL        | 1755-01-02 06:27:45-00:01:15
 BASEL        | 1755-01-02 11:27:45-00:01:15
 BASEL        | 1755-01-03 02:27:45-00:01:15
 BASEL        | 1755-01-03 06:27:45-00:01:15
(7 rows)

Here are the issues...:

  1. The current query mechanism does not find this data.
  2. Those timezone values (-00:01:15) look like they might be incorrect - please check the source data - is this correct?

The CDS team (Hans/Gionata) have said:

  1. We should return only UTC data to the user.
  2. We might need a new column - with cleaned UTC-fixed data.
  3. Should we be fixing this issue when we restructure the data? Or in the source data?
  4. Where is this issue coming from? How much of the data does it cover?
rjhd2 commented 3 years ago

Checking the this file /gws/nopw/j04/c3s311a_lot2/data/level2/land/r202005/cdm_lite/sub_daily/CDM_lite_SecondRelease_SZUbasel-20.psv.gz

this is the first line (apologies for the wrapping):

SZUbasel-20-1-1755-01-01-06:29-57-12|0|1755-01-01 06:29:00+00|1|47.56|7.58|2|57|032|99248|12|0||1|SZUbasel-20|BASEL|0|1

The timestamp in the above 1755-01-01 06:29:00+00 does match what you have in the ticket 1755-01-01 06:27:45-00:01:15. For some reason 06:29:00 is being presented as 06:27:45 with 00:01:15 offset. If these are all very early dates, and we are showing times down to the second, would leap seconds have anything to do with it.

It's my understanding that all times will be UTC when converted to IFF, so although there is space for at timezone, this shouldn't be filled.

agstephens commented 3 years ago

It looks like these offsets are not injected by the restructure code...

$ python scripts/land/restructure-land.py -r r2.0 -y 1755 -b sub_daily-CDM_lite_SecondRelease_SZUbasel-^C$ more /work/scratch-nopw/astephen/glamod/r2.0/cdmlite/prepare/land/0/1755/0-1755-sub_daily-CDM_lite_SecondRelease_SZUbasel-.psv | cut -d\| -f3 | cut -d\+ -f2 | sort -u 
0000
date_time

So, that is not the problem. We need to investigate further.

agstephens commented 3 years ago
$ grep -i "SZUbasel-20-1-1755-01-01-06:29-57-12" /work/scratch-nopw/astephen/glamod/r2.0/cdmlite/prepare/land/0/1755/0-1755-sub_daily-CDM_lite_SecondRelease_SZUbasel-.psv 
SZUbasel-20-1-1755-01-01-06:29-57-12|1|1755-01-01 06:29:00+0000|1|0|7.580|47.560|0|2|57|32|99248|12|NULL|1|SZUbasel-20|BASEL|0|279|SRID=4326;POINT(7.580 47.560)
agstephens commented 3 years ago

The plan to fix this is:

  1. Change the database structure/schema so that date_time is of type timestamp (without timezone) (see: https://www.postgresql.org/docs/11/datatype-datetime.html)
  2. Change all triggers, partitions, constraints to match (1)
  3. When parsing the input data:
    • Parse date-times with timezone
    • Assert all timezones are "+00" (i.e. UTC) - else raise Exception
    • Transform date-times to remove timezone
    • Export PSV file with simple date-times (no timezone)

@jhaigh0: please prototype some functions, using a pandas DataFrame to do:

  1. Check timezones - of a given column in a DataFrame
  2. Remove timezones - from a given date-time column in a DataFrame
agstephens commented 3 years ago

Further investigation into why this was writing the wrong datetime in the DB...

Input data (PSV file):

$ cat test.psv 

observation_id|data_policy_licence|date_time|date_time_meaning|observation_duration|longitude|latitude|report_type|height_above_surface|observed_variable|units|observation_value|value_significance|platform_type|station_type|primary_station_id|station_name|quality_flag|source_id|location

ZZZbasel-20-1-1755-01-01-06:29-57-12|1|1755-01-01 06:29:00+0000|1|0|7.580|47.560|0|2|57|32|99248|12|NULL|1|ZZZbasel-20|ZZZBASEL|0|279|SRID=4326;POINT(7.580 47.560)

Write to table lite_2_0.observations_1755_land_0 with:

$ $PSQL_PREFIX -c "\COPY lite_2_0.observations_1755_land_0 FROM 'test.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'"
COPY 1

Show how the date_time value in the DB:

$ $PSQL_PREFIX -t -c "SELECT date_time FROM lite_2_0.observations_1755_land_0 WHERE observation_id LIKE 'ZZZ%';"
 1755-01-01 06:27:45-00:01:15

Delete that record again with:

$ $PSQL_PREFIX -c "DELETE FROM lite_2_0.observations_1755_land_0 WHERE observation_id LIKE 'ZZZ%';"
DELETE 1

Summary: in 1755, there is a strange error introduced when loading the date_time: 1755-01-01 06:29:00+0000

agstephens commented 3 years ago

So, next investigation, why does loading this into Postgres cause this conversion:

1755-01-01 06:29:00+0000 --> 1755-01-01 06:27:45-00:01:15

Let's try some others and record the results:

1755-01-01 06:29:00 --> 1755-01-01 06:29:00-00:01:15
1755-01-01 06:29:00-00:01:00 --> 1755-01-01 06:28:45-00:01:15

This issue is discussed here: https://github.com/dotnet/runtime/issues/11718

agstephens commented 3 years ago

Looks like this is an issue with the timezone setting in Postgres, see:

https://stackoverflow.com/questions/6663765/postgres-default-timezone

agstephens commented 3 years ago

@jhaigh0 this issue may be relevant to the issue you are looking at.

agstephens commented 3 years ago

So, it turns out that the TIMEZONE setting in Postgres is key. I have altered it on the server:

$ vi /var/database/data/postgresql.conf
$ systemctl restart postgresql-11.service
$ grep UTC /var/database/data/postgresql.conf
timezone = 'UTC'

After, that I get these results:

$ cat test.psv  | grep 1755
ZZZbasel-20-1-1755-01-01-06:29-57-12|1|1755-01-01 06:29:00+0000|1|0|7.580|47.560|0|2|57|32|99248|12|NULL|1|ZZZbasel-20|ZZZBASEL|0|279|SRID=4326;POINT(7.580 47.560)
$ $PSQL_PREFIX -c "\COPY lite_2_0.observations_1755_land_0 FROM 'test.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'"
COPY 1
$ $PSQL_PREFIX -t -c "SELECT date_time FROM lite_2_0.observations_1755_land_0 WHERE observation_id LIKE 'ZZZ%';"
 1755-01-01 06:29:00+00
agstephens commented 3 years ago

So, the solution is:

1. To fix new data

Set the TIMEZONE in Postgres configuration:

$ grep UTC /var/database/data/postgresql.conf
timezone = 'UTC'

2. To fix existing DB tables

To be fixed in:

https://github.com/glamod/glamod-ingest/issues/75

agstephens commented 3 years ago

Do not close until we have tested playbook works.

agstephens commented 3 years ago

Playbook tested, all good.