glamod / glamod-ingest

Database preparation and ingestion for GLAMOD
BSD 2-Clause "Simplified" License
2 stars 1 forks source link

Experiment with Postgres to find out how to convert all times to UTC #75

Closed agstephens closed 3 years ago

agstephens commented 3 years ago

In the current r2.0 contents of the GLAMOD DB, we have some strange date/times that include a time zone. Where they have been read in by Pandas they were somehow modified. This relates to:

https://github.com/glamod/glamod-ingest/issues/73

What we need to do with the existing Postgres content is:

  1. Change column type from: timestamp with timezone to timestamp without timezone (https://www.postgresql.org/docs/11/datatype-datetime.html)
  2. Fix the datetime values so that they are correct, i.e.: if there is a timezone value given - then convert it to UTC.

@jhaigh0 Please can you create you own test Postgres DB and create a simple table with a single column date_time - populate it with some values and then do some googling and testing to find out the best way for us to do the datetime conversion so that we end up removing the timezone using a simple column type: timestamp without timezone

Thanks

jhaigh0 commented 3 years ago

Just going to compile some links here.

agstephens commented 3 years ago

@jhaigh0 looks promising!

agstephens commented 3 years ago

Possible test/solution, try writing a script of SQL commands that you can run with psql ... -f test-script.sql:

  1. Set timezone 'London/Europe'
  2. Create table
  3. Create record with offset
  4. Select all - to view record
  5. Set timezone 'UTC'
  6. Alter column to remove timezone OR do it a different way
  7. Select all - to view the record
  8. Drop the table
agstephens commented 3 years ago

@jhaigh0: thanks for looking into this.

I have just run a "SELECT" command as mentioned in #73. The funny thing is, now we have (re-)set the timezone, the result is actually different. It actually looks like setting UTC fixes the queries anyway:

cdm=> SHOW timezone; 
TimeZone 
----------
 UTC
(1 row)

cdm=> select station_name, date_time from lite_2_0.observations_1755_land_0 WHERE date_time BETWEEN '1755-01-01 06:00:00'::timestamptz AND '1755-01-03 07:00:00+00:00'::timestamptz LIMIT 1;
 station_name |       date_time        
--------------+------------------------
 BASEL        | 1755-01-01 06:29:00+00
(1 row)

Please see the top-panel in #73 to see how this is different to query I ran then (with TimeZone: 'Europe/London'

So, the situation might be as simple as altering the column type. Please can you test that this will work if we alter the column type to remove the time zone. Hopefully it simplifies your test even more. Please document your testing here. Thanks

jhaigh0 commented 3 years ago

This is what I've found that seems work;

First set up a test table like the real data

abcunit_jh=> set timezone to 'Europe/London';
SET
abcunit_jh=> CREATE TABLE time_data(id serial PRIMARY KEY, dates TIMESTAMPTZ NOT NULL);
CREATE TABLE
abcunit_jh=> INSERT INTO time_data (dates) VALUES('1800-06-22 19:00:00');
INSERT 0 1
abcunit_jh=> SELECT * FROM time_data;
 id |            dates
----+------------------------------
  1 | 1800-06-22 19:00:00-00:01:15
(1 row)

Then remove timezone info

abcunit_jh=> ALTER TABLE time_data ALTER COLUMN dates TYPE timestamp;
ALTER TABLE
abcunit_jh=> SELECT * FROM time_data;
 id |        dates
----+---------------------
  1 | 1800-06-22 19:00:00
(1 row)

so we should just be able to run this alter table command on the columns that have the problem, if our end goal is to remove timezone info and just assume all is in utc. @agstephens

agstephens commented 3 years ago

Thanks @jhaigh0, that looks great. It looks like it is actually easier to solve than I thought. I'll have a think about how/whether we need to make a change for r2.0. For r3.0, we'll build it with UTC only.

agstephens commented 3 years ago

At present, the global timezone change to UTC seems to have fixed the issue that was raised by the CDS Team. I have asked them if we should completely remove the timezone from the returned data. I would propose that that would be the best option, for simplicity and future usage.

agstephens commented 3 years ago

No further action needed on this.