glamod / glamod-ingest

Database preparation and ingestion for GLAMOD
BSD 2-Clause "Simplified" License
2 stars 1 forks source link

Overwrite `data_policy_licence` values in CDMLite land data #21

Closed agstephens closed 3 years ago

agstephens commented 4 years ago

Progress checker:

For LAND data only in the CDMLite:

We need to get the primary_id and record_number as the first two components of the observation_id.

We look up the matching record in station_configuration that matches both primary_id and record_number.

From that we get the data_policy_licence value, which we then overwrite into the CDMLite record.

See full details below...

agstephens commented 4 years ago

We have a couple of small CSV files that contain station configuration settings where duplicate primary ids existed:

These separate configuration tables are ONLY FOR USE when processing for this fix. They will not be loaded into the CDM. They have been produced to overcome the issue we are having with duplicate primary_id fields across timescales. The observing_frequency field defines the timescales in the station_configuration table so it's not an issue to use the full station_configuration table already on the GWS for this release, we just need the separated tables to resolve the current data policy issues.

The fix

Part 1: Add source_id to CDM_Lite tables

1a. Update the schema to include source_id

1b. Write a function to extract the source_id from the observation_id

The observation_id can be split into:

E.g.: AFI0000OAHR-6-1973-01-01-00:00-85-12:

primary_id, record_number = observation_id.split('-', 2)[:2]

We then match these in the corresponding timescale (e.g daily/monthy OR sub-daily) station_configuration table record to find the source_id.

source_id = get_source_id_from_temporary_station_configs(primary_id, record_number, frequency)

And write that source_id into the cdm.lite.observations.source_id field in the DB.

NOTE: We only need the source_id for the product and citation lookups.

Part 2: Update the data_policy in the CDM_Lite tables

To overwrite and update the existing data_policy_licence field in the CDMLite: use the first part of the observation_id (e.g. AFI0000OAHR-6) plus the source_id to match with the combination of the primary_id and source_id in the corresponding frequency (e.g daily/monthy OR sub-daily) station configuration table to find the updated data_policy_licence value.

primary_id = observation_id.split('-')[0]
data_policy_licence = get_data_policy_licence_from_temporary_station_configs(primary_id, 
                                                                            source_id, frequency)

Then update the CDMLite record.

agstephens commented 3 years ago

@sjnoone: I realise that I had always assumed this fix should be done after the data was extracted by the web-app that queries the CDM.

Would it have been better to have to modify the data before loading it into the database?

Given that I have missed the opportunity to do that, does it make sense to run some code over the whole database to fix each record? It'll take a while but is doable.

sjnoone commented 3 years ago

Hi Ag,

I would have thought it would be best to do it before ingestion. Yep I would run the code and fix the data policy in the whole database.

Best S

Dr. Simon Noone


Postdoctoral Researcher

https://orcid.org/0000-0003-1661-1423

https://www.maynoothuniversity.ie/icarus/our-people/simon-noone

The Copernicus Climate Change Service C3S 311a Lot 2 Global Land and Marine Observations Database [cid:image001.png@01D578FE.74A75F90]

Associated expert for the WMO Commission for Climatology Working Group on Climate Data Exchange (WG-CDE)

Irish Climate Analysis and Research Units (ICARUS) Room 1.8,Laraghbryan House,

North campus, Maynooth University,

Maynooth, Kildare

Ireland

01-7086392 https://www.maynoothuniversity.ie/icarus/our-people/simon-noone [cid:image002.png@01D578FE.74A75F90] [cid:image003.png@01D578FE.74A75F90]

From: Ag Stephensmailto:notifications@github.com Sent: Thursday 19 November 2020 21:30 To: glamod/glamod-ingestmailto:glamod-ingest@noreply.github.com Cc: Simon Noonemailto:Simon.Noone@mu.ie; Mentionmailto:mention@noreply.github.com Subject: [EXTERNAL] Re: [glamod/glamod-ingest] Overwrite data_policy_licence values in CDMLite land data (#21)

Warning

This email originated from outside of Maynooth University's Mail System. Do not reply, click links or open attachments unless you recognise the sender and know the content is safe.

@sjnoonehttps://github.com/sjnoone: I realise that I had always assumed this fix should be done after the data was extracted by the web-app that queries the CDM.

Would it have been better to have to modify the data before loading it into the database?

Given that I have missed the opportunity to do that, does it make sense to run some code over the whole database to fix each record? It'll take a while but is doable.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/glamod/glamod-ingest/issues/21#issuecomment-730648733, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AG3VGPNB4BXYJW6JMGZKOBLSQWFAVANCNFSM4PCPTLMA.

agstephens commented 3 years ago

Thanks @sjnoone,

It's a shame I didn't spot that before. But I can get it done. Cheers

agstephens commented 3 years ago

Part 2: Update the data_policy in the CDM_Lite tables

Here is the plan for batch processing the contents of the land CDMLite, to overwrite the data_policy_licence.

NOTE: ideally, this would have been done in the restructure-land.py script.

We can re-use this code:

https://github.com/glamod/glamod-ingest/blob/master/scripts/land/restructure-land.py#L179

For each record we will:

Since we need to do this on all land partitions, we need a strategy:

sjnoone commented 3 years ago

Hi Ag,

That looks good to me.

Best S

Dr. Simon Noone


Postdoctoral Researcher

https://orcid.org/0000-0003-1661-1423

https://www.maynoothuniversity.ie/icarus/our-people/simon-noone

The Copernicus Climate Change Service C3S 311a Lot 2 Global Land and Marine Observations Database [cid:image001.png@01D578FE.74A75F90]

Associated expert for the WMO Commission for Climatology Working Group on Climate Data Exchange (WG-CDE)

Irish Climate Analysis and Research Units (ICARUS) Room 1.8,Laraghbryan House,

North campus, Maynooth University,

Maynooth, Kildare

Ireland

01-7086392 https://www.maynoothuniversity.ie/icarus/our-people/simon-noone [cid:image002.png@01D578FE.74A75F90] [cid:image003.png@01D578FE.74A75F90]

From: Ag Stephensmailto:notifications@github.com Sent: Friday 20 November 2020 08:51 To: glamod/glamod-ingestmailto:glamod-ingest@noreply.github.com Cc: Simon Noonemailto:Simon.Noone@mu.ie; Mentionmailto:mention@noreply.github.com Subject: [EXTERNAL] Re: [glamod/glamod-ingest] Overwrite data_policy_licence values in CDMLite land data (#21)

Warning

This email originated from outside of Maynooth University's Mail System. Do not reply, click links or open attachments unless you recognise the sender and know the content is safe.

Part 2: Update the data_policy in the CDM_Lite tables

Here is the plan for batch processing the contents of the land CDMLite, to overwrite the data_policy_licence.

NOTE: ideally, this would have been done in the restructure-land.py script.

We can re-use this code:

https://github.com/glamod/glamod-ingest/blob/master/scripts/land/restructure-land.py#L179

For each record we will:

Since we need to do this on all land partitions, we need a strategy

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/glamod/glamod-ingest/issues/21#issuecomment-731033105, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AG3VGPKPZ5XHJYPZIMQOLLLSQYUXJANCNFSM4PCPTLMA.

agstephens commented 3 years ago

@agstephens Need to make this part of restructure-land.py.

agstephens commented 3 years ago

This has now been implemented as part of the main processing chain, see:

https://github.com/glamod/glamod-ingest/blob/master/scripts/land/restructure-land.py#L227