Closed agstephens closed 3 years ago
We have a couple of small CSV files that contain station configuration settings where duplicate primary ids existed:
daily_monthly_station_config_27_10_2020.psv
sub_daily_station_config_file_30_07_20.psv
These separate configuration tables are ONLY FOR USE when processing for this fix. They will not be loaded into the CDM.
They have been produced to overcome the issue we are having with duplicate primary_id
fields across timescales. The observing_frequency
field defines the timescales in the station_configuration
table so it's not an issue to use the full station_configuration
table already on the GWS for this release, we just need the separated tables to resolve the current data policy issues.
source_id
to CDM_Lite tablessource_id
The observation_id
can be split into:
primary_id
record_number
E.g.: AFI0000OAHR-6-1973-01-01-00:00-85-12:
primary_id
: "AFI0000OAHR"record_number
: "6"primary_id, record_number = observation_id.split('-', 2)[:2]
We then match these in the corresponding timescale (e.g daily/monthy OR sub-daily) station_configuration
table record to find the source_id
.
source_id = get_source_id_from_temporary_station_configs(primary_id, record_number, frequency)
And write that source_id
into the cdm.lite.observations.source_id
field in the DB.
NOTE: We only need the source_id
for the product and citation lookups.
To overwrite and update the existing data_policy_licence
field in the CDMLite: use the first part of the observation_id
(e.g. AFI0000OAHR-6
) plus the source_id
to match with the combination of the primary_id
and source_id
in the corresponding frequency (e.g daily/monthy OR sub-daily) station configuration table to find the updated data_policy_licence
value.
primary_id = observation_id.split('-')[0]
data_policy_licence = get_data_policy_licence_from_temporary_station_configs(primary_id,
source_id, frequency)
Then update the CDMLite record.
@sjnoone: I realise that I had always assumed this fix should be done after the data was extracted by the web-app that queries the CDM.
Would it have been better to have to modify the data before loading it into the database?
Given that I have missed the opportunity to do that, does it make sense to run some code over the whole database to fix each record? It'll take a while but is doable.
Hi Ag,
I would have thought it would be best to do it before ingestion. Yep I would run the code and fix the data policy in the whole database.
Best S
Dr. Simon Noone
Postdoctoral Researcher
https://orcid.org/0000-0003-1661-1423
https://www.maynoothuniversity.ie/icarus/our-people/simon-noone
The Copernicus Climate Change Service C3S 311a Lot 2 Global Land and Marine Observations Database [cid:image001.png@01D578FE.74A75F90]
Associated expert for the WMO Commission for Climatology Working Group on Climate Data Exchange (WG-CDE)
Irish Climate Analysis and Research Units (ICARUS) Room 1.8,Laraghbryan House,
North campus, Maynooth University,
Maynooth, Kildare
Ireland
01-7086392 https://www.maynoothuniversity.ie/icarus/our-people/simon-noone [cid:image002.png@01D578FE.74A75F90] [cid:image003.png@01D578FE.74A75F90]
From: Ag Stephensmailto:notifications@github.com
Sent: Thursday 19 November 2020 21:30
To: glamod/glamod-ingestmailto:glamod-ingest@noreply.github.com
Cc: Simon Noonemailto:Simon.Noone@mu.ie; Mentionmailto:mention@noreply.github.com
Subject: [EXTERNAL] Re: [glamod/glamod-ingest] Overwrite data_policy_licence
values in CDMLite land data (#21)
Warning
This email originated from outside of Maynooth University's Mail System. Do not reply, click links or open attachments unless you recognise the sender and know the content is safe.
@sjnoonehttps://github.com/sjnoone: I realise that I had always assumed this fix should be done after the data was extracted by the web-app that queries the CDM.
Would it have been better to have to modify the data before loading it into the database?
Given that I have missed the opportunity to do that, does it make sense to run some code over the whole database to fix each record? It'll take a while but is doable.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/glamod/glamod-ingest/issues/21#issuecomment-730648733, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AG3VGPNB4BXYJW6JMGZKOBLSQWFAVANCNFSM4PCPTLMA.
Thanks @sjnoone,
It's a shame I didn't spot that before. But I can get it done. Cheers
Here is the plan for batch processing the contents of the land CDMLite, to overwrite the data_policy_licence
.
NOTE: ideally, this would have been done in the restructure-land.py
script.
We can re-use this code:
https://github.com/glamod/glamod-ingest/blob/master/scripts/land/restructure-land.py#L179
For each record we will:
primary_id
from the observation_id
source_id
data_policy_licence
value from the station configuration tabledata_policy_licence
in the record with the new one from station configuration tableSince we need to do this on all land partitions, we need a strategy:
primary_id
valuesprimary_id
value:
data_policy_licence
data_policy_licence
WHERE observation_id
STARTS WITH primary_id
Hi Ag,
That looks good to me.
Best S
Dr. Simon Noone
Postdoctoral Researcher
https://orcid.org/0000-0003-1661-1423
https://www.maynoothuniversity.ie/icarus/our-people/simon-noone
The Copernicus Climate Change Service C3S 311a Lot 2 Global Land and Marine Observations Database [cid:image001.png@01D578FE.74A75F90]
Associated expert for the WMO Commission for Climatology Working Group on Climate Data Exchange (WG-CDE)
Irish Climate Analysis and Research Units (ICARUS) Room 1.8,Laraghbryan House,
North campus, Maynooth University,
Maynooth, Kildare
Ireland
01-7086392 https://www.maynoothuniversity.ie/icarus/our-people/simon-noone [cid:image002.png@01D578FE.74A75F90] [cid:image003.png@01D578FE.74A75F90]
From: Ag Stephensmailto:notifications@github.com
Sent: Friday 20 November 2020 08:51
To: glamod/glamod-ingestmailto:glamod-ingest@noreply.github.com
Cc: Simon Noonemailto:Simon.Noone@mu.ie; Mentionmailto:mention@noreply.github.com
Subject: [EXTERNAL] Re: [glamod/glamod-ingest] Overwrite data_policy_licence
values in CDMLite land data (#21)
Warning
This email originated from outside of Maynooth University's Mail System. Do not reply, click links or open attachments unless you recognise the sender and know the content is safe.
Part 2: Update the data_policy in the CDM_Lite tables
Here is the plan for batch processing the contents of the land CDMLite, to overwrite the data_policy_licence.
NOTE: ideally, this would have been done in the restructure-land.py script.
We can re-use this code:
https://github.com/glamod/glamod-ingest/blob/master/scripts/land/restructure-land.py#L179
For each record we will:
Since we need to do this on all land partitions, we need a strategy
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/glamod/glamod-ingest/issues/21#issuecomment-731033105, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AG3VGPKPZ5XHJYPZIMQOLLLSQYUXJANCNFSM4PCPTLMA.
@agstephens Need to make this part of restructure-land.py
.
This has now been implemented as part of the main processing chain, see:
https://github.com/glamod/glamod-ingest/blob/master/scripts/land/restructure-land.py#L227
Progress checker:
source_id
For LAND data only in the CDMLite:
We need to get the
primary_id
andrecord_number
as the first two components of theobservation_id
.We look up the matching record in
station_configuration
that matches bothprimary_id
andrecord_number
.From that we get the
data_policy_licence
value, which we then overwrite into the CDMLite record.See full details below...