glamod / glamod-nuim

NUIM code in support of GLAMOD
BSD 3-Clause "New" or "Revised" License

Duplicate observations in CDM lite daily files #78

Closed sjnoone closed 1 year ago

sjnoone commented 1 year ago

It looks like the daily_to_cdm_lite.py code is writing out duplicate observations with the same date timestamp. It happens when a station has observations merged from two different stations with the same source_ID. For example, the daily station KZ000028676.csv below is the mingle extract from GHCND:

KZ000028676 6   166 166 229 229 229 238 28676099999 28679099999 000RR114074 000TX114073 000TN114071 28679

In this case this station is merged from 6 stations: 2 from source 166, 3 from source 229 and 1 from source 238.

The corresponding entries in record_ID.csv have 6 unique record_numbers based on the mingle information:

| primary_station_id_3 | primary_station_id_2 | station_id | record_number | station_name | longitude | latitude | height_of_station_above_sea_level | sub_region | region | data_policy_licence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KZ000028676-166-5 | KZ000028676-166 | KZ000028676 | 5 | PETROPAVLOVSK | 69.15 | 54.833 | 142 | 124 | 2 | 1 |
| KZ000028676-166-6 | KZ000028676-166 | KZ000028676 | 6 | PETROPAVLOVSK | 69.15 | 54.833 | 142 | 124 | 2 | 1 |
| KZ000028676-229-2 | KZ000028676-229 | KZ000028676 | 2 | PETROPAVLOVSK | 69.15 | 54.833 | 142 | 124 | 2 | 1 |
| KZ000028676-229-4 | KZ000028676-229 | KZ000028676 | 4 | PETROPAVLOVSK | 69.15 | 54.833 | 142 | 124 | 2 | 1 |
| KZ000028676-229-3 | KZ000028676-229 | KZ000028676 | 3 | PETROPAVLOVSK | 69.15 | 54.833 | 142 | 124 | 2 | 1 |
| KZ000028676-238-1 | KZ000028676-238 | KZ000028676 | 1 | PETROPAVLOVSK | 69.15 | 54.833 | 142 | 124 | 2 | 1 |

I think the problem is that the code concatenates the source_id and the primary_ID (station_ID) from station.csv to create primary_station_id_2, which is then used to join to record_ID.csv to add the data_policy etc. information. See the extract below:

     df["primary_station_id"] = df["Station_ID"]
    # Concatenate columns for joining dataframe in next step
    df['source_id'] = df['source_id'].astype(str).apply(lambda x: x.replace('.0', ''))
    df['primary_station_id_2'] = df['primary_station_id'].astype(str) + '-' + df['source_id'].astype(str)
    df['primary_station_id_2'] = df['primary_station_id_2'].astype(str)
    df["observation_value"] = pd.to_numeric(df["observation_value"], errors='coerce')

So I think the code sees the created primary_station_id_2 in the df and joins the attributes for both of that source's record_numbers (as found in the record_ID.csv primary_station_id_2 column) onto each observation, duplicating it.
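If that diagnosis is right, a minimal self-contained sketch reproduces it (the values mirror the KZ000028676 example above and are purely illustrative; this is not the actual daily_to_cdm_lite.py code):

```python
import pandas as pd

# One observation from source 166 for this station on a given day.
obs = pd.DataFrame({
    "primary_station_id_2": ["KZ000028676-166"],
    "date_time": ["1985-01-01"],
    "observed_variable": ["snow_depth"],
    "observation_value": [120.0],
})

# record_ID.csv carries two rows for source 166 (record_numbers 5 and 6)
# because the station was merged from two different source-166 stations.
records = pd.DataFrame({
    "primary_station_id_2": ["KZ000028676-166"] * 2,
    "record_number": [5, 6],
    "data_policy_licence": [1, 1],
})

# A plain left join on this non-unique key fans the single observation out
# into two rows that differ only in record_number -- the duplicate we see.
merged = obs.merge(records, on="primary_station_id_2", how="left")
print(merged)
```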

A solution might be to add a piece of code at the end of daily_to_cdm_lite.py and daily_to_obs_table_v1.py that removes all duplicate observed values: if date_time + observed_variable + observation_value is duplicated, remove one of the entries.
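A minimal sketch of that de-duplication step, assuming pandas and the column names used in the cdm_lite output (keeping the first entry is my assumption):

```python
import pandas as pd

def drop_duplicate_obs(df: pd.DataFrame) -> pd.DataFrame:
    # Keep one row per (date_time, observed_variable, observation_value)
    # combination; any exact repeats of an observation are dropped.
    return df.drop_duplicates(
        subset=["date_time", "observed_variable", "observation_value"],
        keep="first",
    )
```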

Or it might be the way the code is joining the two dataframes?
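If so, pandas can be asked to fail loudly instead of silently fanning out rows: the merge's validate argument checks that the record side is unique on the join key (here `records` stands for the loaded record_ID.csv):

```python
# Raises pandas.errors.MergeError if record_ID.csv contains more than one
# row per primary_station_id_2, instead of silently duplicating observations.
merged = df.merge(
    records,
    on="primary_station_id_2",
    how="left",
    validate="many_to_one",
)
```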

Or I could edit record_ID.csv so that, if a station has been merged from two different stations with the same source, only one entry for that source is present. For example, KZ000028676 would look like this in record_ID.csv (a possible script is sketched after the table). Much more work!

| primary_station_id_3 | primary_station_id_2 | station_id | record_number | station_name | longitude | latitude | height_of_station_above_sea_level | sub_region | region | data_policy_licence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KZ000028676-166-3 | KZ000028676-166 | KZ000028676 | 3 | PETROPAVLOVSK | 69.15 | 54.833 | 142 | 124 | 2 | 1 |
| KZ000028676-229-2 | KZ000028676-229 | KZ000028676 | 2 | PETROPAVLOVSK | 69.15 | 54.833 | 142 | 124 | 2 | 1 |
| KZ000028676-238-1 | KZ000028676-238 | KZ000028676 | 1 | PETROPAVLOVSK | 69.15 | 54.833 | 142 | 124 | 2 | 1 |
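A sketch of how record_ID.csv could be thinned to one row per station/source pair (the file path and the choice to keep the lowest record_number are assumptions, and the record numbers may still need renumbering afterwards, as in the table above):

```python
import pandas as pd

records = pd.read_csv("record_ID.csv")

# Keep a single row per primary_station_id_2 (i.e. per station/source pair),
# preferring the row with the lowest record_number.
deduped = (
    records.sort_values("record_number")
           .drop_duplicates(subset="primary_station_id_2", keep="first")
)

deduped.to_csv("record_ID_deduped.csv", index=False)
```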

BUT this way means less traceability in the station_configuration file, which shows the original merged station IDs for each separate record_number entry. That file is created by config_extract.py, which we run on the outputted cdm_lite files, so it is based on what has been written out:

| primary_id | primary_id_scheme | record_number | secondary_id | secondary_id_scheme | station_name | source_ID |
| --- | --- | --- | --- | --- | --- | --- |
| KZ000028676 | 13 | 1 | 28679 | 3 | PETROPAVLOVSK | 238 |
| KZ000028676 | 13 | 5 | 28679099999 | 3 | PETROPAVLOVSK | 166 |
| KZ000028676 | 13 | 6 | 28676099999 | 3 | PETROPAVLOVSK | 166 |

NOTE: the issue only seems to have occurred with snow depth at this station, and I am not sure why! It could be a different variable at other stations with similar merge/source combinations.

What do you think? It's a tricky one!

rjhd2 commented 1 year ago

Right, I've formatted your issue a little, which might help in working through it. I think we may need to chat about this when you're back in, to help me work out what we want as the output and how this isn't meeting that need.

sjnoone commented 1 year ago

Hi, I have had time to think about this issue and the best option is to remove all the duplicated primary_station_id_2 rows from record_ID.csv. It won't make any difference, because the code will still pick up all the observations and allocate them to the correct source. We will never be able to say which station_ID the observations derive from anyway, because the ghcnd.csv files do not provide this information. There are 4744 stations merged from multiple stations with the same source, so I plan to remove the duplicates and re-run these using the clobber option in the code. This will remove any duplicated observations in the CDM output files. Does this sound like a good plan?

sjnoone commented 1 year ago

Is this the correct code to overwrite existing files: `python3 daily_to_cdm_lite_v1.py --station --clobber FMW00040308`?

rjhd2 commented 1 year ago

I think you solved this issue by re-running with an updated record_id.csv. Did that work, and can we close this issue?

sjnoone commented 1 year ago

Yes this is fixed now.


rjhd2 commented 1 year ago

Closed