Closed sjnoone closed 1 year ago
Right, I've formatted your issue a little, which might help in working through it. I think we may need to chat about this when you're back in, to help me work out what we want as the output, and how this isn't meeting that need.
Hi, I have had time to think about this issue and the best option is to remove all the duplicated primary_station_id2 rows from the record_id.csv. It won't make any difference because the code will still pick up all the observations and allocate them to the correct source. We will never be able to provide which station_ID the observations derive from, because the ghcnd.csv file does not provide this info. There are 4744 stations with stations merged from the same source, so I plan to remove the duplicates and re-run these using the clobber option in the code. This will remove any duplicated observations in the CDM output files. Does this sound like a good plan?
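For concreteness, a minimal sketch of that dedup step, assuming `record_id.csv` loads into pandas and that the duplicated column is the primary_station_id2 discussed below (both assumptions, not verified against the real file):

```python
import pandas as pd

# Load the record catalogue; the filename and column name are taken from
# the discussion in this issue, not from the actual repository code.
record_id = pd.read_csv("record_id.csv")

# Keep only the first row for each primary_station_id2, so each
# source/station combination contributes a single record_number.
deduped = record_id.drop_duplicates(subset="primary_station_id2", keep="first")

# Written to a separate file here for safety; overwriting record_id.csv
# in place would also work once the result has been checked.
deduped.to_csv("record_id_deduped.csv", index=False)
```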
Is this the correct command to overwrite existing files: `python3 daily_to_cdm_lite_v1.py --station --clobber FMW00040308`?
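For reference, a hypothetical sketch of how a `--station`/`--clobber` interface like the command above is often wired up with argparse; the real script's argument handling may well differ:

```python
import argparse

# Hypothetical CLI wiring, inferred from the command quoted above; the
# actual daily_to_cdm_lite_v1.py may define its arguments differently.
parser = argparse.ArgumentParser(description="Convert daily files to CDM lite")
parser.add_argument("--station", action="store_true",
                    help="process a single station rather than the full set")
parser.add_argument("--clobber", action="store_true",
                    help="overwrite existing output files")
parser.add_argument("station_id", help="station ID, e.g. FMW00040308")
args = parser.parse_args()

if args.station and args.clobber:
    # Existing CDM-lite output for this station would be overwritten here.
    print(f"Re-processing {args.station_id}, overwriting existing output")
```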
I think you solved this issue by re-running with an updated record_id.csv. Did that work, and can we close this issue?
Yes this is fixed now.
It looks like the `daily_to_cdm_lite.py` code is writing out duplicate observations with the same date timestamp. It happens when stations have observations merged from two different stations with the same source_ID, e.g. the daily station KZ000028676.csv below is the mingle extract from GHCND.
In this case this station is merged from 6 stations: 2 from source 166, 3 from source 229 and 1 from source 238.
The subsequent entry in `record_ID.csv` has 6 unique record_numbers based on the mingle information.

The problem is that I think the code is concatenating the source_id + the primary_ID (station_ID) from `station.csv` to create the primary_station_ID2 for joining to `record_ID.csv`, in order to add the data_policy etc. information. See extract below. So I think the code is seeing the created primary_station_ID2 in the df and joining all the attributes to both of the station_ID record_numbers for that one source, as found in the `record_ID.csv` primary_station_id2 column.
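To make the suspected mechanism concrete, here is a toy illustration (all values invented; the concatenation order and column names are assumptions based on the description above): one observation row joins to two `record_ID.csv` rows that share the same primary_station_id2, so the observation comes out twice.

```python
import pandas as pd

# Toy data: one observation, and two record_ID.csv rows that share the
# same constructed primary_station_id2 (all values are invented).
obs = pd.DataFrame({
    "primary_station_id2": ["229KZ000028676"],
    "date_time": ["2000-01-01"],
    "observed_variable": ["snow_depth"],
    "observed_value": [12.0],
})
record_id = pd.DataFrame({
    "primary_station_id2": ["229KZ000028676", "229KZ000028676"],
    "record_number": [3, 4],
    "data_policy": ["open", "open"],
})

# The join key is duplicated on the right-hand side, so the single
# observation is written out twice, once per record_number.
print(obs.merge(record_id, on="primary_station_id2"))
```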
A solution might be to add a piece of code at the end of `daily_to_cdm_lite.py` and `daily_to_obs_table_v1.py` that removes all duplicate observed values: if date_time + observed_variable + observed_value is duplicated, then remove one entry, as sketched below. Or it might be the way the code is joining the two dataframes?
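A minimal sketch of that dedup step, assuming the output is held in a pandas DataFrame with those CDM-lite column names (an assumption; the real scripts may structure this differently):

```python
import pandas as pd

def drop_duplicate_obs(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first row for each (date_time, observed_variable,
    observed_value) triple, dropping the duplicated observations."""
    return df.drop_duplicates(
        subset=["date_time", "observed_variable", "observed_value"],
        keep="first",
    )
```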
Alternatively, I could edit the `record_id.csv` so that if a station has been merged from two different stations with the same source, only one entry for that source is present; for example, KZ000028676 would look like this in `record_ID.csv`. Much more work! BUT this way means less traceability in the `station_configuration` file, which shows the original merged station IDs for each separate record_number entry. That file is created by the `config_extract.py` that we run on the outputted `cdm_lite` files, and is based on what has been written out.

NOTE: the issue only seems to have occurred with snow depth at this station, and I am not sure why! It could be a different variable at other stations with similar merge/source combinations.
What do you think? It's a tricky one!