glamod / glamod-ingest

Database preparation and ingestion for GLAMOD
BSD 2-Clause "Simplified" License

Data Error: station_name with whitespace in the middle #38

Closed: agstephens closed 3 years ago

agstephens commented 3 years ago

Hi @sjnoone, I just looked at the first gzipped PSV file to restructure it, and I saw this:

$ cat /gws/nopw/j04/c3s311a_lot2/data/level2/land/r202005/cdm_lite/daily/CDM_lite_SecondRelease_AF000040930.psv.gz | gunzip -c | head -3
observation_id|report_type|date_time|date_time_meaning|latitude|longitude|observation_height_above_station_surface|observed_variable|units|observation_value|value_significance|observation_duration|platform_type|station_type|primary_station_id|station_name|quality_flag|data_policy_licence 
AF000040930-2-1976-01-01-85-2|3|1976-01-01 00:00:00+00|1|35.317|69.017|2|85|005|265.65|2|13||1|AF000040930|NORTH-SALANG                   GSN|0|1
AF000040930-1-1976-01-01-85-1|3|1976-01-01 00:00:00+00|1|35.317|69.017|2|85|005|265.15|1|13||1|AF000040930|NORTH-SALANG                   GSN|0|1

And:

$ cat /gws/nopw/j04/c3s311a_lot2/data/level2/land/r202005/cdm_lite/daily/CDM_lite_SecondRelease_AG000060611.psv.gz | gunzip -c | head -4
observation_id|report_type|date_time|date_time_meaning|latitude|longitude|observation_height_above_station_surface|observed_variable|units|observation_value|value_significance|observation_duration|platform_type|station_type|primary_station_id|station_name|quality_flag|data_policy_licence 
AG000060611-2-1971-01-01-44-13|3|1971-01-01 00:00:00+00|1|28.05|9.6331|1|44|710|0.0|13|13||1|AG000060611|IN-AMENAS                      GSN|0|1
AG000060611-1-1972-01-01-85-1|3|1972-01-01 00:00:00+00|1|28.05|9.6331|2|85|005|279.15|1|13||1|AG000060611|IN-AMENAS                      GSN|0|1
AG000060611-1-1972-01-01-85-0|3|1972-01-01 00:00:00+00|1|28.05|9.6331|2|85|005|298.05|0|13||1|AG000060611|IN-AMENAS                      GSN|0|1

The station_name field has big chunks of white space in the middle of the string.

What should the correct fix be for this? Should there ever be any white space in station_name? If not, then I can just do something like this to fix it:

df['station_name'] = df['station_name'].str.replace(' ', '')

Thanks

agstephens commented 3 years ago

Another file has the station name "ST JOHNS", so whitespace is evidently allowed. Should I just replace each run of multiple white spaces with a single white space? Thanks

sjnoone commented 3 years ago

FINAL DECISION:

Replace any instance of multiple white spaces with a single white space.
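
For illustration only, here is a minimal sketch of that rule in Python. It is not the code from the fixing commit; the helper name `collapse_whitespace` is hypothetical, and treating tabs and other whitespace characters the same as spaces is an assumption. Leading and trailing whitespace is deliberately left alone, since the decision above only covers runs of whitespace.

```python
import re

# Hypothetical helper, not taken from glamod-ingest.
# Assumption: any run of whitespace characters (spaces, tabs, ...)
# counts as "multiple white spaces" and becomes one space.
_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(name: str) -> str:
    """Replace every run of whitespace in `name` with a single space."""
    return _WHITESPACE_RUN.sub(" ", name)


if __name__ == "__main__":
    # Station names as they appear in the PSV samples above.
    for raw in ("NORTH-SALANG                   GSN", "ST JOHNS"):
        print(repr(collapse_whitespace(raw)))
```

On a pandas DataFrame the same idea could be applied column-wide with `df['station_name'] = df['station_name'].str.replace(r'\s+', ' ', regex=True)`, which keeps single spaces such as the one in "ST JOHNS" intact.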

agstephens commented 3 years ago

Fixed in commit: 4cc1f8b9f47079728adbb8eb711fd6e4b76a396d

agstephens commented 3 years ago

All done.