Closed j23414 closed 8 months ago
Before addressing the refactor of create_lat_longs.py, it was necessary to clean up the region, country, division, and location metadata fields for USA records.
For future reference, the process is documented below:
We usually provide geolocation rules in the ingest/source-data/geolocation-rules.tsv file, where the left column is the pattern that identifies a geolocation problem and the right column is the corrected value.
To make geolocation problems easier to spot, and to generate candidate geolocation rules that fix them, you can run:
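# Columns 3-6 are expected to be region, country, division, location.
# Keep the header row and USA rows; print the raw value and a division/location-swapped value.
awk -F'\t' -v OFS='/' 'NR==1 || $4=="USA" {print $3,$4,$5,$6"\t"$3,$4,$6,$5}' data/metadata_all.tsv \
| sort -u \
> potential_geolocation_rules.txt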
Then look through potential_geolocation_rules.txt, manually edit it, and add any necessary rules to ingest/source-data/geolocation-rules.tsv.
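For anyone who prefers to stay in Python, the candidate-rule step above can be sketched roughly as follows. This is only an illustration, not part of the pipeline: the function name potential_rules is hypothetical, and it assumes (like the awk command) that columns 3 to 6 of the metadata are region, country, division, and location.

```python
import csv
import io


def potential_rules(tsv_text, country="USA"):
    """Return sorted, de-duplicated candidate rules as 'raw<TAB>swapped' strings.

    Assumes columns 3-6 (1-based) are region, country, division, location.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rules = set()
    for line_number, row in enumerate(reader, start=1):
        if len(row) < 6:
            continue  # skip short or blank lines
        region, row_country, division, location = row[2:6]
        # Keep the header row (line 1) and every row for the requested country.
        if line_number == 1 or row_country == country:
            raw = "/".join([region, row_country, division, location])
            swapped = "/".join([region, row_country, location, division])
            rules.add(f"{raw}\t{swapped}")
    return sorted(rules)


if __name__ == "__main__":
    with open("data/metadata_all.tsv", encoding="utf-8") as handle:
        for rule in potential_rules(handle.read()):
            print(rule)
```

The output still needs the same manual review as the awk version; Python's default string ordering may also differ from what sort produces under your locale.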
1) State was listed in location instead of division: fix by swapping the location and division fields
# region/country/division/location region/country/location/division
North America/USA/Agoura_Hills/California North America/USA/California/Agoura_Hills # Swapped
2) State only listed as an abbreviation: fix by manually adding the appropriate state name
# region/country/division/location region/country/location/division
North America/USA/Amarillo Tx/ North America/USA//Amarillo Tx # This is the original potential rule
North America/USA/Amarillo Tx/ North America/USA/Texas/Amarillo Tx # This is the manually fixed rule, by adding "Texas"
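A small helper can suggest the state name for cases like this, although the result should still be checked by hand. The sketch below is an assumption-laden illustration: STATE_NAMES, suggest_division, and fix_rule are hypothetical names, and the lookup table is deliberately partial.

```python
# Deliberately partial lookup for illustration; a real table would cover every state.
STATE_NAMES = {
    "CA": "California",
    "CT": "Connecticut",
    "TX": "Texas",
    "WA": "Washington",
}


def suggest_division(raw_division):
    """Guess the full state name from a trailing two-letter token, else return None."""
    tokens = raw_division.replace(",", " ").split()
    if not tokens:
        return None
    return STATE_NAMES.get(tokens[-1].upper())


def fix_rule(raw):
    """Build a 'raw<TAB>fixed' rule for a region/country/division/location value
    whose division holds a city plus a state abbreviation."""
    region, country, division, _location = raw.split("/")
    state = suggest_division(division)
    if state is None:
        return None  # nothing to suggest; leave this one for manual curation
    # The original division text becomes the location, as in the manual fix above.
    return f"{raw}\t{region}/{country}/{state}/{division}"


print(fix_rule("North America/USA/Amarillo Tx/"))
```

Values that do not end in a known abbreviation return None, so they fall through to manual curation rather than being guessed.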
3) State misspelled: fix by writing a general rule
Original where 'Connecticut' is misspelled as 'Conneticut' multiple times:
# region/country/division/location region/country/location/division
North America/USA/Conneticut/Bloomfield North America/USA/Bloomfield/Conneticut
North America/USA/Conneticut/Branford North America/USA/Branford/Conneticut
North America/USA/Conneticut/Bristol North America/USA/Bristol/Conneticut
North America/USA/Conneticut/Brooklyn North America/USA/Brooklyn/Conneticut
Replace with generalized rule:
# Misspelled left, corrected right
North America/USA/Conneticut/* North America/USA/Connecticut/*
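To make the effect of a wildcard rule concrete, here is a simplified sketch of how such a rule could be applied to a single region/country/division/location value. It is not the rule engine the ingest pipeline actually uses; apply_rule is a hypothetical function, and it assumes a "*" in the pattern matches any value and a "*" in the replacement keeps the original value.

```python
def apply_rule(record, pattern, replacement):
    """Apply one rule to a 'region/country/division/location' string.

    Returns the record unchanged when the pattern does not match.
    """
    fields = record.split("/")
    pattern_fields = pattern.split("/")
    replacement_fields = replacement.split("/")
    if not (len(fields) == len(pattern_fields) == len(replacement_fields)):
        return record
    # Every non-wildcard pattern field must match exactly.
    for value, wanted in zip(fields, pattern_fields):
        if wanted != "*" and wanted != value:
            return record
    # A wildcard in the replacement keeps the original field value.
    return "/".join(
        value if new == "*" else new
        for value, new in zip(fields, replacement_fields)
    )


fixed = apply_rule(
    "North America/USA/Conneticut/Bristol",
    "North America/USA/Conneticut/*",
    "North America/USA/Connecticut/*",
)
print(fixed)
```

One generalized rule therefore covers Bloomfield, Branford, Bristol, Brooklyn, and any other location recorded under the misspelled state.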
Since migrating to the NCBI datasets to pull public metadata.tsv, the geolocation fields have changed. This commit updates the create_lat_longs script to work with the new fields.
From NCBI datasets, the geolocation fields from general to specific are:
From the ingest pipeline, derive the two-letter state abbreviations and populate the 'state' metadata field.
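As a rough sketch of that derivation, assuming the full state name is available in a field called 'division' (an assumption made for this example, as are the names STATE_ABBREVIATIONS and add_state and the partial lookup table):

```python
# Partial mapping for illustration only; a real table would list every state.
STATE_ABBREVIATIONS = {
    "California": "CA",
    "Connecticut": "CT",
    "Texas": "TX",
    "Washington": "WA",
}


def add_state(record):
    """Return a copy of a metadata record with 'state' derived from 'division'.

    Unknown or missing divisions yield an empty 'state' value.
    """
    updated = dict(record)
    updated["state"] = STATE_ABBREVIATIONS.get(record.get("division", ""), "")
    return updated


print(add_state({"country": "USA", "division": "Texas", "location": "Amarillo"}))
```

Leaving 'state' empty for unrecognized divisions keeps bad values out of the field and makes the remaining curation gaps easy to find.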
The create_lat_longs.py script has been updated accordingly in https://github.com/NW-PaGe/WNV-nextstrain/pull/3/commits/c6b1e0ffed097cb53ddb1a770928f78f3c921d4c, but let me know if you have any questions or want a walk-through.
Description
Migrate to using the geolocation data provided by NCBI datasets, and clean up the curation of USA locations.