nextstrain / WNV

the repository used to build West Nile Virus for nextstrain
https://nextstrain.org/WNV/NA

Update the geolocation rules and lat long processing for NCBI datasets data #3

Closed j23414 closed 8 months ago

j23414 commented 9 months ago

Description

Migrate to the geolocation data provided by NCBI Datasets, and clean up the curation of USA locations.

j23414 commented 9 months ago

Prework

Before addressing the refactor of create_lat_longs.py, it was necessary to clean up the region, country, division and location metadata fields for USA records.

For future reference, the process is documented below:

Identify Potential Geolocation Rules for USA records

We usually provide geolocation rules in the ingest/source-data/geolocation-rules.tsv file, where the left column is the pattern that identifies a geolocation problem and the right column is the corrected value.

To make geolocation problems easier to spot, and to generate potential rules for fixing them, you can run:

cat data/metadata_all.tsv \
  | awk -F'\t' -v OFS='/' 'NR==1 || $4=="USA" {print $3,$4,$5,$6 "\t" $3,$4,$6,$5}' \
  | sort \
  | uniq \
  > potential_geolocation_rules.txt

Then look through potential_geolocation_rules.txt, manually edit the entries, and add any necessary rules to ingest/source-data/geolocation-rules.tsv.
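
For anyone who prefers Python over awk, the same candidate-rule generation could be sketched roughly as follows. This is only an illustration, not part of the ingest pipeline: the function name `potential_rules` is hypothetical, and it assumes (as the awk command does) that columns 3 to 6 of the metadata TSV are region, country, division and location. Unlike the awk command, it skips the header row.

```python
import csv
import io


def potential_rules(metadata_tsv, country="USA"):
    """Return sorted, de-duplicated candidate rules for one country.

    Each rule is 'region/country/division/location<TAB>region/country/location/division',
    i.e. the right-hand side has division and location swapped.
    """
    reader = csv.reader(metadata_tsv, delimiter="\t")
    next(reader, None)  # skip the header row
    rules = set()
    for row in reader:
        if len(row) < 6:
            continue  # ignore short or malformed rows
        region, row_country, division, location = row[2:6]
        if row_country != country:
            continue
        left = "/".join([region, row_country, division, location])
        right = "/".join([region, row_country, location, division])
        rules.add(f"{left}\t{right}")
    return sorted(rules)


# Small in-memory example; the first two column names are made up for illustration.
sample = io.StringIO(
    "strain\tdate\tregion\tcountry\tdivision\tlocation\n"
    "A\t2020\tNorth America\tUSA\tAgoura_Hills\tCalifornia\n"
    "B\t2020\tNorth America\tUSA\tAgoura_Hills\tCalifornia\n"
    "C\t2020\tNorth America\tCanada\tOntario\tToronto\n"
)
for rule in potential_rules(sample):
    print(rule)
```

The output still needs the same manual review as the awk version; the swap on the right-hand side is only a starting point.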

Geolocation rules can be used to fix the following issues:

1) State was listed in location instead of division: fix by swapping the division and location fields

# region/country/division/location  region/country/location/division
North America/USA/Agoura_Hills/California   North America/USA/California/Agoura_Hills # Swapped

2) State only listed as an abbreviation: fix by manually adding the appropriate state name

# region/country/division/location  region/country/location/division
North America/USA/Amarillo Tx/  North America/USA//Amarillo Tx # This is the original potential rule
North America/USA/Amarillo Tx/  North America/USA/Texas/Amarillo Tx # This is the manually fixed rule, by adding "Texas"

3) State misspelled: fix by writing a general rule

Original potential rules, where 'Connecticut' is repeatedly misspelled as 'Conneticut':

# region/country/division/location  region/country/location/division
North America/USA/Conneticut/Bloomfield North America/USA/Bloomfield/Conneticut
North America/USA/Conneticut/Branford   North America/USA/Branford/Conneticut
North America/USA/Conneticut/Bristol    North America/USA/Bristol/Conneticut
North America/USA/Conneticut/Brooklyn   North America/USA/Brooklyn/Conneticut

Replace with generalized rule:

# Misspelled left, corrected right
North America/USA/Conneticut/* North America/USA/Connecticut/*
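
To make the effect of a generalized rule concrete, here is a minimal Python sketch of how such rules could be applied to a record. It is not the ingest pipeline's actual implementation: `parse_rule` and `apply_rules` are hypothetical helpers, and the wildcard semantics are an assumption (a `*` on the left matches any value, and a `*` on the right keeps the record's original value).

```python
def parse_rule(line):
    """Split 'a/b/c/d<TAB>w/x/y/z' into two lists of four fields each."""
    left, right = line.split("\t")
    return left.split("/"), right.split("/")


def apply_rules(record, rules):
    """Return the corrected (region, country, division, location) tuple.

    The first rule whose left side matches wins. Assumed semantics:
    '*' on the left matches anything; '*' on the right keeps the original value.
    """
    for left, right in rules:
        if all(pattern == "*" or pattern == value
               for pattern, value in zip(left, record)):
            return tuple(value if fixed == "*" else fixed
                         for fixed, value in zip(right, record))
    return tuple(record)  # no rule matched: leave the record unchanged


rules = [parse_rule(line) for line in [
    "North America/USA/Conneticut/*\tNorth America/USA/Connecticut/*",
    "North America/USA/Amarillo Tx/\tNorth America/USA/Texas/Amarillo Tx",
]]

print(apply_rules(("North America", "USA", "Conneticut", "Bristol"), rules))
print(apply_rules(("North America", "USA", "Amarillo Tx", ""), rules))
```

With one wildcard rule, every 'Conneticut' record is corrected regardless of its location, which is why the four per-city rules can be dropped. The sketch splits on `/`, so it would not cope with place names that themselves contain a slash.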

j23414 commented 9 months ago

Actual work

Since we migrated to NCBI Datasets for pulling the public metadata.tsv, the geolocation fields have changed. This commit updates the create_lat_longs.py script to work with the new fields.

From NCBI datasets, the geolocation fields from general to specific are:

In the ingest pipeline, derive the two-letter state abbreviations and populate the 'state' metadata field.
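
As a rough illustration of that step, a lookup-based sketch in Python might look like the following. This is an assumption about the approach, not the code in the linked commit: the helper names are hypothetical, the table is deliberately partial, it assumes the 'division' field already holds the full state name after curation, and leaving 'state' empty for unknown or non-USA records is a choice made here for the example.

```python
# Deliberately partial lookup table for illustration; a real table would
# cover every state and territory.
STATE_ABBREVIATIONS = {
    "California": "CA",
    "Connecticut": "CT",
    "Texas": "TX",
}


def state_abbreviation(country, division):
    """Return the two-letter abbreviation, or '' if it cannot be derived."""
    if country != "USA":
        return ""
    return STATE_ABBREVIATIONS.get(division.strip(), "")


def add_state(record):
    """Return a copy of a metadata record (dict) with a 'state' field added."""
    updated = dict(record)
    updated["state"] = state_abbreviation(
        record.get("country", ""), record.get("division", "")
    )
    return updated


print(add_state({"country": "USA", "division": "Texas", "location": "Amarillo Tx"}))
```

Because the lookup is an exact match on the state name, misspellings such as 'Conneticut' need to be fixed by the geolocation rules before this step runs.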

The create_lat_longs.py script has been updated accordingly in https://github.com/NW-PaGe/WNV-nextstrain/pull/3/commits/c6b1e0ffed097cb53ddb1a770928f78f3c921d4c. Let me know if you have any questions or want a walk-through.