nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

Should `parse-genbank-location` automatically fix region/locality mix ups? #1578

Open joverlee521 opened 4 weeks ago

joverlee521 commented 4 weeks ago

Currently, parse-genbank-location strictly follows GenBank's documented pattern for geo_loc_name:

https://github.com/nextstrain/augur/blob/66e903af7eafe07c0418d8cc065a1c754a833caf/augur/curate/parse_genbank_location.py#L19-L23

However, GenBank records don't always follow this pattern, as shown in https://github.com/nextstrain/rabies/issues/10.

Should parse-genbank-location automatically fix these region/locality mix-ups? We've previously done this in ncov-ingest specifically for USA locations by checking for US state codes, but we could do a more generalized check with something like pycountry.
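To make the proposal concrete, here is a minimal sketch of what such a check could look like. It assumes the documented pattern is roughly `country: region, locality`, and it uses a small hard-coded table of US subdivisions as a stand-in for whatever pycountry would provide. The function names, the table, and the example value `USA: Tucson, AZ` are all hypothetical and not taken from augur or ncov-ingest.

```python
# Hypothetical stand-in for a subdivision lookup; in practice this table
# might be built from a library such as pycountry instead of hard-coded.
US_SUBDIVISIONS = {"AZ": "Arizona", "WA": "Washington"}


def parse_geo_loc_name(value):
    """Split a 'country: region, locality' string into its three parts.

    Missing parts come back as empty strings.
    """
    country, _, rest = value.partition(":")
    region, _, locality = rest.partition(",")
    return country.strip(), region.strip(), locality.strip()


def looks_swapped(region, locality, subdivisions):
    """Return True when the locality matches a known subdivision code or
    name while the region does not, which suggests the two were swapped."""
    known = {code.casefold() for code in subdivisions}
    known |= {name.casefold() for name in subdivisions.values()}
    return locality.casefold() in known and region.casefold() not in known


country, region, locality = parse_geo_loc_name("USA: Tucson, AZ")
if looks_swapped(region, locality, US_SUBDIVISIONS):
    # A fixing variant would swap the two fields here.
    region, locality = locality, region
```

A check like this only catches swaps where the locality is a recognizable subdivision, so a city that shares its name with a state or province would still need manual review.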

genehack commented 3 weeks ago

The alternative would be warning loudly and providing instructions on how to use the geo location file to override bad annotations?

I always worry that automatically fixing things like this will actually inject difficult-to-detect errors.

joverlee521 commented 2 weeks ago

> The alternative would be warning loudly and providing instructions on how to use the geo location file to override bad annotations?
>
> I always worry that automatically fixing things like this will actually inject difficult-to-detect errors.

That's fair! We'd still have to use something like pycountry to detect these mix-ups to warn the users about them.
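A warn-only variant could look something like the sketch below: it leaves the parsed fields untouched and only reports the suspected mix-up, so the user can decide whether to override it. The function name, the message wording, the record ID, and the set of known subdivisions are hypothetical, with the set standing in for a lookup built from something like pycountry.

```python
import warnings


def warn_if_swapped(record_id, region, locality, known_subdivisions):
    """Warn about a suspected region/locality swap without modifying anything.

    Returns True when a warning was emitted.
    """
    if locality in known_subdivisions and region not in known_subdivisions:
        warnings.warn(
            f"{record_id}: locality {locality!r} looks like a region and "
            f"region {region!r} does not; the fields may be swapped. "
            "Consider overriding this record's location manually."
        )
        return True
    return False


# Hypothetical set of known subdivisions and a made-up record ID.
KNOWN = {"AZ", "Arizona"}
flagged = warn_if_swapped("record-1", "Tucson", "AZ", KNOWN)
```

Because nothing is rewritten, a false positive costs only a noisy warning rather than a silently wrong location.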