Open joverlee521 opened 4 weeks ago
The alternative would be warning loudly and providing instructions on how to use the geo location file to override bad annotations?
I always worry that automatically fixing things like this will actually inject difficult-to-detect errors.
That's fair! We'd still have to use something like pycountry to detect these mix-ups to warn the users about them.
Currently, `parse-genbank-location` strictly follows GenBank's documented pattern for geo_loc_name: https://github.com/nextstrain/augur/blob/66e903af7eafe07c0418d8cc065a1c754a833caf/augur/curate/parse_genbank_location.py#L19-L23
However, the GenBank records don't always follow this pattern as shown in https://github.com/nextstrain/rabies/issues/10.
Should `parse-genbank-location` automatically fix these region/locality mix-ups? We've previously done this in ncov-ingest specifically for USA locations by checking for US state codes, but we could do a more generalized check with something like pycountry.
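A rough sketch of what a detect-and-warn check could look like, assuming the `country: division, location` layout for geo_loc_name. The names `KNOWN_DIVISIONS`, `parse_location`, and `find_mixup` are hypothetical and not part of augur, the example strings are made up rather than taken from real records, and the small hard-coded table only stands in for whatever a library like pycountry would provide.

```python
# Hypothetical lookup table for illustration only. A real check would
# likely build this from a subdivision database such as pycountry.
KNOWN_DIVISIONS = {
    "USA": {"AZ", "Arizona", "CA", "California"},
}


def parse_location(geo_loc_name):
    """Split 'country: division, location' into three stripped strings."""
    country, _, rest = geo_loc_name.partition(":")
    division, _, location = rest.partition(",")
    return country.strip(), division.strip(), location.strip()


def find_mixup(geo_loc_name, known_divisions=KNOWN_DIVISIONS):
    """Return a warning message if division and location look swapped.

    Returns None when no mix-up is detected or the country is not in
    the lookup table.
    """
    country, division, location = parse_location(geo_loc_name)
    divisions = known_divisions.get(country, set())
    # Flag only when the second field is a known division and the
    # first field is not, to avoid flagging ambiguous names.
    if location in divisions and division not in divisions:
        return (
            f"Possible region/locality mix-up in {geo_loc_name!r}: "
            f"{location!r} looks like a division of {country}."
        )
    return None


if __name__ == "__main__":
    for value in ["USA: Tucson, AZ", "USA: Arizona, Tucson", "France"]:
        print(value, "->", find_mixup(value))
```

This only reports the suspected mix-up and leaves the record unchanged, so it would work for either approach discussed here: warn and point users at the geo location file, or use the same detection as the trigger for an automatic swap.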