Closed spoonerf closed 2 days ago
We shouldn't automatically assign entities that happen to have the same name as OWID continents to those entities, e.g., Africa in a new dataset currently automatically becomes OWID Africa, and it may actually be defined differently. Not a huge deal but it is an extra step to remember.
@spoonerf are you sure this is happening in the CLI? I tried it and `Africa` became `Africa`.
Ah, sorry, I didn't mean it literally becomes 'OWID Africa'. I just meant that Africa, as defined in many of our input datasets, will not be made up of the same countries as in our definition (and the same applies to the other continents).
So it would be good if the country harmonizer treated continent entities differently, so that we always have to check whether they match our definition or need a separate entity, e.g., whether Africa should be Africa ([DATASET NAME]).
For example, in this recent PR, the data is only regional (including Africa, Asia, Europe), but the regions aren't defined anywhere, so we should be careful not to assume that they are the same as ours.
Currently, you can use e.g. `--institution UN`, which will ask you about continents and suggest `Continent (UN)` as the first choice. Does that help? We could also explicitly ask for an institution in `harmonize` as the first question.
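To make the suggestion order concrete, here is a minimal sketch of how institution-specific names could be offered first. The function name `suggest_names` and its signature are hypothetical and are not taken from the actual harmonizer code.

```python
from __future__ import annotations


def suggest_names(entity: str, institution: str | None = None) -> list[str]:
    """Return candidate harmonized names (hypothetical helper).

    If an institution is given, the institution-specific name comes first,
    followed by the plain entity name.
    """
    suggestions = []
    if institution:
        suggestions.append(f"{entity} ({institution})")
    suggestions.append(entity)
    return suggestions
```

For example, `suggest_names("Africa", "UN")` would put `Africa (UN)` ahead of `Africa`.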
Ah cool, I didn't know that existed! I just tried it out and it only asks about continents that don't match our existing ones: it asked me about 'Americas', but not about the others, e.g. Africa, Asia, Europe.
I think the ideal situation would be that it always asks about Asia, Europe, North America, South America and Oceania and that these are not automatically mapped. However, if this is tricky, I'd say it's a pretty minor pain point, so maybe not worth spending too much time on.
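A rough sketch of that behaviour, in case it helps the discussion. Everything here is hypothetical: the names `OWID_CONTINENTS` and `split_entities`, and the set of continent names itself, are assumptions for illustration and not the harmonizer's real code.

```python
from __future__ import annotations

# Assumed list of continent names that should never be mapped automatically.
OWID_CONTINENTS = {"Africa", "Asia", "Europe", "North America", "South America", "Oceania"}


def split_entities(entities: list[str], known: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Split input entities into automatic mappings and ones to ask about.

    `known` maps input names to harmonized names. Continent names always go
    to the "ask" list, even when an exact match exists in `known`.
    """
    auto: dict[str, str] = {}
    ask: list[str] = []
    for entity in entities:
        if entity.strip() in OWID_CONTINENTS or entity not in known:
            ask.append(entity)
        else:
            auto[entity] = known[entity]
    return auto, ask
```

With this split, 'Africa' would be queued for a prompt just like 'Americas', while an ordinary exact match such as a country name would still be mapped without asking.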
I had a couple of ideas for small improvements to the country harmonizer:
Add an 'exclude' option (similar to `skip` and `custom`) to the harmonizer CLI, which would automatically add the entity to `excluded_countries.json`.
We shouldn't automatically assign entities that happen to have the same name as OWID continents to those entities, e.g., Africa in a new dataset currently automatically becomes OWID Africa, and it may actually be defined differently. Not a huge deal but it is an extra step to remember.
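For the 'exclude' option, a minimal sketch of the file update could look like the following. It assumes `excluded_countries.json` holds a flat JSON list of entity names. The helper name `add_to_excluded` is hypothetical, and how it would be wired into the CLI prompt is left open.

```python
import json
from pathlib import Path


def add_to_excluded(entity: str, path: Path) -> list[str]:
    """Append `entity` to the JSON list at `path` (hypothetical helper).

    Creates the file if it does not exist and avoids duplicate entries.
    Assumes the file contains a flat JSON list of names.
    """
    excluded = json.loads(path.read_text()) if path.exists() else []
    if entity not in excluded:
        excluded.append(entity)
    path.write_text(json.dumps(excluded, indent=2, ensure_ascii=False) + "\n")
    return excluded
```

Choosing 'exclude' for an entity in the prompt would then call something like `add_to_excluded(entity, path_to_excluded_file)` instead of recording a mapping.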