owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

Country harmonizer improvements #3457

Closed spoonerf closed 2 days ago

spoonerf commented 2 weeks ago

I had a couple of ideas for small improvements to the country harmonizer:

Marigold commented 4 days ago

> We shouldn't automatically assign entities that happen to have the same name as OWID continents to those entities, e.g., Africa in a new dataset currently automatically becomes OWID Africa, and it may actually be defined differently. Not a huge deal but it is an extra step to remember.

@spoonerf are you sure this is happening in the CLI? I tried it and Africa became Africa.

spoonerf commented 4 days ago

Ah, sorry, I didn't mean it literally becomes 'OWID Africa'; I just meant that Africa, as defined in many of our input datasets, may not be made up of the same countries as in our definition (and the same goes for the other continents).

So it would be good if we could treat the continent entities differently in the country harmonizer so that we always have to check if they are the same or if they should be treated differently, e.g., if Africa should be Africa ([DATASET NAME]).

For example, in this recent PR, the data is only regional (including Africa, Asia, Europe), but their regions aren't defined anywhere, so we should be careful not to assume that these regions are the same as ours.

Marigold commented 4 days ago

Currently, you can pass e.g. --institution UN, which will ask you about continents and suggest Continent (UN) as the first choice. Does that help? We could also explicitly ask for an institution in harmonize as the first question.
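
To make the suggestion ordering concrete, here is a rough sketch of the idea. It is not the actual harmonizer code: `suggest_names` is a hypothetical helper made up for illustration, and it only shows the institution-qualified name being offered ahead of the plain one.

```python
from typing import List, Optional


def suggest_names(entity: str, institution: Optional[str] = None) -> List[str]:
    """Return candidate names for a region, most preferred first.

    Hypothetical helper, not part of the etl codebase.
    """
    suggestions = []
    if institution:
        # With an institution, the qualified name is suggested first.
        suggestions.append(f"{entity} ({institution})")
    # The plain name stays available as a fallback choice.
    suggestions.append(entity)
    return suggestions


print(suggest_names("Africa", "UN"))
```

Without an institution, the helper falls back to the plain name only.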

spoonerf commented 4 days ago

Ah cool, I didn't know that existed! I just tried it out, and it only asks about continents that don't match our existing ones: it asked me about 'Americas', but not about Africa, Asia, or Europe.

I think the ideal situation would be that it always asks about Asia, Europe, North America, South America and Oceania and that these are not automatically mapped. However, if this is tricky, I'd say it's a pretty minor pain point, so maybe not worth spending too much time on.
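
Roughly what I have in mind, as a sketch only. `ALWAYS_ASK` and `split_mapping` are hypothetical names, not anything in the current harmonizer, and I've assumed Africa belongs in the list too, given the discussion above. The point is just that continent names are never auto-mapped, even when an exact match exists.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed set of continent names that should always be confirmed by hand.
ALWAYS_ASK = {"Africa", "Asia", "Europe", "North America", "South America", "Oceania"}


def split_mapping(
    entities: Iterable[str], known: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Split entities into auto-mapped ones and ones the user must confirm."""
    auto: Dict[str, str] = {}
    ask: List[str] = []
    for name in entities:
        if name in ALWAYS_ASK or name not in known:
            # Continents and unknown names both go to the interactive prompt.
            ask.append(name)
        else:
            auto[name] = known[name]
    return auto, ask


auto, ask = split_mapping(
    ["France", "Africa", "Americas"],
    {"France": "France", "Africa": "Africa"},
)
print(auto, ask)
```

Here 'Africa' ends up in the list to ask about even though it has an exact match, which is the behaviour I'd like.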