owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License
79 stars 21 forks source link

Redefine countries-regions dataset #779

Closed pabloarosado closed 1 year ago

pabloarosado commented 1 year ago

We need to redefine our current countries-regions dataset.

Requirements

Issues

Other considerations

Related issues

Let's propose alternatives in the comments.

pabloarosado commented 1 year ago

A few general suggestions:

  1. I think that having an owid namespace in ETL is not very useful. We should instead use more descriptive namespaces, like demography.
  2. It's also not very useful to have table names key_indicators and reference. We should instead use more descriptive names.
  3. We should keep versions for all datasets. It's problematic to have a "latest" population dataset, because, when it changes, it affects many downstream datasets. It should be up to us whether a dataset needs to have a version change or not (e.g. if it's a minor change, we may decide to keep the same version).
pabloarosado commented 1 year ago

Proposal 1

Table region_definitions

Contains the list of harmonized countries and regions, with their country codes. This dataset is static, so, even if it has a version, it will probably almost never change.

+-------------+--------------+---------------+-----------------+
| region      | owid_code    | region_type   | is_historical   |
+=============+==============+===============+=================+
| Afghanistan | AFG          | country       | False           |
+-------------+--------------+---------------+-----------------+
| France      | FRA          | country       | False           |
+-------------+--------------+---------------+-----------------+
| Russia      | RUS          | country       | False           |
+-------------+--------------+---------------+-----------------+
| Spain       | ESP          | country       | False           |
+-------------+--------------+---------------+-----------------+
| Europe      | OWID_EUR     | continent     | False           |
+-------------+--------------+---------------+-----------------+
| USSR        | OWID_USS     | country       | True            |
+-------------+--------------+---------------+-----------------+
| Andalusia   | OWID_ESP_AND | sub_country   | False           |
+-------------+--------------+---------------+-----------------+

Here I have included region_type and is_historical, but we could include other things, like ISO2, ISO3, etc., although I've never used most of them. We could also include special regions, e.g. "Southern hemisphere" (an entity in the dataset of temperature anomaly), or custom regions defined by institutions, e.g. "Europe (BP)", with code BP_EUR.

Table region_aliases

Contains region harmonized name and alias.

+---------------+-------------------+
| region        | alias             |
+===============+===================+
| United States | US                |
+---------------+-------------------+
| United States | USA               |
+---------------+-------------------+
| Italy         | Republic of Italy |
+---------------+-------------------+

Table region_members

Contains region harmonized name and a member of that region.

+---------------------+---------------+
| region              | member        |
+=====================+===============+
| North America       | United States |
+---------------------+---------------+
| Europe              | France        |
+---------------------+---------------+
| Europe              | Spain         |
+---------------------+---------------+
| European Union (27) | Spain         |
+---------------------+---------------+
| Europe              | Russia        |
+---------------------+---------------+
| Europe              | USSR          |
+---------------------+---------------+
| Europe              | East Germany  |
+---------------------+---------------+
| Spain               | Andalusia     |
+---------------------+---------------+

Note: Here we would not add "memberships" like ("USSR", "Armenia"), since we consider Armenia was not a part of the USSR, but instead did not exist until the USSR dissolved.

Table region_transitions

Contains the harmonized name of a region, the harmonized name of a successor, and the year when the transition occurred.

+--------------+-------------+--------+
| region       | successor   |   year |
+==============+=============+========+
| USSR         | Armenia     |   1991 |
+--------------+-------------+--------+
| USSR         | Azerbaijan  |   1991 |
+--------------+-------------+--------+
| USSR         | Russia      |   1991 |
+--------------+-------------+--------+
| East Germany | Germany     |   1990 |
+--------------+-------------+--------+
| West Germany | Germany     |   1990 |
+--------------+-------------+--------+

Note: We should assert that each historical region has only one transition year. For example, USSR would dissolve in 1991. If we wanted to have one USSR with Lithuania until 1990, and one without, we'd need to define two different USSR's (e.g. USSR (pre 1990) and USSR (1990)), but this may be an over-complication.

pabloarosado commented 1 year ago

To be more specific, we can distinguish two versions of the previous proposal:

I think I'd prefer Proposal 1b, although this implies we need to be a bit more careful deciding what a minor and a major update should be. For example, adding an alias should not imply a new version. But maybe adding a new historical transition would be.

pabloarosado commented 1 year ago

Proposal 2

All information (on definitions, aliases, members and historical transitions) is packed into just one dataset with one table, called simply regions.

+---------------+-------------+---------------+------------+--------------------------------+----------------------------+----------------------------------+
| country       | owid_code   | region_type   | end_year   | aliases                        | members                    | successors                       |
+===============+=============+===============+============+================================+============================+==================================+
| United States | USA         | country       | <NA>       | ['United States', 'US', 'USA'] | []                         | []                               |
+---------------+-------------+---------------+------------+--------------------------------+----------------------------+----------------------------------+
| Italy         | ITA         | country       | <NA>       | ['Republic of Italy']          | []                         | []                               |
+---------------+-------------+---------------+------------+--------------------------------+----------------------------+----------------------------------+
| USSR          | USS         | country       | 1991       | ['Soviet Union']               | []                         | ['Armenia', 'Azerbaijan', '...'] |
+---------------+-------------+---------------+------------+--------------------------------+----------------------------+----------------------------------+
| Europe        | OWID_EUR    | continent     | <NA>       | []                             | ['Spain', 'France', '...'] | []                               |
+---------------+-------------+---------------+------------+--------------------------------+----------------------------+----------------------------------+

This may be more convenient for maintenance (although, like in Proposal 1b it may require to be careful deciding what a minor and major update should be). Additionally, given that it could fully contain all columns in our current countries_regions.csv table, it would make the refactor a bit easier.

A downside of this, compared to Proposal 1 is that, if we add a sub-country region, e.g. "Andalusia", we also need to edit previous entries accordingly, e.g. "Spain", "Europe", "European Union" would now have "Andalusia" added to the list of members. On the other hand, in Proposal 1 we would just add a new row to region_members.

pabloarosado commented 1 year ago

Regardless of the chosen proposal for primary tables on region definitions, aliases, members and transitions, we would have:

Other primary tables

Like currently, we would also have a table for region_population and region_area. They would contain region harmonized name, year, and population/area. And it can also include estimates of certain regions into the past or future, like Russia going back before 1991, or World going into the future.

Other derived tables

Having just the previous definitions of primary tables would let us do all our usual operations, like country name harmonization, or region aggregates, "on the flight" (i.e. without needing to define derived tables anywhere).

For example, imagine we have a dataframe with rows for France, Spain, USSR, and Russia, for 1990, 1991, and 1992. Then if we wanted to build the aggregate for Europe, our function to create region aggregates could (from minimum requirements to nice-to-have):

  1. Load that dataframe and the region_transitions table, and raise a warning saying that Historical region USSR overlaps with successor Russia on years [1990, 1991, 1992], and sum data for France, Spain and Russia for 1990, 1991, and 1992. For this, we don't even need to know the transition year (although it's good to keep it to check that regions are well-defined). The function would also ingest an optional dictionary (e.g. known_overlaps) that would check for expected overlaps, and ignore warnings for them (for each overlap, we may also need to know whether we want to add the contribution of both regions, e.g. for transition years, or just of one of them).
  2. Same as 1, but also prioritise successors' data after the transition year, and historical region prior to that year, so it would add data for France, Spain and USSR for 1990 and 1991, and France, Spain and Russia for 1991.
  3. Same as 1 and 2, but also load region_members and raise a warning if there's an overlap between country and sub-country regions (e.g. Spain and Andalusia). But this may be an over-complication, at least for now.
  4. Output an additional dataframe that contains, for each region, year, and variable, the list of countries that were informed. This could be stored as an auxiliary table in the dataset, that would let us decide how to compute, e.g. per capita variables (dividing only by the population of informed countries for each variable).
lucasrodes commented 1 year ago

Hi @pabloarosado, thanks for adding so much value to the discussion with your comments and proposals.

Overall, I have a preference for proposal 1. I think it is more flexible and can adapt to many more use cases.

IMHO, we should use owid_code in all tables instead of the region names themselves. If we were to change some name in the future (e.g. TimorTimor-Leste) this should be simple: only change one entry in region_definitions (and not in all tables).

Find some of my thoughts below.

On proposal 1

region_definitions

+-------------+--------------+---------------+-----------------+------------+
| region      | owid_code    | region_type   | is_historical   | defined_by |
+=============+==============+===============+=================+============+
| Afghanistan | AFG          | country       | False           |     UN     |
+-------------+--------------+---------------+-----------------+------------+
| France      | FRA          | country       | False           |     UN     |
+-------------+--------------+---------------+-----------------+------------+
| Russia      | RUS          | country       | False           |     UN     |
+-------------+--------------+---------------+-----------------+------------+
| Spain       | ESP          | country       | False           |     UN     |
+-------------+--------------+---------------+-----------------+------------+
| Europe      | OWID_EUR     | continent     | False           |    OWID    |
+-------------+--------------+---------------+-----------------+------------+
| Europe      | EXT_BP_EUR   | continent     | False           |     BP     |
+-------------+--------------+---------------+-----------------+------------+
| USSR        | OWID_USS     | country       | True            |     UN     |
+-------------+--------------+---------------+-----------------+------------+
| Andalusia   | OWID_ESP_AND | sub_country   | False           |     UN     |
+-------------+--------------+---------------+-----------------+------------+

region_aliases

This table would be perfect as you present it.

region_members

This one is challenging, as it would be a many-to-many table. This does not make it necessarily a bad choice, just that people should be aware of this when working with it.

region_transitions

LGTM

additional tables

We could consider adding extra tables:


Additional comments

Overall, I agree that Proposal 1b makes more sense. As we change something in one table, we need to run sanity checks on others to ensure the whole set of tables is consistent. Therefore, keeping all these tables under the same dataset would be nice. It is easier and cleaner to ensure consistency across a set of tables within a dataset than across a set of datasets.

As I see with this structure, to obtain the list of countries for a continent for a specific year, we would need to:

Right?

While it is true that I prefer this structure, we need solid and robust sanity checks to make sure that they are consistent.

JoeHasell commented 1 year ago

Hey @pabloarosado, one small thought to throw in.

It wasn’t clear to me exactly how ‘complete’ a mapping you have in mind with region_transitions.

But I just wanted to flag – having looked into it a couple of times – that this is a bit of a can of worms. Saying when a state existed and when it didn’t, when a state was actually a component of another state etc. is quite a fuzzy thing. In social science research, people rely on the efforts of researchers who stick their necks on the line to come up with a ‘state system’. Gledistch and Ward is one such system that gets used, but there are others. And they can disagree, or result in counter-intuitive classifications because they're trying to use some fixed definition. Bastian has also had the idea of coming up with OWID maintained set of historical boundaries. It's possibly something we could do. But I just wanted to flag that it would be quite a big research undertaking – the kind of thing that people publish in journal articles. And overall I am a bit pessimistic that this is a good use of our time.

Moreover, even if we were to come up with or adopt a ‘state system’, there is always the ambiguity that we do not know if the providers of a given dataset we are using have the same state system in mind. Or, more likely, if they have no particular state system in mind at all, or their data is a funny mix of different elements from various geographical boundaries.

Just as one example of the mess: the WID data on inequality that Pablo A is working on, for say Germany. To find out what ‘Germany’ means here you have to dig up the accompanying research paper (and there isn’t even a clear mapping of countries to research papers). There you find that their definition of Germany is using ‘prevailing borders’ (without defining what that means). I would say that's a good example of documentation. Often you wouldn't even get that much.

As I say, I'm not sure exactly what you have in mind with it, so maybe these thoughts aren't so relevant. Just wanted to flag the very messy nature of it in my experience, in case that's helpful context.

pabloarosado commented 1 year ago

Hi @lucasrodes thanks for the suggestions. I think I agree with you all your suggestions. What I'm not sure about is whether 1b or 2 is better (I agree that 1a is inferior for the reasons you said).

We can easily implement as many tests in 1b as in 2, since it's all going to be run on the same data step. I think that, if these were tables in a database, 1b would be much preferable. But in practice, we are talking about relatively small csv files, so it feels convenient to have all we need related to regions in just one place (aka Proposal 2). But making changes and reviewing them would be much clearer in 1b. So I'm not sure, it would be good to have some more opinions to decide.

pabloarosado commented 1 year ago

Hi @JoeHasell thanks for pointing out these issues. Please note that the region_transitions table, in principle, is only needed when creating region aggregates. Other than that, I don't think we'll be using it much, and as a general rule, we'll always adopt the criteria of the individual data providers.

In other words, we will always show data as given by the original source (for example, if they extend Russia or Germany to many years back before 1990, we'll show exactly that). But when we construct aggregated data for, e.g. "Europe", we want to stick to some definitions, because we need to ensure that:

  1. We are not double-counting regions (e.g. USSR and Russia, or East/West Germany and Germany).
  2. "Europe" means the same set of countries in all our charts on our site.

So, having a table with reasonable definitions of transitions and successor countries would be helpful, even if it's not totally historically accurate. The same applies when constructing aggregates for income groups. And, if a specific dataset has very unusual definitions and it's absolutely incompatible with our definitions, then we would simply not build aggregates for that dataset. Does it sound more reasonable now? Thanks.

danyx23 commented 1 year ago

Great write-up, thanks a lot! I think I lean towards 1b. In theory we could also consider using a small SQLite db for keeping the various tables in sync with each other "for free" (i.e. just with using foreign key constraints) but that is probably too different to what we have so far in the catalog.

I agree that for the region_transitions it makes a lot of sense to be very pragmatic and basically view this through the lens of "what data do we have where historic regions are important" and try to create a sensible solution for that. I would guess that for the EU this would also be useful.

I'll digest this a bit and then will probably have some more comments here or in an applicable call.

Marigold commented 1 year ago

I don't have anything creative to add. Just perhaps that having the definitions stored as YAML and then creating your tables on the fly might be more flexible (e.g. you could have "added" or "removed" members for periods) and easier to manage than CSV files. It could look something like this (data is most likely wrong).

- country: France
  iso: FRA

- country: Czechia
  iso: CZE
  aliases: [Czech Republic]

- country: Czechoslovakia
  iso: CSK
  periods:
    - years: 1918-1992
      members: [CZE, SVK]
  aliases: [Former Czechoslovakia]

- country: Russia
  iso: RUS
  aliases: [Russian Federation]
  periods:
    - years: 1922-1992
      members: [SU]
    - years: 1992-

- country: USSR
  iso: SU
  periods:
    - years: 1922-1991
      members: [CSK]

- country: Europe
  iso: OWID_EUR
  periods:
    - members: [ALA, ALB, AND, AUT, OWID_AUH, OWID_BAD, OWID_BAV, BLR, BEL, BIH, BGR, OWID_CIS, HRV, CZE, OWID_CZS, DNK, OWID_GDR, EST, FRO, FIN, FRA, DEU, GIB, GRC, GGY, OWID_HAN, OWID_HSE, OWID_HSG, HUN, ISL, IRL, IMN, ITA, JEY, OWID_KOS, LVA, LIE, LTU, LUX, MLT, OWID_MEC, OWID_MOD, MDA, MCO, MNE, NLD, MKD, NOR, OWID_PMA, POL, PRT, ROU, RUS, SMR, OWID_SAX, SRB, SVK, SVN, ESP, SJM, SWE, CHE, OWID_TUS, OWID_SIC, UKR, GBR, VAT, OWID_GFR, OWID_WRT, CYP, OWID_USS, OWID_SRM, OWID_TRS, OWID_YGS, OWID_SEK]

- country: European Union (27)
  iso: OWID_EU27
  aliases: [EU, EU-27, EU27]
  periods:
    - years: 1995-2003
      members: [AUT, BEL, BGR, HRV, CYP, CZE, DNK, EST, FIN, FRA, DEU, GRC, HUN, IRL, ITA, LVA, LTU, LUX, MLT, NLD, POL, PRT, ROU, SVK, SVN, ESP, SWE]
    - years: 2004-
      added_members: [CZE]
pabloarosado commented 1 year ago

Thanks everyone. I'll try to put together here all the previous suggestions (plus Max's comment saying that some country names are too long and make Marimekko charts worse).

Proposal 3

As @Marigold suggested, all data could be handled in a single (possibly big) yaml file adjacent to the data step. The generated dataset (called regions) in the demography namespace would contain the following tables:

MINOR UPDATE:

Table definitions

Changes with respect to 1b:

+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| region_code  | name                         | short_name   | region_type   | is_historical   | defined_by   |
+==============+==============================+==============+===============+=================+==============+
| owid_afg     | Afghanistan                  | Afghanistan  | country       | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_fra     | France                       | France       | country       | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_rus     | Russia                       | Russia       | country       | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_esp     | Spain                        | Spain        | country       | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_eur     | Europe                       | Europe       | continent     | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_uss     | Former USSR                  | USSR         | country       | True            | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_esp_and | Andalusia                    | Andalusia    | sub_country   | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| bp_eur       | Europe (BP)                  | Europe (BP)  | continent     | False           | bp           |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_drc     | Democratic Republic of Congo | DR Congo     | country       | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+

Table aliases

Changes:

+-------------+-------------------+
| region_code | alias             |
+=============+===================+
| owid_us     | United States     |
+-------------+-------------------+
| owid_us     | US                |
+-------------+-------------------+
| owid_us     | USA               |
+-------------+-------------------+
| owid_ita    | Italy             |
+-------------+-------------------+
| owid_ita    | Republic of Italy |
+-------------+-------------------+

Table members

Changes:

+-------------+--------------+
| region_code | member_code  |
+=============+==============+
| owid_nam    | owid_usa     |
+-------------+--------------+
| owid_eur    | owid_fra     |
+-------------+--------------+
| owid_eur    | owid_esp     |
+-------------+--------------+
| owid_eu27   | owid_esp     |
+-------------+--------------+
| owid_eur    | owid_rus     |
+-------------+--------------+
| owid_eur    | owid_uss     |
+-------------+--------------+
| owid_eur    | owid_egr     |
+-------------+--------------+
| owid_esp    | owid_esp_cat |
+-------------+--------------+

Table transitions

Changes:

+---------------+------------------+--------+
| region_code   | successor_code   |   year |
+===============+==================+========+
| owid_uss      | owid_arm         |   1991 |
+---------------+------------------+--------+
| owid_uss      | owid_aze         |   1991 |
+---------------+------------------+--------+
| owid_uss      | owid_rus         |   1991 |
+---------------+------------------+--------+
| owid_egr      | owid_ger         |   1990 |
+---------------+------------------+--------+
| owid_wgr      | owid_ger         |   1990 |
+---------------+------------------+--------+

Additional table legacy

We could add here all the codes (including wikidata urls) that are in the current countries_regions.csv file, although I don't know if we ever use most of them.

+-------------+--------------+--------------+
| region_code | iso_alpha2   | iso_alpha3   |
+=============+==============+==============+
| owid_ita    | IT           | ITA          |
+-------------+--------------+--------------+
| owid_uss    | SU           | USS          |
+-------------+--------------+--------------+

Comments

lucasrodes commented 1 year ago

LGTM!

Minor amends:

On your questions

  1. @lucasrodes I haven't implemented your suggestion about region type ids. Do you have a strong reason why we should add that abstraction?

Not really. I thought it could be nice if we ever wanted to change region-type names. But that seems unlikely. No strong reason.

  1. @lucasrodes regarding region start and end year, if those are fixed quantities, then shouldn't they be added to definitions? I'm also not sure if we need them. Most of the start years will be left empty to begin with (it would be quite complicated to come up with a start year for all regions, in the same way that it's tricky to find transition years). Possibly what we have in transitions is just enough. Or can you think of a reason why we definitely need a start_year?

I am not sure if I'd put it in the definitions table, as (i) several entities would have NaNs and (ii) a single entity could appear more than once (e.g. a country is born, then occupied for 100 years, and re-born?).

I think that this does not look super urgent, and I am probably biased because I am working on the History of War project (https://github.com/owid/owid-issues/issues/443). We can think a bit more on this on a separate issue and proceed with what you have proposed.

danyx23 commented 1 year ago

Short names make sense and a column like that is a good solution IMHO.

For the owid codes - for downstream users it might be nicer if we stay with using iso alpha-3 3 letter codes where our definition is the same as the iso alpha one, don't you think? Then you could join the bulk of our data easily and only special entities like country groupings, historic entities etc would have to have additional matching steps.

How should we go about creating owid codes for changing entities like the EU? Should we have two Eu definitions, like this?

Or should we have several definitions like OWID_EU_1973_1980, OWID_EU_1981_1985. Datasets that don't bother with the changing composition could then reference e.g. OWID_EU_2021. (ah, how do you deal with the latest bracket?)

We could also combine both and group these year spanning entities together over time into OWID_EU. Let's chat about this in the call.

pabloarosado commented 1 year ago

Hi @lucasrodes just a minor clarification on start years. If a region disappears and appears again, then the new one should be considered a different region. This is similar to the example I mentioned before about the USSR: Lithuania and a few countries left in 1990, and others left in 1991. We currently simplify the definition by saying that all countries left in 1991. But otherwise, we would need to define owid_uss_1922_1990 (named, e.g. USSR (pre 1990)) and owid_uss_1990_1991 (named, e.g. USSR (1990)). Given this, then we still don't need a start year. If a country is succeded by another country, then it suffices to know the end year of the former country.

pabloarosado commented 1 year ago

Hi @danyx23,

Related to the last point, we are always assuming that, when harmonizing country names in a dataset, we are simply changing their names to our names, e.g. EU -> European Union (27) (disregarding region codes). This is what we currently do anyway. But we don't need to map their regions to our region codes. So, if a certain dataset has data for EU that changes definitions over time, then we could simply harmonize its name to be EU -> EU (changing over time).

danyx23 commented 1 year ago

@pabloarosado one argument for keeping owid code aligned with what we have right now is that these have made it to our urls - i.e. there are lots of URLs in the wild that have stuff like below (see country=):

https://ourworldindata.org/explorers/air-pollution?uniformYAxis=0&Pollutant=All+pollutants&Fuel=From+all+fuels+%28Total%29&Per+capita=false&country=USA~CHN~IND~GBR~DEU~GRC

We don't have to align those two but if we don't then we need to keep a mapping around and we should have a good reason to break this compatibility

pabloarosado commented 1 year ago

Thanks @danyx23. That's a good point. We could have a mapping for garden and then another switching from garden to grapher. But that's too annoying. So I suppose the best is to use the same kind of codes we were using so far (and also check that they make sense), so region_code will be the iso alpha 3 or an owid code, e.g. ITA and OWID_EUR.

pabloarosado commented 1 year ago

I have created this PR which respects almost everything we discussed above. For minor comments, I suggested handling them directly there, but for major suggestions (e.g. "remove this table") let's continue here.

The main difference with respect to what's described above is that I added a table called related, which is similar to members but including all those cases where the membership is unclear. The idea is that members will be used to create aggregates, but related will be used to perform checks, and raise warnings on possible overlaps (where, after having harmonized and created aggregates, the warnings can be turned off). Does it make sense? I keep changing my mind about this, so I'd appreciate any inputs, thanks!

pabloarosado commented 1 year ago

We now have a new regions dataset! A few smaller issues have emerged, but I'm closing this issue now.