owid / etl

A compute graph for loading and transforming OWID's data

https://docs.owid.io/projects/etl

MIT License

79 stars 21 forks source link

Redefine countries-regions dataset #779

Closed pabloarosado closed 1 year ago

pabloarosado commented 1 year ago

We need to redefine our current countries-regions dataset.

Requirements

We definitely need the list of current countries (which is a static list). We don't need to have all countries that have ever existed.
- Each current country is defined by their current geographical border.
We need a list of aliases of current countries, e.g. "United States" -> "USA", "US", ...
We need to know the current continent of each current country. This is also a static list.
- If we consider only continents, each country would be allocated to just one continent. However, if we include other regions, like EU(27) or OECD, and others, then a country can belong to multiple regions.
We need the population of the current country, and the population of that geographical region for each year in the past and projected into the future. We want this population regardless of whether the country existed a specific year. For example, we want the population of Russia (the geographical region of current Russia) even when it did not exist.
- We also need the population of historical regions. A priori, we could apply the same criterion as with current regions, and have the population (in the past and in the future) of the geographical region that it occupied. However, in practice, we may only need population for the years when the region existed.
We need to map the transitions between geographical regions. For example, we need to know that East Germany and West Germany became Germany in 1990, and that USSR became Armenia, Azerbaijan, ...
We may need to have a special way to identify former regions, like suffix "(former)" or prefix "Former". For example, we would have "USSR (former)" or "Former USSR". In the case of Sudan, we would need "Sudan", and "Sudan (former)" or "Former Sudan".
We need alternative definitions of regions, given by the data providers. For example, "North America (BP)".
- We may need the mapping of these regions to current regions, but it may not be always possible (they may not provide such explicit definitions).
- We may also need alternative definitions of countries. For example, some providers may include Zanzibar in Tanzania and others would not, or define China as "China (mainland)" and "China" (including Taiwan and Macau). In that case, it would be good to have, e.g. "Tanzania mainland (Institution)" and "Zanzibar (Institution)".
We may need different definitions of country codes.
We need income groups (although this already is a separate World Bank dataset).

Issues

How do we deal with transition years? For example, when does the USSR end and when does Russia start?
How do we deal with regions that change borders? For example, USSR is different in 1989 and 1990. We could define, e.g. "USSR (pre-1990)" and "USSR (1990)", but it may be an overkill.

Other considerations

How do we edit this data, manually or through tools (like harmonize)?
Should we consider sub-country regions and their dependencies, e.g. Bavaria < Germany?
Should this (or these) dataset(s) be special, like "reference" o "key_indicators", or just another normal (versioned) ETL dataset?
What sanity checks should we apply to this (or these) dataset(s)?
How do we create region aggregates? Keep in mind that:
- For example, Africa emissions may be the sum of a subset of African countries, but African population probably includes all countries. Therefore, dividing emissions by population would almost always lead to underestimates. So we may need to keep a list of countries involved in each region for each variable.
- Data providers may provide data for USSR and Russia in the same years. We have to ensure we do not count the same region multiple times when creating aggregates.

Related issues

Let's propose alternatives in the comments.

pabloarosado commented 1 year ago

A few general suggestions:

I think that having an owid namespace in ETL is not very useful. We should instead use more descriptive namespaces, like demography.
It's also not very useful to have table names key_indicators and reference. We should instead use more descriptive names.
We should keep versions for all datasets. It's problematic to have a "latest" population dataset, because, when it changes, it affects many downstream datasets. It should be up to us whether a dataset needs to have a version change or not (e.g. if it's a minor change, we may decide to keep the same version).

pabloarosado commented 1 year ago

Proposal 1

Table `region_definitions`

Contains the list of harmonized countries and regions, with their country codes. This dataset is static, so, even if it has a version, it will probably almost never change.

+-------------+--------------+---------------+-----------------+
| region      | owid_code    | region_type   | is_historical   |
+=============+==============+===============+=================+
| Afghanistan | AFG          | country       | False           |
+-------------+--------------+---------------+-----------------+
| France      | FRA          | country       | False           |
+-------------+--------------+---------------+-----------------+
| Russia      | RUS          | country       | False           |
+-------------+--------------+---------------+-----------------+
| Spain       | ESP          | country       | False           |
+-------------+--------------+---------------+-----------------+
| Europe      | OWID_EUR     | continent     | False           |
+-------------+--------------+---------------+-----------------+
| USSR        | OWID_USS     | country       | True            |
+-------------+--------------+---------------+-----------------+
| Andalusia   | OWID_ESP_AND | sub_country   | False           |
+-------------+--------------+---------------+-----------------+

Here I have included region_type and is_historical, but we could include other things, like ISO2, ISO3, etc., although I've never used most of them. We could also include special regions, e.g. "Southern hemisphere" (an entity in the dataset of temperature anomaly), or custom regions defined by institutions, e.g. "Europe (BP)", with code BP_EUR.

Table `region_aliases`

Contains region harmonized name and alias.

+---------------+-------------------+
| region        | alias             |
+===============+===================+
| United States | US                |
+---------------+-------------------+
| United States | USA               |
+---------------+-------------------+
| Italy         | Republic of Italy |
+---------------+-------------------+

Table `region_members`

Contains region harmonized name and a member of that region.

+---------------------+---------------+
| region              | member        |
+=====================+===============+
| North America       | United States |
+---------------------+---------------+
| Europe              | France        |
+---------------------+---------------+
| Europe              | Spain         |
+---------------------+---------------+
| European Union (27) | Spain         |
+---------------------+---------------+
| Europe              | Russia        |
+---------------------+---------------+
| Europe              | USSR          |
+---------------------+---------------+
| Europe              | East Germany  |
+---------------------+---------------+
| Spain               | Andalusia     |
+---------------------+---------------+

Note: Here we would not add "memberships" like ("USSR", "Armenia"), since we consider Armenia was not a part of the USSR, but instead did not exist until the USSR dissolved.

Table `region_transitions`

Contains the harmonized name of a region, the harmonized name of a successor, and the year when the transition occurred.

+--------------+-------------+--------+
| region       | successor   |   year |
+==============+=============+========+
| USSR         | Armenia     |   1991 |
+--------------+-------------+--------+
| USSR         | Azerbaijan  |   1991 |
+--------------+-------------+--------+
| USSR         | Russia      |   1991 |
+--------------+-------------+--------+
| East Germany | Germany     |   1990 |
+--------------+-------------+--------+
| West Germany | Germany     |   1990 |
+--------------+-------------+--------+

Note: We should assert that each historical region has only one transition year. For example, USSR would dissolve in 1991. If we wanted to have one USSR with Lithuania until 1990, and one without, we'd need to define two different USSR's (e.g. USSR (pre 1990) and USSR (1990)), but this may be an over-complication.

pabloarosado commented 1 year ago

To be more specific, we can distinguish two versions of the previous proposal:

Proposal 1a: We have different datasets, namely region_definitions, region_aliases, region_members, and region_transitions, each one with one table with the same name.
Proposal 1b: We have just one dataset, called regions, which contains different tables, namely definitions, aliases, members, and transitions.

I think I'd prefer Proposal 1b, although this implies we need to be a bit more careful deciding what a minor and a major update should be. For example, adding an alias should not imply a new version. But maybe adding a new historical transition would be.

pabloarosado commented 1 year ago

Proposal 2

All information (on definitions, aliases, members and historical transitions) is packed into just one dataset with one table, called simply regions.

+---------------+-------------+---------------+------------+--------------------------------+----------------------------+----------------------------------+
| country       | owid_code   | region_type   | end_year   | aliases                        | members                    | successors                       |
+===============+=============+===============+============+================================+============================+==================================+
| United States | USA         | country       | <NA>       | ['United States', 'US', 'USA'] | []                         | []                               |
+---------------+-------------+---------------+------------+--------------------------------+----------------------------+----------------------------------+
| Italy         | ITA         | country       | <NA>       | ['Republic of Italy']          | []                         | []                               |
+---------------+-------------+---------------+------------+--------------------------------+----------------------------+----------------------------------+
| USSR          | USS         | country       | 1991       | ['Soviet Union']               | []                         | ['Armenia', 'Azerbaijan', '...'] |
+---------------+-------------+---------------+------------+--------------------------------+----------------------------+----------------------------------+
| Europe        | OWID_EUR    | continent     | <NA>       | []                             | ['Spain', 'France', '...'] | []                               |
+---------------+-------------+---------------+------------+--------------------------------+----------------------------+----------------------------------+

This may be more convenient for maintenance (although, like in Proposal 1b it may require to be careful deciding what a minor and major update should be). Additionally, given that it could fully contain all columns in our current countries_regions.csv table, it would make the refactor a bit easier.

A downside of this, compared to Proposal 1 is that, if we add a sub-country region, e.g. "Andalusia", we also need to edit previous entries accordingly, e.g. "Spain", "Europe", "European Union" would now have "Andalusia" added to the list of members. On the other hand, in Proposal 1 we would just add a new row to region_members.

pabloarosado commented 1 year ago

Regardless of the chosen proposal for primary tables on region definitions, aliases, members and transitions, we would have:

Other primary tables

Like currently, we would also have a table for region_population and region_area. They would contain region harmonized name, year, and population/area. And it can also include estimates of certain regions into the past or future, like Russia going back before 1991, or World going into the future.

Other derived tables

Having just the previous definitions of primary tables would let us do all our usual operations, like country name harmonization, or region aggregates, "on the flight" (i.e. without needing to define derived tables anywhere).

For example, imagine we have a dataframe with rows for France, Spain, USSR, and Russia, for 1990, 1991, and 1992. Then if we wanted to build the aggregate for Europe, our function to create region aggregates could (from minimum requirements to nice-to-have):

Load that dataframe and the region_transitions table, and raise a warning saying that Historical region USSR overlaps with successor Russia on years [1990, 1991, 1992], and sum data for France, Spain and Russia for 1990, 1991, and 1992. For this, we don't even need to know the transition year (although it's good to keep it to check that regions are well-defined). The function would also ingest an optional dictionary (e.g. known_overlaps) that would check for expected overlaps, and ignore warnings for them (for each overlap, we may also need to know whether we want to add the contribution of both regions, e.g. for transition years, or just of one of them).
Same as 1, but also prioritise successors' data after the transition year, and historical region prior to that year, so it would add data for France, Spain and USSR for 1990 and 1991, and France, Spain and Russia for 1991.
Same as 1 and 2, but also load region_members and raise a warning if there's an overlap between country and sub-country regions (e.g. Spain and Andalusia). But this may be an over-complication, at least for now.
Output an additional dataframe that contains, for each region, year, and variable, the list of countries that were informed. This could be stored as an auxiliary table in the dataset, that would let us decide how to compute, e.g. per capita variables (dividing only by the population of informed countries for each variable).

lucasrodes commented 1 year ago

Hi @pabloarosado, thanks for adding so much value to the discussion with your comments and proposals.

Overall, I have a preference for proposal 1. I think it is more flexible and can adapt to many more use cases.

IMHO, we should use owid_code in all tables instead of the region names themselves. If we were to change some name in the future (e.g. Timor → Timor-Leste) this should be simple: only change one entry in region_definitions (and not in all tables).

Find some of my thoughts below.

On proposal 1

`region_definitions`

In this table, I think we should only have columns (or fields) that make sense for all "types of regions". All the ones that you specify make total sense, IMHO. But perhaps I'd leave out others, such as ISO-specific standards, etc.
On owid_code, I think we could spend some time thinking about how we create these:
- Would be nice if they were self-explanatory when a person reads them. That is, someone can see if that row refers to a country, a subdivision or something defined by an organisation (OWID or BP).
- For non-owid defined regions, I would use a prefix such as EXT. So for the example for "Europe (BP)" I would use something like EXT_BP_EUR.
We could have an additional column, something like "defined_by" that specifies who defined that region.
- In some cases, it could be a bit complex. Like, who defined the borders of India? I guess we would default to the UN.
- It's easier for cases like continents, where we sometimes use OWID's definitions or other org's (UN, WB, etc.). This could be cleaner than having the organisations hardcoded into the region's name.

+-------------+--------------+---------------+-----------------+------------+
| region      | owid_code    | region_type   | is_historical   | defined_by |
+=============+==============+===============+=================+============+
| Afghanistan | AFG          | country       | False           |     UN     |
+-------------+--------------+---------------+-----------------+------------+
| France      | FRA          | country       | False           |     UN     |
+-------------+--------------+---------------+-----------------+------------+
| Russia      | RUS          | country       | False           |     UN     |
+-------------+--------------+---------------+-----------------+------------+
| Spain       | ESP          | country       | False           |     UN     |
+-------------+--------------+---------------+-----------------+------------+
| Europe      | OWID_EUR     | continent     | False           |    OWID    |
+-------------+--------------+---------------+-----------------+------------+
| Europe      | EXT_BP_EUR   | continent     | False           |     BP     |
+-------------+--------------+---------------+-----------------+------------+
| USSR        | OWID_USS     | country       | True            |     UN     |
+-------------+--------------+---------------+-----------------+------------+
| Andalusia   | OWID_ESP_AND | sub_country   | False           |     UN     |
+-------------+--------------+---------------+-----------------+------------+

`region_aliases`

This table would be perfect as you present it.

`region_members`

This one is challenging, as it would be a many-to-many table. This does not make it necessarily a bad choice, just that people should be aware of this when working with it.

`region_transitions`

LGTM

additional tables

We could consider adding extra tables:

ISO codes: A list with ISO codes. Perhaps mapping ISO codes to OWID codes?
Region types: As long as we have a field called region_type in table region_definitions, I would have a table that lists all of these. Perhaps we could have codes (e.g. regionTypeId) and reference these instead of the names from region_definitions.
Region periods: A table with the regions' births and deaths. E.g. USSR from 1922 to 1991.

Additional comments

Overall, I agree that Proposal 1b makes more sense. As we change something in one table, we need to run sanity checks on others to ensure the whole set of tables is consistent. Therefore, keeping all these tables under the same dataset would be nice. It is easier and cleaner to ensure consistency across a set of tables within a dataset than across a set of datasets.

As I see with this structure, to obtain the list of countries for a continent for a specific year, we would need to:

Join table region_members with region_definitions to get the list of regions that ever belonged to a continent.
We need to join this with region_transitions to see which regions no longer belong in the continent.
Obtain the list per year per continent using python.

Right?

While it is true that I prefer this structure, we need solid and robust sanity checks to make sure that they are consistent.

JoeHasell commented 1 year ago

Hey @pabloarosado, one small thought to throw in.

It wasn’t clear to me exactly how ‘complete’ a mapping you have in mind with region_transitions.

But I just wanted to flag – having looked into it a couple of times – that this is a bit of a can of worms. Saying when a state existed and when it didn’t, when a state was actually a component of another state etc. is quite a fuzzy thing. In social science research, people rely on the efforts of researchers who stick their necks on the line to come up with a ‘state system’. Gledistch and Ward is one such system that gets used, but there are others. And they can disagree, or result in counter-intuitive classifications because they're trying to use some fixed definition. Bastian has also had the idea of coming up with OWID maintained set of historical boundaries. It's possibly something we could do. But I just wanted to flag that it would be quite a big research undertaking – the kind of thing that people publish in journal articles. And overall I am a bit pessimistic that this is a good use of our time.

Moreover, even if we were to come up with or adopt a ‘state system’, there is always the ambiguity that we do not know if the providers of a given dataset we are using have the same state system in mind. Or, more likely, if they have no particular state system in mind at all, or their data is a funny mix of different elements from various geographical boundaries.

Just as one example of the mess: the WID data on inequality that Pablo A is working on, for say Germany. To find out what ‘Germany’ means here you have to dig up the accompanying research paper (and there isn’t even a clear mapping of countries to research papers). There you find that their definition of Germany is using ‘prevailing borders’ (without defining what that means). I would say that's a good example of documentation. Often you wouldn't even get that much.

As I say, I'm not sure exactly what you have in mind with it, so maybe these thoughts aren't so relevant. Just wanted to flag the very messy nature of it in my experience, in case that's helpful context.

pabloarosado commented 1 year ago

Hi @lucasrodes thanks for the suggestions. I think I agree with you all your suggestions. What I'm not sure about is whether 1b or 2 is better (I agree that 1a is inferior for the reasons you said).

We can easily implement as many tests in 1b as in 2, since it's all going to be run on the same data step. I think that, if these were tables in a database, 1b would be much preferable. But in practice, we are talking about relatively small csv files, so it feels convenient to have all we need related to regions in just one place (aka Proposal 2). But making changes and reviewing them would be much clearer in 1b. So I'm not sure, it would be good to have some more opinions to decide.

pabloarosado commented 1 year ago

Hi @JoeHasell thanks for pointing out these issues. Please note that the region_transitions table, in principle, is only needed when creating region aggregates. Other than that, I don't think we'll be using it much, and as a general rule, we'll always adopt the criteria of the individual data providers.

In other words, we will always show data as given by the original source (for example, if they extend Russia or Germany to many years back before 1990, we'll show exactly that). But when we construct aggregated data for, e.g. "Europe", we want to stick to some definitions, because we need to ensure that:

We are not double-counting regions (e.g. USSR and Russia, or East/West Germany and Germany).
"Europe" means the same set of countries in all our charts on our site.

So, having a table with reasonable definitions of transitions and successor countries would be helpful, even if it's not totally historically accurate. The same applies when constructing aggregates for income groups. And, if a specific dataset has very unusual definitions and it's absolutely incompatible with our definitions, then we would simply not build aggregates for that dataset. Does it sound more reasonable now? Thanks.

danyx23 commented 1 year ago

Great write-up, thanks a lot! I think I lean towards 1b. In theory we could also consider using a small SQLite db for keeping the various tables in sync with each other "for free" (i.e. just with using foreign key constraints) but that is probably too different to what we have so far in the catalog.

I agree that for the region_transitions it makes a lot of sense to be very pragmatic and basically view this through the lens of "what data do we have where historic regions are important" and try to create a sensible solution for that. I would guess that for the EU this would also be useful.

I'll digest this a bit and then will probably have some more comments here or in an applicable call.

Marigold commented 1 year ago

I don't have anything creative to add. Just perhaps that having the definitions stored as YAML and then creating your tables on the fly might be more flexible (e.g. you could have "added" or "removed" members for periods) and easier to manage than CSV files. It could look something like this (data is most likely wrong).

- country: France
  iso: FRA

- country: Czechia
  iso: CZE
  aliases: [Czech Republic]

- country: Czechoslovakia
  iso: CSK
  periods:
    - years: 1918-1992
      members: [CZE, SVK]
  aliases: [Former Czechoslovakia]

- country: Russia
  iso: RUS
  aliases: [Russian Federation]
  periods:
    - years: 1922-1992
      members: [SU]
    - years: 1992-

- country: USSR
  iso: SU
  periods:
    - years: 1922-1991
      members: [CSK]

- country: Europe
  iso: OWID_EUR
  periods:
    - members: [ALA, ALB, AND, AUT, OWID_AUH, OWID_BAD, OWID_BAV, BLR, BEL, BIH, BGR, OWID_CIS, HRV, CZE, OWID_CZS, DNK, OWID_GDR, EST, FRO, FIN, FRA, DEU, GIB, GRC, GGY, OWID_HAN, OWID_HSE, OWID_HSG, HUN, ISL, IRL, IMN, ITA, JEY, OWID_KOS, LVA, LIE, LTU, LUX, MLT, OWID_MEC, OWID_MOD, MDA, MCO, MNE, NLD, MKD, NOR, OWID_PMA, POL, PRT, ROU, RUS, SMR, OWID_SAX, SRB, SVK, SVN, ESP, SJM, SWE, CHE, OWID_TUS, OWID_SIC, UKR, GBR, VAT, OWID_GFR, OWID_WRT, CYP, OWID_USS, OWID_SRM, OWID_TRS, OWID_YGS, OWID_SEK]

- country: European Union (27)
  iso: OWID_EU27
  aliases: [EU, EU-27, EU27]
  periods:
    - years: 1995-2003
      members: [AUT, BEL, BGR, HRV, CYP, CZE, DNK, EST, FIN, FRA, DEU, GRC, HUN, IRL, ITA, LVA, LTU, LUX, MLT, NLD, POL, PRT, ROU, SVK, SVN, ESP, SWE]
    - years: 2004-
      added_members: [CZE]

pabloarosado commented 1 year ago

Thanks everyone. I'll try to put together here all the previous suggestions (plus Max's comment saying that some country names are too long and make Marimekko charts worse).

Proposal 3

As @Marigold suggested, all data could be handled in a single (possibly big) yaml file adjacent to the data step. The generated dataset (called regions) in the demography namespace would contain the following tables:

MINOR UPDATE:

Changed region name USSR -> Former USSR, because all historical regions should start with "Former" (as required in this issue).
Renamed owid_code -> region_code, member -> member_code, successor -> successor_code.

Table `definitions`

Changes with respect to 1b:

Index is now region_code. They all start with owid, unless they are definitions of a different institution (e.g. bp_eur for BP's definition of Europe).
Added short_name, to be used in Marimekko charts, and possibly other places that are short of space. @danyx23 do you think this is a good solution?
Added defined_by which is the institution using a given definition.

+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| region_code  | name                         | short_name   | region_type   | is_historical   | defined_by   |
+==============+==============================+==============+===============+=================+==============+
| owid_afg     | Afghanistan                  | Afghanistan  | country       | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_fra     | France                       | France       | country       | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_rus     | Russia                       | Russia       | country       | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_esp     | Spain                        | Spain        | country       | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_eur     | Europe                       | Europe       | continent     | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_uss     | Former USSR                  | USSR         | country       | True            | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_esp_and | Andalusia                    | Andalusia    | sub_country   | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| bp_eur       | Europe (BP)                  | Europe (BP)  | continent     | False           | bp           |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+
| owid_drc     | Democratic Republic of Congo | DR Congo     | country       | False           | owid         |
+--------------+------------------------------+--------------+---------------+-----------------+--------------+

Table `aliases`

Changes:

Index is now region_code.
Since the official name of a region can change in definitions, we add the official name as one of the aliases.

+-------------+-------------------+
| region_code | alias             |
+=============+===================+
| owid_us     | United States     |
+-------------+-------------------+
| owid_us     | US                |
+-------------+-------------------+
| owid_us     | USA               |
+-------------+-------------------+
| owid_ita    | Italy             |
+-------------+-------------------+
| owid_ita    | Republic of Italy |
+-------------+-------------------+

Table `members`

Changes:

Index is now region_code.
Member is now given in terms of region codes.

+-------------+--------------+
| region_code | member_code  |
+=============+==============+
| owid_nam    | owid_usa     |
+-------------+--------------+
| owid_eur    | owid_fra     |
+-------------+--------------+
| owid_eur    | owid_esp     |
+-------------+--------------+
| owid_eu27   | owid_esp     |
+-------------+--------------+
| owid_eur    | owid_rus     |
+-------------+--------------+
| owid_eur    | owid_uss     |
+-------------+--------------+
| owid_eur    | owid_egr     |
+-------------+--------------+
| owid_esp    | owid_esp_cat |
+-------------+--------------+

Table `transitions`

Changes:

Index is now region_code.
Successor is given in terms of region codes.

+---------------+------------------+--------+
| region_code   | successor_code   |   year |
+===============+==================+========+
| owid_uss      | owid_arm         |   1991 |
+---------------+------------------+--------+
| owid_uss      | owid_aze         |   1991 |
+---------------+------------------+--------+
| owid_uss      | owid_rus         |   1991 |
+---------------+------------------+--------+
| owid_egr      | owid_ger         |   1990 |
+---------------+------------------+--------+
| owid_wgr      | owid_ger         |   1990 |
+---------------+------------------+--------+

Additional table `legacy`

We could add here all the codes (including wikidata urls) that are in the current countries_regions.csv file, although I don't know if we ever use most of them.

+-------------+--------------+--------------+
| region_code | iso_alpha2   | iso_alpha3   |
+=============+==============+==============+
| owid_ita    | IT           | ITA          |
+-------------+--------------+--------------+
| owid_uss    | SU           | USS          |
+-------------+--------------+--------------+

Comments

[x] @lucasrodes I haven't implemented your suggestion about region type ids. Do you have a strong reason why we should add that abstraction?
[x] @lucasrodes regarding region start and end year, if those are fixed quantities, then shouldn't they be added to definitions? I'm also not sure if we need them. Most of the start years will be left empty to begin with (it would be quite complicated to come up with a start year for all regions, in the same way that it's tricky to find transition years). Possibly what we have in transitions is just enough. Or can you think of a reason why we definitely need a start_year?

lucasrodes commented 1 year ago

LGTM!

Minor amends:

In table members, I'd change column name member to member_code.
In table transitions, I'd change column name successor to successor_code.
I think it can make sense to have legacy table. We could also have owid_code_legacy, which are the codes that we have been using so far (without owid_ for countries with an ISO code).

On your questions

@lucasrodes I haven't implemented your suggestion about region type ids. Do you have a strong reason why we should add that abstraction?

Not really. I thought it could be nice if we ever wanted to change region-type names. But that seems unlikely. No strong reason.

@lucasrodes regarding region start and end year, if those are fixed quantities, then shouldn't they be added to definitions? I'm also not sure if we need them. Most of the start years will be left empty to begin with (it would be quite complicated to come up with a start year for all regions, in the same way that it's tricky to find transition years). Possibly what we have in transitions is just enough. Or can you think of a reason why we definitely need a start_year?

I am not sure if I'd put it in the definitions table, as (i) several entities would have NaNs and (ii) a single entity could appear more than once (e.g. a country is born, then occupied for 100 years, and re-born?).

I think that this does not look super urgent, and I am probably biased because I am working on the History of War project (https://github.com/owid/owid-issues/issues/443). We can think a bit more on this on a separate issue and proceed with what you have proposed.

danyx23 commented 1 year ago

Short names make sense and a column like that is a good solution IMHO.

For the owid codes - for downstream users it might be nicer if we stay with using iso alpha-3 3 letter codes where our definition is the same as the iso alpha one, don't you think? Then you could join the bulk of our data easily and only special entities like country groupings, historic entities etc would have to have additional matching steps.

How should we go about creating owid codes for changing entities like the EU? Should we have two Eu definitions, like this?

OWID_EU_CURRENT - the "current" composition of the EU
OWID_EU_CHANGING - for every year membership can change (this means that members table would have to be aware of this as well)

Or should we have several definitions like OWID_EU_1973_1980, OWID_EU_1981_1985. Datasets that don't bother with the changing composition could then reference e.g. OWID_EU_2021. (ah, how do you deal with the latest bracket?)

We could also combine both and group these year spanning entities together over time into OWID_EU. Let's chat about this in the call.

pabloarosado commented 1 year ago

Hi @lucasrodes just a minor clarification on start years. If a region disappears and appears again, then the new one should be considered a different region. This is similar to the example I mentioned before about the USSR: Lithuania and a few countries left in 1990, and others left in 1991. We currently simplify the definition by saying that all countries left in 1991. But otherwise, we would need to define owid_uss_1922_1990 (named, e.g. USSR (pre 1990)) and owid_uss_1990_1991 (named, e.g. USSR (1990)). Given this, then we still don't need a start year. If a country is succeded by another country, then it suffices to know the end year of the former country.

pabloarosado commented 1 year ago

Hi @danyx23,

I agree that the three letters in our region code should coincide with ISO alpha 3 when possible (I made up most of the codes above ^^). But what I'm not sure about is whether we should use, e.g. ITA and OWID_EU, or OWID_ITA and OWID_EU. I think the latter option is more consistent (given that we could have also codes for other institutions, like BP_EUR). The only downside is that one can't merge our region codes with ISO codes directly, but this is an infrequent case (for which we have the additional codes table). Do you agree?
I think OWID_EU should be the current definition (so, no need to specify OWID_EU_CURRENT).
I think OWID_EU_CHANGING is a bit problematic (since it would be a different entity on different years, similar to the USSR in my example above). The solution could be indeed to have region codes owid_eu_1973_1980, owid_eu_1980_1985, etc., all of them with the same display name, that could be EU (changing over time). Does that make sense?

Related to the last point, we are always assuming that, when harmonizing country names in a dataset, we are simply changing their names to our names, e.g. EU -> European Union (27) (disregarding region codes). This is what we currently do anyway. But we don't need to map their regions to our region codes. So, if a certain dataset has data for EU that changes definitions over time, then we could simply harmonize its name to be EU -> EU (changing over time).

danyx23 commented 1 year ago

@pabloarosado one argument for keeping owid code aligned with what we have right now is that these have made it to our urls - i.e. there are lots of URLs in the wild that have stuff like below (see country=):

https://ourworldindata.org/explorers/air-pollution?uniformYAxis=0&Pollutant=All+pollutants&Fuel=From+all+fuels+%28Total%29&Per+capita=false&country=USA~CHN~IND~GBR~DEU~GRC

We don't have to align those two but if we don't then we need to keep a mapping around and we should have a good reason to break this compatibility

pabloarosado commented 1 year ago

Thanks @danyx23. That's a good point. We could have a mapping for garden and then another switching from garden to grapher. But that's too annoying. So I suppose the best is to use the same kind of codes we were using so far (and also check that they make sense), so region_code will be the iso alpha 3 or an owid code, e.g. ITA and OWID_EUR.

pabloarosado commented 1 year ago

I have created this PR which respects almost everything we discussed above. For minor comments, I suggested handling them directly there, but for major suggestions (e.g. "remove this table") let's continue here.

The main difference with respect to what's described above is that I added a table called related, which is similar to members but including all those cases where the membership is unclear. The idea is that members will be used to create aggregates, but related will be used to perform checks, and raise warnings on possible overlaps (where, after having harmonized and created aggregates, the warnings can be turned off). Does it make sense? I keep changing my mind about this, so I'd appreciate any inputs, thanks!

pabloarosado commented 1 year ago

We now have a new regions dataset! A few smaller issues have emerged, but I'm closing this issue now.

owid / etl

Redefine countries-regions dataset #779

Requirements

Issues

Other considerations

Related issues

Proposal 1

Table region_definitions

Table region_aliases

Table region_members

Table region_transitions

Proposal 2

Other primary tables

Other derived tables

On proposal 1

region_definitions

region_aliases

region_members

region_transitions

additional tables

Additional comments

Proposal 3

Table definitions

Table aliases

Table members

Table transitions

Additional table legacy

Comments

Table `region_definitions`

Table `region_aliases`

Table `region_members`

Table `region_transitions`

`region_definitions`

`region_aliases`

`region_members`

`region_transitions`

Table `definitions`

Table `aliases`

Table `members`

Table `transitions`

Additional table `legacy`