vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Inconsistency in the iso codes in `codelist_panel`? #317

Closed smoser11 closed 1 year ago

smoser11 commented 1 year ago

I don’t know if this is an ‘issue’ so much as it is a few questions and a possible inconsistency. The short version is: how are codelist_panel and codelist created? The former is a panel (with country.name.en and year uniquely identifying each row), the latter has no temporal information. Further, there are countries identified in codelist that don’t appear in codelist_panel.

Here are two sub-questions, and one observation that I believe demonstrates an inconsistency.

  1. How are missing values in codelist_panel arrived at? In some cases it seems obvious: e.g. Polity IV and COW provide country × year data. But other cases are not obvious (e.g. Gleditsch & Ward codes, or ISO codes). In particular, how are the missing values for e.g. iso3c arrived at? In fact, how can there even be missing values over time in the ISO codes? From https://www.iso.org/obp/ui/#search/code/ a country either does have an ‘officially assigned code’ (e.g. iso2c) at a particular time or it doesn’t. (Though I note that this list can change over time, e.g. there are ‘Formerly used’ codes.)

  2. I think there is an inconsistency in the iso3c codes. Here is a MWE.

    ## a MWE
    rm(list=ls())
    library(countrycode)
    library(dplyr)
    library(ggplot2)
    library(panelView)  # install.packages("panelView")
    data("codelist_panel")
    ## Sudan(SDN)/ South Sudan(SSD) example, both in iso3c
    codelist_panel %>% filter(iso3c == 'SSD' |  iso3c == 'SDN') %>% 
      panelview( iso3c ~1, index = c("country.name.en", "year"),  type = 'miss', axis.adjust = TRUE)
    ![SSiso](https://user-images.githubusercontent.com/3680484/188210242-ae6bbdc9-935a-4aee-b2da-e3cd1add5828.png)

    South Sudan has missing values prior to 2011 because it did not exist as a separate country, so the top part of the graph makes sense. But we see an iso3c code for Sudan in every year. This does not make sense, by the same logic as before: Sudan didn’t obtain independence until 1956, but has values for iso3c going back to 1900. I note that not all other country codes (CCs) suffer from this. For example, cown does not seem to:

    # Sudan(SDN)/ South Sudan(SSD) example: rows selected by iso3c, missingness plotted for cown
    codelist_panel %>% filter(iso3c == 'SSD' |  iso3c == 'SDN') %>% 
        panelview( cown ~1, index = c("country.name.en", "year"),  type = 'miss', axis.adjust = TRUE)
    ![SScow](https://user-images.githubusercontent.com/3680484/188210353-653acb71-f419-418f-864c-c2b00b2b07e0.png)

    In this example both series only run from 1956 to present, with the values of cown being missing for South Sudan prior to its independence in 2011.

  3. codelist_panel is not a balanced panel – which is perfectly fine. But my question is: how are the start and end dates for a country determined? For example, Afghanistan has rows for the entire time-span (1789-2021), and (of course) some values are missing (e.g. p4n until 1800). But the series for Albania only starts in 1912. How are these ‘start dates’ (and end dates) determined? My guess is that the first year a country gets a row in codelist_panel is the minimum year for which any of the CCs used in countrycode is not missing. Is this correct?

Here is the general question:

How does one (or even can one?) create a panel of country codes such as codelist_panel? This panel contains an impressive number of country codes from different sources, in many different formats. Some CCs are already panels, e.g. PolityIV. Some CCs are ‘dynamic’ in that at any point in time the country codes are static, but the list changes over time (like ISO codes that can be ‘transitioned’ out of use, or IMF codes that are listed monthly and hence have the potential to change). Some are just a static list of country codes, such as Gleditsch & Ward codes (this might not be technically correct but hopefully the general point is clear). How can CCs that treat time differently in their coding be reconciled in a single panel, and how is this done here?

I’m super-sorry if this has been answered/addressed elsewhere. I did search the GitHub and read up as much as I could, but I am still very confused about the strange patterns of missingness for several (most) country codes. Thanks in advance!

vincentarelbundock commented 1 year ago

Thanks for your interest in countrycode.

The entire build code is public: https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/build.R

As you noted, some organizations publish country-year codes, while others publish cross-sectional codes. The codelist_panel includes rows for every country-year unit which is included by at least one of our source organizations. Then, we left-join the cross-sectional codes.

In the case of Sudan, the V-Dem organization includes country-years since 1900, so those country-years are included in the panel. See the country-year dataset here: https://www.v-dem.net/vdemds.html
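The two steps described above can be sketched with toy data. This is only an illustration under assumptions: the data frames `vdem_toy`, `cow_toy`, and `cs` are hypothetical stand-ins with placeholder values, and the code is not taken from build.R.

```r
# Toy illustration only: hypothetical sources, not the actual build.R code.
# Two panel (country-year) sources with different coverage for Sudan.
vdem_toy <- data.frame(country = "Sudan", year = 1900:1902, vdem = 33,
                       stringsAsFactors = FALSE)
cow_toy  <- data.frame(country = "Sudan", year = 1956:1958, cown = 625,
                       stringsAsFactors = FALSE)

# Full outer join: keep every country-year covered by at least one source.
pan <- merge(vdem_toy, cow_toy, by = c("country", "year"), all = TRUE)

# Cross-sectional source: one code per country, no time dimension.
cs <- data.frame(country = "Sudan", iso3c = "SDN", stringsAsFactors = FALSE)

# Left join: the cross-sectional code is copied onto every row of the panel.
panel <- merge(pan, cs, by = "country", all.x = TRUE)
panel <- panel[order(panel$year), ]
print(panel)
```

In this toy panel, iso3c is filled in for the early rows contributed by the V-Dem-like source, while cown is missing there, which mirrors the pattern in the two plots above.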

cjyetman commented 1 year ago

The answer of "how" is precisely defined here https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/build.R

I'll let @vincentarelbundock detail any conceptual decisions made in there.

There is potentially more than one year relevant to any given country code:

As you've pointed out, some country codes do not care about time, they simply assign a code to a country, so it's questionable whether one should consider a code from that set for a given country to be valid always, only when one believes that country should be seen as independent, or otherwise.

vincentarelbundock commented 1 year ago

great timing.

smoser11 commented 1 year ago

Thanks very much @vincentarelbundock and @cjyetman !! Not only for your contributions, but for your super-speedy reply! After reading build.R in some detail, I think I have a much firmer grip on what you all are doing in this project. Thank you for that. I do have some follow-up questions, mostly regarding some design choices.

  1. When merging the panel country codes (CCs) - in the pan object -- with the cross-sectional CCs -- in the cs object in lines 122-126 of build.R, each country-year gets a value of the cross-sectional CCs if and only if that country-year appears in pan. That is, the C-S CCs have the same pattern of missingness as it does in the panel. Why is this? It can lead to some strange (at least to me) results, for example the eurostat CC for Algeria is missing from 1831 - 1899 but present before and after. I didn't see the logic here. Alternatively, the C-S CCs could be used in all years. But this has to do with how one thinks the C-S CCs should be 'stretched' into a panel.
  2. When determining the country.name.en values in lines 112-119 in build.R: It appears the English name for a country is given preference first to 'cldr.name.en' then to 'iso.name.en', etc. Why is this? My hunch is that it has to do with the degree of missingness of these variables(?). Am I correct that a different ordering of priority would (could?) give rise to different values of country.name.en?
  3. Lines 96-106 in build.R appear to take the last year a country code (CC) is given for a country and change some of the values (e.g. p4n, vdem, etc.). Why are the last (per each country's series) values made missing? In particular, why is the vdem code for Czechoslovakia made missing in 1992? It appears to be changed from the vdem code of 157 to NA, if I am reading this right (?) I think the answer is that these are manual fixes for some codes. For example, while the vdem country_name, country_text_id, and country_id are the same for CZE, histname changes from Czechoslovakia in 1991 to Czech Republic in 1992.
  4. Lastly, why are the panel CCs 'padded out' using ExtendCoverage() to 2020 (line 74)? After all, there really aren't COW codes past 2016, simply because the latest version of the State Membership System (https://correlatesofwar.org/data-sets/state-system-membership/) only goes up to 2016.

My apologies if this is not the appropriate forum for asking about these. And thank you again (and again!) for all your work and for this extremely interesting project! Thanks also for making all the 'build' and 'dictionary' material open-source so that others can fork the git and customize e.g. codelist_panel.

vincentarelbundock commented 1 year ago

> 1. When merging the panel country codes (CCs) - in the `pan` object -- with the cross-sectional CCs -- in the `cs` object in lines 122-126 of `build.R`, each country-year gets a value of the cross-sectional CCs if and only if that country-year appears in `pan`. That is, the C-S CCs have the same pattern of missingness as it does in the panel. Why is this?

The original idea was that codelist_panel should include all country-years covered by at least one organization, and treat all other country-years as non-existent. Also, if users want a truly rectangular dataset, it is trivial to create one: `expand.grid(iso3c = codelist$iso3c, year = 1900:2024)`
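A slightly more defensive version of that one-liner, as a minimal sketch: `make_grid()` is a hypothetical helper (not part of countrycode) that drops missing and repeated codes before expanding, since a code column may contain `NA` for units that a given standard does not cover.

```r
# Hypothetical helper (not part of countrycode): build a balanced
# code-by-year grid, dropping missing and duplicated codes first.
make_grid <- function(codes, years) {
  codes <- unique(codes[!is.na(codes)])
  expand.grid(code = codes, year = years, stringsAsFactors = FALSE)
}

# Toy input; with the package loaded one could pass codelist$iso3c instead.
grid <- make_grid(c("SDN", "SSD", NA, "SDN"), 1900:2024)
dim(grid)
```

Codes from codelist_panel can then be left-joined onto the grid, leaving `NA` in country-years that no source covers.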

> 2. When determining the `country.name.en` values in lines 112-119 in `build.R`: It appears the English name for a country is given preference first to 'cldr.name.en' then to 'iso.name.en', etc. Why is this? My hunch is that it has to do with the degree of missingness of these variables(?). Am I correct that a different ordering of `priority` would (could?) give rise to different values of `country.name.en`?

Yes. country.name.en should never be used to merge anything, because it is not a formal standardized column. It is a convenience entry that combines other country names in an arbitrary order of priority, to make sure that we have at least one column where all units have an English name.
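To illustrate why the order of priority matters, here is a small sketch with made-up values. The helper `first_non_missing()` and the toy data frame are assumptions for illustration, not the code in build.R.

```r
# Toy illustration: column names mirror those mentioned above, values are made up.
names_toy <- data.frame(
  cldr.name.en = c("Sudan", NA, NA),
  iso.name.en  = c("Sudan (the)", "Czechoslovakia", NA),
  cow.name     = c("Sudan", "Czechoslovakia", "Baden"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: take the first non-missing name in priority order.
first_non_missing <- function(df, cols) {
  out <- rep(NA_character_, nrow(df))
  for (col in cols) {
    fill <- is.na(out)            # rows still lacking a name
    out[fill] <- df[[col]][fill]  # fill them from the next column
  }
  out
}

a <- first_non_missing(names_toy, c("cldr.name.en", "iso.name.en", "cow.name"))
b <- first_non_missing(names_toy, c("iso.name.en", "cldr.name.en", "cow.name"))
print(a)
print(b)
```

The two orderings agree wherever only one source has a name, but differ for the first row, which is why the resulting column is unsuitable as a merge key.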

> 3. Lines 96-106 in `build.R` appear to take the last year a country code (CC) is given for a country and change some of the values (e.g. `p4n`, `vdem`, etc.). Why are the last (per each country's series) values made missing? In particular, why is the vdem code for Czechoslovakia made missing in 1992? It appears to be changed from the vdem code of 157 to `NA`, if I am reading this right (?) I _think_ the answer is that these are manual fixes for some codes. For example, while the vdem country_name, country_text_id, and country_id are the same for CZE, histname changes from Czechoslovakia in 1991 to Czech Republic in 1992.

This is a somewhat arbitrary choice, made in dictionary/get_vdem.R. The goal is to ensure that there are no duplicate country-years in codelist_panel. We tried to make these choices based on date of year when the geopolitical changes occur, but I'm sure there is room to improve. Feel free to open a separate issue if you would like to challenge one of the "tie-breakers" in a get_*.R file. Corrections are most welcome!
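For anyone who wants to look for such cases before proposing a change, a duplicate check along these lines may help. `find_duplicates()` is a hypothetical helper and the toy rows are invented; with the package loaded, the same call could be tried on codelist_panel with the keys `country.name.en` and `year`.

```r
# Hypothetical helper: return all rows whose key combination occurs more than once.
find_duplicates <- function(df, keys) {
  dup <- duplicated(df[keys]) | duplicated(df[keys], fromLast = TRUE)
  df[dup, , drop = FALSE]
}

# Toy rows (invented): one country-year appears twice.
toy <- data.frame(
  country.name.en = c("Czechia", "Czechia", "Slovakia"),
  year            = c(1993, 1993, 1993),
  stringsAsFactors = FALSE
)

dups <- find_duplicates(toy, c("country.name.en", "year"))
print(dups)
```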

> 4. Lastly, why are the panel CCs 'padded out' using `ExtendCoverage()` to 2020 (line 74)? After all, there really aren't COW codes past 2016, simply because the latest version of the State Membership System (https://correlatesofwar.org/data-sets/state-system-membership/) only goes up to 2016.

It's a balance between convenience and correctness. Many researchers who use CoW codes are likely to want to merge in post-2016 data, and (nearly?) all countries which existed in 2016 still exist today. 2020 is arbitrary and should be updated.
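As a rough sketch of what this kind of padding could look like, assuming it simply repeats the rows from a source's final year up to a target year: `pad_to_year()` below is a hypothetical function written for illustration, not the package's `ExtendCoverage()`, and the toy data are invented.

```r
# Hypothetical sketch, NOT the package's ExtendCoverage(): repeat the rows
# observed in the source's last year for every later year up to `last_year`.
pad_to_year <- function(df, last_year) {
  src_last <- max(df$year)
  if (src_last >= last_year) return(df)
  latest <- df[df$year == src_last, , drop = FALSE]
  extra_years <- (src_last + 1):last_year
  extra <- latest[rep(seq_len(nrow(latest)), each = length(extra_years)), ,
                  drop = FALSE]
  extra$year <- rep(extra_years, times = nrow(latest))
  out <- rbind(df, extra)
  rownames(out) <- NULL
  out
}

# Toy source ending in 2016; the unit that ended in 1992 is not extended.
cow_toy <- data.frame(cown = c(625, 625, 315), year = c(2015, 2016, 1992))
padded <- pad_to_year(cow_toy, 2020)
print(padded)
```

Only units still present in the source's last year are carried forward, which matches the reasoning that countries existing in 2016 most likely still exist.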

Again, feel free to propose specific changes if you feel they would improve the package.

vincentarelbundock commented 1 year ago

Closing now, as the conversation seems to have dried up without specific proposals. As I said above, feel free to open new issues with specific recommendations for change.

smoser11 commented 1 year ago

Many thanks! Great project and thanks again for your kind help and information.

Best, scott

On Sun, Sep 25, 2022 at 11:18 AM Vincent Arel-Bundock < @.***> wrote:

> Closing now, as the conversation seems to have dried up without specific proposals. As I said above, feel free to open new issues with specific recommendations for change.
