vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Russia regions get misclassified as iso3c: RUS in WID data #360

Closed morrisseyj closed 2 days ago

morrisseyj commented 3 days ago

Thanks for maintaining this extremely useful package.

I have run into an issue when using countrycode() on the World Inequality Database (https://wid.world/data/) data.

Two regional categories reported in the WID data are:

  1. "Other Russia & Central Asia"
  2. "Russia & Central Asia"

This is in addition to country labels:

  1. "the Russian Federation"
  2. "the USSR"

countrycode() returns all values for Russia as follows:

countrycode(c("Other Russia & Central Asia", "Russia & Central Asia", "the Russian Federation", "the USSR"), 
            "country.name", 
            "iso3c")

Returns: [1] "RUS" "RUS" "RUS" "RUS"

This is what I would expect for "the USSR" and "the Russian Federation", since dates usually separate these two and it's useful to categorize them together in a time series. For the other two, however, I would expect NA, so that I can filter and join data without mixing regional and country data.
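A possible stopgap on the user side, assuming the regional labels are known ahead of time, is to convert everything and then blank out the codes for those labels. The `wid_regions` vector is a hypothetical helper that lists only the two labels named above; the real WID file contains more regions.

```r
library(countrycode)

labels <- c("Other Russia & Central Asia", "Russia & Central Asia",
            "the Russian Federation", "the USSR")

# Hypothetical helper: regional labels that should not receive a country code
wid_regions <- c("Other Russia & Central Asia", "Russia & Central Asia")

codes <- countrycode(labels, "country.name", "iso3c")

# Overwrite the codes of the regional labels with NA
codes[labels %in% wid_regions] <- NA_character_
```

This leaves the country rows untouched and only masks the labels listed in `wid_regions`, so the list has to be maintained by hand.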

You can find the country codes for the WID data here: https://wid.world/codes-dictionary/#country-code

You can get the .csv for the country codes by:

I am not sure of the best fix for this. The WID data includes regional reporting, which causes the issue I highlighted above. You could handle this by filtering before selecting:

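library(dplyr)  # provides %>%, filter() and select()

# Keep only rows whose region field is non-empty, then keep the code and name columns
read.csv('~/R/WID_data/wid_all_data/WID_countries.csv', sep = ';') %>%
  filter(region != "") %>%
  select(alpha2, shortname)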

I note that the code book for the country data states: "The two-letter country codes used in WID mostly follow the ISO 3166-1 alpha-2 standard. The list has however been amended to include world regions, country subregions, former countries and countries not officially included in the standard." I am not sure whether this warrants the WID data getting its own origin codelist? It's certainly an impressive data set.

Let me know if there is a preferred way to handle this. I'd be happy to look into the process for contributing to this project, either by creating a codelist or by amending the regex search to provide what I think is the anticipated behavior: returning NA for regional references. In the latter case, my concern would be introducing a regex change that breaks something else.

vincentarelbundock commented 3 days ago

Thanks for reporting.

This is a tricky issue. The regular expressions were designed to convert country names, not region names. This allows us to make some assumptions, but it can create problems for unintended applications like this one.

I don't think we want to add a new code specifically for this.

You could design a custom dictionary and then use that in the future. That might be the cleanest way forward. See the documentation.
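A minimal sketch of the custom-dictionary route, using the `custom_dict` argument of `countrycode()`. The `wid_dict` data frame is hypothetical and hand-built for illustration; in practice it would be derived from the WID country file. Mapping both labels to "RUS" is an assumption that follows the grouping described in the report above. As far as I understand the documentation, origin values are matched exactly (not by regex) when a custom dictionary is supplied, so the regional labels should find no match and come back as NA.

```r
library(countrycode)

# Hypothetical dictionary: one row per WID country label, no regional labels
wid_dict <- data.frame(
  shortname = c("the Russian Federation", "the USSR"),
  iso3c     = c("RUS", "RUS"),
  stringsAsFactors = FALSE
)

labels <- c("Other Russia & Central Asia", "Russia & Central Asia",
            "the Russian Federation", "the USSR")

# origin and destination name columns of the custom dictionary;
# warn = FALSE silences the warning about unmatched (regional) labels
codes <- countrycode(labels,
                     origin = "shortname",
                     destination = "iso3c",
                     custom_dict = wid_dict,
                     warn = FALSE)
```

The trade-off is that every country label has to appear in the dictionary, since nothing falls back to the built-in regular expressions.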

I guess it might be possible to make the Russia regex less aggressive, but I'd be very wary of doing that, for fear of generating new false negatives.

Tricky!

morrisseyj commented 2 days ago

Ok, thanks very much.