vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

UN Population Division region names #253

Closed guyabel closed 3 years ago

guyabel commented 3 years ago

Would it be possible to add a dictionary for the UN Population Division region names and codes? These differ from the UN regions already in codelist$un.region.name; most notably, they separate the Americas into Northern America and Latin America and the Caribbean. They have been doing this for at least the last 10 years in both their World Population Prospects (WPP) and International Migrant Stock data sets. Their codes are usually given in an Excel spreadsheet with each release of the WPP; see, for example, Locations.xlsx here: https://population.un.org/wpp/Download/Metadata/Documentation/ (column Q holds the region names).

I couldn't quite figure out from your README how to do this for regions, or how to do it when the online data is in an Excel spreadsheet.

Perhaps some of the other categories in that spreadsheet might also be of use for the countrycode package, such as the SDG regions and the UN and World Bank developmental groups?

vincentarelbundock commented 3 years ago

What is the relationship to un.regionsub.name and un.regionintermediate.name?

table(countrycode::codelist$un.regionsub.name)
#> 
#>       Australia and New Zealand                    Central Asia 
#>                               6                               5 
#>                    Eastern Asia                  Eastern Europe 
#>                               7                              10 
#> Latin America and the Caribbean                       Melanesia 
#>                              52                               5 
#>                      Micronesia                 Northern Africa 
#>                               8                               7 
#>                Northern America                 Northern Europe 
#>                               5                              16 
#>                       Polynesia              South-eastern Asia 
#>                              10                              11 
#>                   Southern Asia                 Southern Europe 
#>                               9                              16 
#>              Sub-Saharan Africa                    Western Asia 
#>                              53                              18 
#>                  Western Europe 
#>                               9

table(countrycode::codelist$un.regionintermediate.name)
#> 
#>       Caribbean Central America Channel Islands  Eastern Africa   Middle Africa 
#>              28               8               2              22               9 
#>   South America Southern Africa  Western Africa 
#>              16               5              17

table(countrycode::codelist$un.region.name)
#> 
#>   Africa Americas     Asia   Europe  Oceania 
#>       60       57       50       51       29

Created on 2020-11-13 by the reprex package (v0.3.0.9001)

guyabel commented 3 years ago

It looks like the Population Division uses un.regionsub.name in place of un.region.name for countries in the Americas (un.region.name == "Americas").
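One way to check this against the dictionary is to tabulate the sub-regions for the rows where un.region.name is "Americas". The snippet below is only a sketch: the variable name `americas` is made up for illustration, and the counts depend on the installed version of countrycode. In the tables above, the 57 "Americas" entries match the 52 "Latin America and the Caribbean" plus 5 "Northern America" entries.

```r
library(countrycode)

# Keep only dictionary rows whose UN region is "Americas"
# (subset() silently drops rows where the condition is NA)
americas <- subset(codelist, un.region.name == "Americas")

# Tabulate the UN sub-regions that occur within the Americas
table(americas$un.regionsub.name)
```

If only "Northern America" and "Latin America and the Caribbean" show up, the Population Division regions can be derived from the two existing columns.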

vincentarelbundock commented 3 years ago

Thanks for looking into this!

Since it looks like a simple variation on existing codes, I'm not 100% convinced we should include yet another column in the dictionary. For instance, one could easily create a custom dictionary and do this:

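library(countrycode)

# Custom dictionary: ISO3 code plus a region name that uses the
# UN sub-region for the Americas and the UN region everywhere else
cd <- data.frame(
  iso3c = codelist$iso3c,
  name  = ifelse(codelist$un.region.name == "Americas",
                 codelist$un.regionsub.name,
                 codelist$un.region.name)
)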

countrycode(c("CAN", "BRA", "ARG"), "iso3c", "name", custom_dict=cd)
#> [1] "Northern America"                "Latin America and the Caribbean"
#> [3] "Latin America and the Caribbean"

It's not very complicated to add a new code, so that cost isn't tremendously high, but I do see a downside to the proliferation of entries in the documentation and the codelist data.frame.

I'm curious what you and @cjyetman think about this.

cjyetman commented 3 years ago

Agreed... region names/specifications are even less standardized than country names/codes, which means there are even more minor variations of them, and that's a rabbit hole we've tried to avoid getting sucked into.

As an example, we eventually added Eurostat codes in #93, but I understood the argument for that was that even though the difference is minor (only Greece and UK with "custom" codes), datasets labeled with Eurostat codes are so common, that adding a new code was worth it.

I don't know/think that this region specification is equally common or widespread, but I could be wrong.

We've made an admirable effort to facilitate using custom code systems and variations with the custom_dict and custom_match arguments. I believe @vincentarelbundock's example code above, while maybe not perfectly obvious or intuitive to a new user, is an easy, straightforward, and suitable solution for situations like this.
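For comparison, a custom_match version might look like the sketch below. It assumes custom_match takes a named vector whose names are origin codes and whose values override the default destination; the names `americas` and `pop_div_override` are invented for illustration and are not part of the package.

```r
library(countrycode)

# Dictionary rows in the Americas that have a usable ISO3 code
americas <- codelist[codelist$un.region.name %in% "Americas" &
                       !is.na(codelist$iso3c), ]

# Named vector: names are origin codes (iso3c),
# values are the sub-region names that should replace "Americas"
pop_div_override <- setNames(americas$un.regionsub.name, americas$iso3c)

# Countries outside the Americas keep their regular UN region
regions <- countrycode(c("CAN", "BRA", "FRA"), "iso3c", "un.region.name",
                       custom_match = pop_div_override)
regions
```

Compared with custom_dict, this keeps the built-in "un.region.name" destination and overrides only the countries that differ.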

guyabel commented 3 years ago

I would prefer a new entry in the codelist to give a one line solution. I am unlikely to use a custom dictionary over reading in the Locations.xlsx and doing a dplyr::left_join(), as the latter is easier to explain in the labs for my class (we have already covered these functions).
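A rough sketch of the left_join() route is below. Reading Locations.xlsx is left out because its column layout is not described in this thread, so `wpp_regions` is a hand-typed, hypothetical stand-in for the region columns of that file, and `my_data` is invented example data.

```r
library(dplyr)

# Hypothetical lookup table; in practice these rows would be read
# from Locations.xlsx rather than typed in by hand
wpp_regions <- data.frame(
  iso3c      = c("CAN", "BRA", "ARG"),
  wpp_region = c("Northern America",
                 "Latin America and the Caribbean",
                 "Latin America and the Caribbean")
)

# Invented example data keyed by ISO3 code
my_data <- data.frame(
  iso3c = c("CAN", "BRA", "ARG", "FRA"),
  value = 1:4
)

# Attach the region to each row; codes missing from the lookup get NA
joined <- left_join(my_data, wpp_regions, by = "iso3c")
joined
```

Rows with no match in the lookup (here "FRA") come back with NA in wpp_region, so the lookup has to cover every country in the data.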

I'm fairly confident that the data from the UN Population Division has a large user base, and the distinction between North America and the rest of the Americas is important for many socio-demographic analyses. No clue how many of those users work with R, though, or what the cost of adding an entry to codelist would be.

vincentarelbundock commented 3 years ago

Alright. Since this can be handled with one-liners (either via custom_dict or left_join), I think I'll close this issue for now. Might revisit later, but given the easy solution, I won't treat this as high priority.

I know this might be disappointing, @guyabel, but I really appreciate you raising the issue anyway. Feedback is always useful, even if we don't always say "yes".

PS: Honestly, I kind of regret our earlier Eurostat precedent because can of worms, etc.