Closed guyabel closed 3 years ago
What is the relationship to un.regionsub.name and un.regionintermediate.name?
table(countrycode::codelist$un.regionsub.name)
#>
#>       Australia and New Zealand                    Central Asia
#>                               6                               5
#>                    Eastern Asia                  Eastern Europe
#>                               7                              10
#> Latin America and the Caribbean                       Melanesia
#>                              52                               5
#>                      Micronesia                 Northern Africa
#>                               8                               7
#>                Northern America                 Northern Europe
#>                               5                              16
#>                       Polynesia              South-eastern Asia
#>                              10                              11
#>                   Southern Asia                 Southern Europe
#>                               9                              16
#>              Sub-Saharan Africa                    Western Asia
#>                              53                              18
#>                  Western Europe
#>                               9
table(countrycode::codelist$un.regionintermediate.name)
#>
#>       Caribbean Central America Channel Islands  Eastern Africa   Middle Africa
#>              28               8               2              22               9
#>   South America Southern Africa  Western Africa
#>              16               5              17
table(countrycode::codelist$un.region.name)
#>
#>   Africa Americas     Asia   Europe  Oceania
#>       60       57       50       51       29
Created on 2020-11-13 by the reprex package (v0.3.0.9001)
Looks like the Population Division are taking un.regionsub.name in place of un.region.name for countries in the Americas (un.region.name == "Americas").
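One way to check this reading, as a minimal sketch: restrict codelist to the rows where un.region.name is "Americas" and tabulate the sub-region names there. It uses only the columns already shown in the reprex above; the object name americas is just for illustration, and the counts depend on the installed countrycode version, so no output is shown.

```r
library(countrycode)

# Keep only the rows whose un.region.name is "Americas"
# (which() drops rows where the region is NA)
americas <- codelist[which(codelist$un.region.name == "Americas"), ]

# Tabulate the sub-region names within that subset
table(americas$un.regionsub.name)

# Cross-tabulate sub-regions against intermediate regions to see how they nest
table(americas$un.regionsub.name,
      americas$un.regionintermediate.name,
      useNA = "ifany")
```

If the reading above is right, the first table should contain only the two sub-regions that the Population Division reports separately.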
Thanks for looking into this!
Since it looks like a simple variation on existing codes, I'm not 100% convinced we should include yet another column in the dictionary. For instance, one could easily create a custom dictionary and do this:
library(countrycode)

# Custom dictionary: sub-region name for the Americas, region name everywhere else
cd <- data.frame(
  iso3c = codelist$iso3c,
  name = ifelse(codelist$un.region.name == "Americas",
                codelist$un.regionsub.name,
                codelist$un.region.name)
)
countrycode(c("CAN", "BRA", "ARG"), "iso3c", "name", custom_dict = cd)
#> [1] "Northern America" "Latin America and the Caribbean"
#> [3] "Latin America and the Caribbean"
It's not very complicated to add a new code, so that cost isn't tremendously high, but I do see a downside to a proliferation of entries in the documentation and in the codelist data.frame.
I'm curious what you and @cjyetman think about this.
Agreed... region names/specifications are even less standardized than country names/codes, which means there are even more minor variations of them, and that's a rabbit hole we've tried to avoid getting sucked into.
As an example, we eventually added Eurostat codes in #93, but as I understood it, the argument was that even though the difference is minor (only Greece and UK with "custom" codes), datasets labeled with Eurostat codes are so common that adding a new code was worth it.
I don't know/think that this region specification is equally common or widespread, but I could be wrong.
We've made an admirable effort to facilitate using custom code systems and variations with the custom_dict and custom_match arguments. @vincentarelbundock's example code above, while maybe not perfectly obvious or intuitive to a new user, is, I believe, an easy, straightforward, and suitable solution for situations like this.
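For a handful of exceptions, custom_match may be enough on its own. A minimal sketch, assuming only a few known countries need overriding: custom_match takes a named vector whose names are origin codes and whose values are the outputs to return instead of the default match. The vector name northern and the choice of CAN and USA are illustrative, not a complete list of Northern America members.

```r
library(countrycode)

# Illustrative overrides only: names are origin codes (iso3c),
# values are the strings to return in place of the default match
northern <- c(CAN = "Northern America", USA = "Northern America")

res <- countrycode(c("CAN", "USA", "FRA"),
                   origin = "iso3c",
                   destination = "un.region.name",
                   custom_match = northern)
res
```

Codes not listed in custom_match fall back to the regular un.region.name value, so this only covers the Northern America side; the remaining countries in the Americas would still come back as "Americas".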
I would prefer a new entry in the codelist to give a one-line solution. I am unlikely to use a custom dictionary over reading in the Locations.xlsx and doing a dplyr::left_join(), as the latter is easier to explain in the labs for my class (we have already covered these functions).
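The join approach could look roughly like the sketch below. The lookup table here is a hand-typed, hypothetical stand-in for the relevant columns of Locations.xlsx (in practice it would be read from the spreadsheet), and the names wpp_regions, wpp_region, dat, and value are made up for illustration. The three region assignments are the ones shown in the custom_dict output earlier in this thread.

```r
library(dplyr)

# Hypothetical stand-in for the ISO3 and region-name columns of Locations.xlsx
wpp_regions <- data.frame(
  iso3c = c("CAN", "BRA", "ARG"),
  wpp_region = c("Northern America",
                 "Latin America and the Caribbean",
                 "Latin America and the Caribbean"),
  stringsAsFactors = FALSE
)

# Example data keyed by ISO3 code (made-up values)
dat <- data.frame(iso3c = c("ARG", "CAN"),
                  value = c(1, 2),
                  stringsAsFactors = FALSE)

# Attach the region name to each row by matching on iso3c
joined <- left_join(dat, wpp_regions, by = "iso3c")
joined
```

Rows in dat whose code is missing from the lookup would get NA in wpp_region, which makes gaps in the lookup easy to spot.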
I'm fairly confident that the data from the UN Population Division has a large user base, and the distinction between North America and the rest of the Americas is important for many socio-demographic analyses. I have no clue, though, how many of those users work with R, or what it costs to add an entry to codelist.
Alright. Since this can be handled with one-liners (either via custom_dict or left_join), I think I'll close this issue for now. Might revisit later, but given the easy solution, I won't treat this as high priority.
I know this might be disappointing, @guyabel, but I really appreciate you raising the issue anyway. Feedback is always useful, even if we don't always say "yes".
PS: Honestly, I kind of regret our earlier Eurostat precedent because can of worms, etc.
Would it be possible to add a dictionary for the UN Population Division region names and codes? These differ from the UN regions already in codelist$un.region.name, most notably in that they separate the Americas into Northern America and Latin America and the Caribbean. They have been doing this for at least the last 10 years in both their World Population Prospects (WPP) and International Migrant Stock data sets. Their codes are usually given in an Excel spreadsheet with each release of the WPP; see for example Locations.xlsx here: https://population.un.org/wpp/Download/Metadata/Documentation/ (column Q for the region names). I couldn't quite figure out from your README how to do this for regions, or when the data online is in an Excel spreadsheet.
Perhaps some of the other categories in that spreadsheet might also be of use for the countrycode package, such as the SDG regions and the UN and World Bank developmental groups?