vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Some countries may be missing from the database #306

Closed stitam closed 2 years ago

stitam commented 2 years ago

Hi,

Many thanks for developing this super useful package. While working with it, I noticed that Kosovo may be missing:

countrycode::countrycode("kosovo", "country.name", "iso2c")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: kosovo
#> [1] NA

Created on 2022-06-28 by the reprex package (v2.0.1)

As far as I know, the 2-character ISO code for Kosovo is XK.

vincentarelbundock commented 2 years ago

Thanks, but please see this issue raised by another user: https://github.com/vincentarelbundock/countrycode/issues/305

stitam commented 2 years ago

Thanks @vincentarelbundock for the quick reply. I see now that this issue is regarded as a wontfix. However, I am wondering whether these user-assigned codes could be added optionally, e.g. through an argument that specifies whether non-official, user-assigned entries should be used as well. I understand why Kosovo is tricky, but it would be great if the package could handle tricky situations without manual workarounds. As it stands, this makes developing functions and packages based on countrycode more difficult.

vincentarelbundock commented 2 years ago

I’d be happy to consider a specific feature request (e.g., with an actual dictionary of corner cases), but here are a few of the obstacles I anticipate:

  1. Who is the “user” in your “user-assigned” instance? How do we know if they are reliable? Do we have to make judgment calls ourselves? There may be easy cases, but the status of some entities is more controversial (Taiwan?). The position of countrycode is that we try to follow the codes created by the organizations as strictly as possible to let them make judgment calls.

  2. Do we have to find and maintain ad hoc dictionaries for every one of our 60+ codes? 2 or 3 letters, numeric, etc.?

In my view, a better way to deal with this is to explicitly define a “fallback” code. For instance, we know that the ECB officially designates Kosovo as “XK”. So we can ask countrycode to try “iso2c” first, and if that doesn’t work, fill in with “ecb”:

library(countrycode)

countrycode(
    "Kosovo",
    origin = "country.name",
    destination = c("iso2c", "ecb"),
    warn = FALSE)
#> [1] "XK"
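To make the fallback rule concrete, here is a minimal sketch that spells out the same logic in two separate lookups and compares it with the single call above. The input vector and the variable names (`primary`, `fallback`, `filled`, `one_call`) are assumptions made up for illustration; only the `destination = c("iso2c", "ecb")` behaviour comes from the comment above.

```r
library(countrycode)

# Hypothetical input: one country with an official ISO code, plus Kosovo
country_names <- c("France", "Kosovo")

# Step 1: strict ISO lookup (per this thread, Kosovo came back as NA here)
primary <- countrycode(country_names, "country.name", "iso2c", warn = FALSE)

# Step 2: ECB lookup, used only to fill gaps left by step 1
fallback <- countrycode(country_names, "country.name", "ecb", warn = FALSE)

# Keep the ISO result where it exists, otherwise take the ECB result
filled <- ifelse(is.na(primary), fallback, primary)

# The single-call form with an ordered vector of destinations
one_call <- countrycode(
    country_names,
    origin = "country.name",
    destination = c("iso2c", "ecb"),
    warn = FALSE)
```

The order of the destinations matters: the first scheme wins whenever it has a value, so official ISO codes are never overwritten by the fallback.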

cjyetman commented 2 years ago

As suggested in #305, you can also easily use the custom_match argument...

library(countrycode)
country_names <- c('Greece', 'United Kingdom', 'Kosovo', 'France')
countrycode(country_names, 
            origin = 'country.name', 
            destination = 'iso2c', 
            custom_match = c(`Kosovo` = 'XK'))
#> [1] "GR" "GB" "XK" "FR"
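For package code that calls countrycode in several places, the override can be kept in one spot. The following is only a sketch: `to_iso2c` and its `overrides` argument are hypothetical names, not part of countrycode, and the only override included is the one from the example above.

```r
library(countrycode)

# Hypothetical helper (not part of countrycode): converts country names to
# iso2c and applies an explicit, user-supplied table of overrides.
to_iso2c <- function(x, overrides = c("Kosovo" = "XK")) {
    countrycode(
        x,
        origin = "country.name",
        destination = "iso2c",
        custom_match = overrides)
}

codes <- to_iso2c(c("Greece", "Kosovo"))
```

Warnings are left on in this sketch, so inputs that remain unmatched are still reported. As far as I know, the names in `custom_match` are compared with the input values as written, so spelling variants such as "kosovo" would need their own entries.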

Or use a custom dictionary that includes your desired alternate/additional codes...

library(countrycode)
country_names <- c('Greece', 'United Kingdom', 'Kosovo', 'France')

iso2c_plus <- countrycode::codelist
iso2c_plus$iso2c[iso2c_plus$country.name.en == "Kosovo"] <- "XK"

countrycode(country_names, 
            origin = 'country.name.en.regex', 
            destination = 'iso2c', 
            custom_dict = iso2c_plus,
            origin_regex = TRUE)
#> [1] "GR" "GB" "XK" "FR"
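Before choosing a fallback scheme or patching the dictionary, it can help to check which schemes already carry a code for Kosovo in the built-in `codelist`. This is a rough sketch: the list of candidate columns is an assumption, and `intersect()` drops any that do not exist in the installed version.

```r
library(countrycode)

# Row of the built-in dictionary for Kosovo
kosovo <- codelist[which(codelist$country.name.en == "Kosovo"), ]

# Candidate schemes to inspect (assumed column names, filtered to those present)
candidates <- intersect(
    c("iso2c", "iso3c", "ecb", "eurostat", "wb"),
    names(codelist))

# Named character vector: one entry per scheme, NA where no code is defined
kosovo_codes <- vapply(
    candidates,
    function(col) as.character(kosovo[[col]][1]),
    character(1))

# Schemes that could serve as a fallback destination
available <- names(kosovo_codes)[!is.na(kosovo_codes)]
```

Any scheme listed in `available` can be appended to the `destination` vector as a fallback.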

stitam commented 2 years ago

Thank you for these suggestions. I think destination = c("iso2c", "ecb") is a perfect workaround: it does not require defining a custom dictionary, yet it provides a clear and unambiguous rule that resolves Kosovo. For me it is not a matter of judgment but of practicality: if Kosovo appears in my input data, I have to treat it as such in my analysis, and I want to do that as reproducibly as possible. Problem solved, thank you!