vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Examples for M49 needed, or support for regional codes? #272

Closed courtiol closed 2 years ago

courtiol commented 3 years ago

From a quick look at the documentation, countrycode seems to support UN M49 codes:

https://github.com/vincentarelbundock/countrycode/blob/a3732f916bdf7973eeeaf3cecec34618151eb302/R/codelist.R#L31

UN M49 codes are documented here or here

Example: Screen Shot 2021-04-05 at 08 55 34

Based on the UN description above, I would have expected the following to output "Northern Africa":

> countrycode::countrycode("015", origin = "un", destination = "un.regionsub.name")
[1] NA
Warning message:
In countrycode::countrycode("015", origin = "un", destination = "un.regionsub.name") :
  Some values were not matched unambiguously: 015

By investigating the data actually used in the package, I now understand that the M49 codes must be those of countries (without the leading 0, for some reason), not those of the regions themselves:

> countrycode::countrycode("12", origin = "un", destination = "un.regionsub.name")
[1] "Northern Africa"

So perhaps a short note in the manual or an example could help. I could not find any example in the documentation dealing with this case, which I found quite counter-intuitive.

In retrospect the package and the function are called countrycode which should be clear enough, but I guess that my issue is related to #265.

Many thanks for this very useful package.

vincentarelbundock commented 3 years ago

Thanks for the report, @courtiol. You diagnosed the problem correctly. countrycode's default conversion dictionary uses countries as its basic unit of conversion; it only includes codes that are associated with a given country.

That said, you can supply an arbitrary conversion dictionary through the custom_dict argument. In the README you'll see an example with US states, but the same mechanism would work with region names as well. All you would need to do is supply a data.frame (maybe read from some CSV you found).
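A minimal sketch of the region-name idea, assuming a small hand-built data.frame instead of a CSV. The object `region_dict` and its column names `m49_code` and `m49_name` are made up for illustration; only `countrycode()` and its `custom_dict` argument come from the package.

```r
library(countrycode)

# Hypothetical dictionary of UN M49 region codes (illustrative subset).
# Codes are stored as character so that leading zeros are preserved.
region_dict <- data.frame(
  m49_code = c("002", "015", "202"),
  m49_name = c("Africa", "Northern Africa", "Sub-Saharan Africa"),
  stringsAsFactors = FALSE
)

# origin and destination refer to column names of the custom dictionary
region_name <- countrycode("015",
                           origin = "m49_code",
                           destination = "m49_name",
                           custom_dict = region_dict)
region_name
```

Because the dictionary stores the codes as strings, the input can be passed as "015" exactly as it appears in the UN listing.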

In the next release, there will also be a new countrycode_factory that makes it really easy for people to roll out their own functions: statecode, provincecode, regioncode or whatever.

I will try to include an example in the README eventually.

cjyetman commented 3 years ago

Since the data included in the package already has un.regionsub.code and un.regionsub.name, you could achieve this like...

library(countrycode)

custom_dict <- 
  unique(countrycode::codelist[, c("un.regionsub.code", "un.regionsub.name")])

countrycode(as.numeric("015"), origin = "un.regionsub.code", 
            destination = "un.regionsub.name", custom_dict = custom_dict)
#> [1] "Northern Africa"

You have to create a custom dictionary because countrycode() internally does not consider un.regionsub.code a valid origin code. I also coerced the input value to numeric so that it's in line with what the data has for un.regionsub.code.

cjyetman commented 3 years ago

@vincentarelbundock should we consider auto-coercing mismatched sourcevar and origin types?

vincentarelbundock commented 3 years ago

Yeah, that sounds useful. I don't have a general mechanism in mind (yet), but maybe we could leverage (and perhaps extend) the new dictionary attribute support.

vincentarelbundock commented 2 years ago

@courtiol thanks again for raising this issue. If you want to suggest a specific wording to include in the README or documentation, I will happily review and merge a pull request.

@cjyetman, I have included a sanity check with an informative error when the origin code is numeric but the user tries to convert a character vector (00b8c29924e2fbfbab95e363e2a3ae13d09774b). At the moment I feel that this is a better, more explicit solution than automagically converting inputs.

library(countrycode)
countrycode(2, "cown", "country.name")
#> [1] "United States"
countrycode("2", "cown", "country.name")
#> Error: To convert a `cown` code, `sourcevar` must be numeric.

cjyetman commented 2 years ago

> @cjyetman, I have included a sanity check with an informative error when the origin code is numeric but the user tries to convert a character vector (00b8c29). At the moment I feel that this is a better, more explicit solution than automagically converting inputs.

👍🏻 I think that's a reasonable solution.
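For readers wondering what such a check might look like, here is a rough base-R sketch. It is not the package's internal code: the helper `check_sourcevar_type()` and the toy dictionary `toy_dict` are hypothetical, and only the error wording mirrors the message quoted in this thread.

```r
# Hypothetical helper: refuse character input when the dictionary column
# used as origin is stored as numeric.
check_sourcevar_type <- function(sourcevar, origin, dictionary) {
  if (is.numeric(dictionary[[origin]]) && !is.numeric(sourcevar)) {
    stop(sprintf("To convert a `%s` code, `sourcevar` must be numeric.", origin),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Toy dictionary with a numeric code column (illustrative values only)
toy_dict <- data.frame(code = c(2, 20),
                       name = c("United States", "Canada"),
                       stringsAsFactors = FALSE)

# Numeric input passes silently
check_sourcevar_type(2, "code", toy_dict)

# Character input triggers the error; capture its message for inspection
msg <- tryCatch(check_sourcevar_type("2", "code", toy_dict),
                error = function(e) conditionMessage(e))
msg
```

Failing loudly keeps the type mismatch visible to the user, which is the trade-off discussed above against silently coercing the input.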

side note: I currently strongly believe that a code that is numeric but has no numeric meaning, e.g. one cannot expect to perform any meaningful mathematical operations on it, should be stored and used as a string, but as un.regionsub.code has been historically stored as a numeric and there's no strong argument for changing it now, I think what you have done is the best approach.

vincentarelbundock commented 2 years ago

> side note: I currently strongly believe that a code that is numeric but has no numeric meaning, e.g. one cannot expect to perform any meaningful mathematical operations on it, should be stored and used as a string, but as un.regionsub.code has been historically stored as a numeric and there's no strong argument for changing it now, I think what you have done is the best approach.

I had never thought about it this way, but it makes sense to me.

Of course, read.csv() and pretty much all the other file readers will read those as numerics by default, so we must live in a second-best world.
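To illustrate the point about file readers, here is a small base-R sketch. The CSV contents and the temporary file are made up for the example; `colClasses` is the standard `read.csv()` argument for overriding the type guess.

```r
# Hypothetical CSV contents, written to a temporary file for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(c("code,name", "002,Africa", "015,Northern Africa"), csv_path)

# Default behaviour: the code column is parsed as integer,
# so the leading zeros are lost
as_numeric <- read.csv(csv_path)
class(as_numeric$code)

# Forcing the column to character keeps the codes as written in the file
as_text <- read.csv(csv_path, colClasses = c(code = "character"))
as_text$code

# If the codes were already read as integers, they can be padded back
# to three digits
padded <- sprintf("%03d", as_numeric$code)
padded
```

Reading codes as character up front avoids the padding step, at the cost of having to specify the column type for every file.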

NilsEnevoldsen commented 2 years ago

> side note: I currently strongly believe that a code that is numeric but has no numeric meaning, e.g. one cannot expect to perform any meaningful mathematical operations on it, should be stored and used as a string

FWIW, I support this.

vincentarelbundock commented 2 years ago

If we feel super bold, we could eventually issue a warning to recommend this. But with backward compatibility and common use-cases in mind, I don't think we could enforce this. (And maybe a warning would just be annoying.)