vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Micronesia regexes #354

Open mattkerlogue opened 6 months ago

mattkerlogue commented 6 months ago

Related to #289, I've recently been working with a table that has Micronesia (the country) listed solely as "Micronesia" not "Federated States of Micronesia" and thus countrycode returns an NA value.

The discussion at #289 refers to making a distinction between the subregion and the country. However, on inspecting the codelist dataset, this distinction seems to be applied only in the English regex; the French, German and Italian regexes test only for the name of the subregion.

I've certainly seen datasets where the country is just referred to as Micronesia, but I've also seen it abbreviated as "FS Micronesia" or "F.S. Micronesia" which the current English regex would also miss. Moreover, country.name.de is simply a reference to the subregion "Mikronesien" rather than the full country name (e.g. "Mikronesien (Föderierten Staaten von)").

countrycode::codelist |>
  dplyr::filter(iso3c == "FSM") |>
  dplyr::select(
    country.name.en, country.name.fr, country.name.de, country.name.it,
    country.name.en.regex, country.name.fr.regex,
    country.name.de.regex, country.name.it.regex) |>
  dplyr::glimpse()

#>  Rows: 1
#>  Columns: 8
#>  $ country.name.en       <chr> "Micronesia (Federated States of)"
#>  $ country.name.fr       <chr> "Micronésie (États fédérés de)"
#>  $ country.name.de       <chr> "Mikronesien"
#>  $ country.name.it       <chr> NA
#>  $ country.name.en.regex <chr> "fed.*micronesia|micronesia.*fed"
#>  $ country.name.fr.regex <chr> "micron(é|e)sie"
#>  $ country.name.de.regex <chr> "mikronesien"
#>  $ country.name.it.regex <chr> "micronesia"
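
To make the gap concrete, here is a minimal base R sketch that runs the English regex from the output above against the name variants mentioned in this issue. It assumes case-insensitive, Perl-style matching, which may not be exactly how the package applies its regexes internally; the `variants` vector is just an illustration.

```r
# English regex copied from the codelist output above
en_regex <- "fed.*micronesia|micronesia.*fed"

# Name variants mentioned in this issue
variants <- c(
  "Micronesia",
  "FS Micronesia",
  "F.S. Micronesia",
  "Federated States of Micronesia",
  "Micronesia (Federated States of)"
)

# Assumption: case-insensitive, Perl-style matching
matched <- grepl(en_regex, variants, ignore.case = TRUE, perl = TRUE)
print(data.frame(variant = variants, matched = matched))
```

Only the two variants that contain "fed" match; the bare name and both abbreviated forms do not.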

In my personal experience it's rare that I've come across lists/situations which include continents/continental subregions alongside countries, and if they do I'd ordinarily remove those from a list before trying to use countrycode() on it. So it did surprise me that "Micronesia" didn't return a country code.

Given that "Micronesia" is the only geographic term that can so closely be attributed to either a country or a region, my expectation would be that it returns the country code rather than NA.

stefgehrig commented 4 months ago

This is a common issue for me as well, and I work around it by using a custom matching from "Micronesia" to "Micronesia (Federated States of)" in all my applications. If it doesn't create problems in other situations, the suggestion by @mattkerlogue would be an improvement for my own use of the package (and probably many others).

cjyetman commented 4 months ago

I would consider this a "bug" in the non-English regexes and try to fix that. I realize that solution would likely not be very satisfying to the OP, but at least the behavior would be consistent between languages.

I would also suggest using the custom_match arg to work around this.
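
A minimal sketch of that workaround, assuming the bare name should map to the ISO3 code FSM. The `countries` and `custom` vectors are illustrative; `custom_match` takes a named vector whose names are origin values and whose values are destination codes.

```r
library(countrycode)

# Illustrative input containing the bare name
countries <- c("Micronesia", "Fiji")

# Names are the origin values, values are the destination codes
custom <- c("Micronesia" = "FSM")

codes <- countrycode(
  countries,
  origin = "country.name",
  destination = "iso3c",
  custom_match = custom
)
print(codes)
```

Entries in `custom_match` take precedence over the default matching, so other names in the vector are still resolved by the usual regexes.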

NilsEnevoldsen commented 4 months ago

I don't have a strong opinion. Happy to defer to @cjyetman's opinion.

What similar situations do we have in English?

> countrycode::countrycode("Korea", "country.name", "iso3c")
[1] "KOR"
> countrycode::countrycode("Sudan", "country.name", "iso3c")
[1] "SDN"
> countrycode::countrycode("America", "country.name", "iso3c")
[1] NA
Warning message:
Some values were not matched unambiguously: America 
> countrycode::countrycode("Congo", "country.name", "iso3c")
[1] "COG"
> countrycode::countrycode("Macedonia", "country.name", "iso3c")
[1] "MKD"
> countrycode::countrycode("Cyprus", "country.name", "iso3c")
[1] "CYP"

None of these are exactly the same situation. Maybe "America" is a weakly similar example.

FWIW, the UNGEGN official short name is Federated States of Micronesia (the), same as the formal name.

NilsEnevoldsen commented 4 months ago

One alternate suggestion: we could add custom error messages for a couple of the uniquely troublesome cases. For example, a conversion from Micronesia as a country.name to anything else would still return NA, but would also suggest using the custom_match argument. I know some people don't like wordy error messages, but I think they can improve accessibility.
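
A rough base R sketch of what such a hint could look like. Everything here is hypothetical: `ambiguous_hints` and `hint_for_unmatched()` are illustrative names, not part of countrycode, and the wording of the hint is only a placeholder.

```r
# Hypothetical lookup of troublesome names and their hints (not in countrycode)
ambiguous_hints <- c(
  micronesia = paste(
    "may refer to the subregion or to the Federated States of Micronesia;",
    'consider custom_match = c("Micronesia" = "FSM")'
  )
)

# Hypothetical helper: build hint strings for unmatched names that are
# known to be ambiguous; returns character(0) when none apply
hint_for_unmatched <- function(unmatched) {
  key <- tolower(trimws(unmatched))
  hits <- key %in% names(ambiguous_hints)
  if (!any(hits)) {
    return(character(0))
  }
  unname(sprintf('"%s" %s', unmatched[hits], ambiguous_hints[key[hits]]))
}

# Example: only the ambiguous name produces a hint
hints <- hint_for_unmatched(c("Micronesia", "Fiji"))
for (h in hints) message(h)
```

In the package itself, something like this would presumably be appended to the existing warning about unmatched values rather than emitted separately.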

vincentarelbundock commented 4 months ago

Sorry for the delayed response.

I don't have a super strong view, but I guess I'd lean toward being stricter.

I personally like wordy error messages, and would be very happy to include that in a future version.

For transparency though, I'm not sure I'll get to it myself soon. But I'd be happy to review and merge a Pull Request if someone wants to implement it.