vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Saint Martin (French) is (hard to match | matched with Dutch) #279

Closed luispfonseca closed 3 years ago

luispfonseca commented 3 years ago

Thank you for your work on this package.

I found this issue. I don't have time to fix it at the moment, but this is something I could look into at a later time if you tell me what needs to be fixed.

library(countrycode)

packageVersion("countrycode")
#> [1] '1.2.0'

countryname(c("Sint Maarten", "Saint-Martin", "Saint Martin", "Saint Martin FR"))
#> [1] "Sint Maarten" "Sint Maarten" NA             NA

countrycode(c("Sint Maarten", "Saint-Martin", "Saint Martin", "Saint Martin FR"),
            origin = "country.name.en", destination = "country.name.en")
#> Warning in countrycode(c("Sint Maarten", "Saint-Martin", "Saint Martin", : Some values were not matched unambiguously: Saint Martin, Saint-Martin
#> [1] "Sint Maarten"               NA                          
#> [3] NA                           "Saint Martin (French part)"

Created on 2021-06-22 by the reprex package (v2.0.0)

NilsEnevoldsen commented 3 years ago

For the countryname case:

For the countrycode case:

Of interest: https://github.com/vincentarelbundock/countrycode/blob/b3c550e6056eb469520844578603871c29142393/tests/testthat/test-regex-special.R#L75-L106

vincentarelbundock commented 3 years ago

Thanks @NilsEnevoldsen for the clarification. Super useful, as always!

luispfonseca commented 3 years ago

Thank you for such a quick response!

I think one part of the issue is that I may have misunderstood the countryname function. In any case, it does seem inconsistent to me that "Saint-Martin" would be matched with "Sint Maarten" but "Saint Martin" would be unmatched.

I am not well versed in either geography or the criteria of countrycode, so to ensure I do not misunderstand the issue, please allow me to ask:

It appears in any case that this is desired behavior, so I think the issue can be closed. For now, I am just asking so I have a better understanding of the package, as I use it regularly. Feel free to direct me to documentation I may have missed.

Thank you for your work on this great package.

vincentarelbundock commented 3 years ago

I'm not sure it's possible to draw clean lines between the multiple forms of sovereign and quasi-sovereign entities out there. There are dozens of ongoing territorial disputes, and the UN can't even seem to resolve them! Who are we to think we can adjudicate? At some point, we have to concede that any conversion scheme in countrycode will be highly imperfect and sometimes inconsistent. So while your first bullet is definitely a consideration, I don't think it is necessarily dispositive.

From my perspective (not speaking for others), the more important issue is your second bullet: ambiguity. If "Saint Martin" could refer to the whole island - including both Sint Maarten and Saint Martin (French part) - but also refer to just the French part, then ambiguity arises. In those cases, I think it is "safer" for countrycode to return NA and to issue an explicit warning to that effect.

Users can easily use the nomatch or the custom_match arguments to fill in the missing value. It's a small additional burden on the user, but at least we guard against a potential problem.
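For readers who want to see what that looks like in practice, here is a minimal sketch of both arguments. It assumes the user has decided that "Saint Martin" in their own data means the French part; that mapping is an illustrative choice, not something countrycode decides for you. The variable names (x, fixed, kept) are made up for the example.

```r
library(countrycode)

x <- c("Sint Maarten", "Saint Martin")

# custom_match: a named vector. Names are values in the origin vector,
# values are what should appear in the destination vector. These entries
# override the built-in dictionary.
# Assumption for illustration: in this data set "Saint Martin" means the
# French part of the island.
fixed <- countrycode(
  x,
  origin = "country.name",
  destination = "country.name",
  custom_match = c("Saint Martin" = "Saint Martin (French part)")
)

# nomatch = NULL: where no match is found, fall back to the original
# input value instead of the default NA.
kept <- countrycode(
  x,
  origin = "country.name",
  destination = "country.name",
  nomatch = NULL
)
```

custom_match is the better fit when you know what an ambiguous label means in your data. nomatch = NULL only avoids losing the original string, so it still leaves you to resolve those rows later.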

luispfonseca commented 3 years ago

Makes sense to me. The only thing then I'd think could still warrant a potential change is the inconsistency in output between "Saint-Martin" and "Saint Martin" for the countryname function.

library(countrycode)

packageVersion("countrycode")
#> [1] '1.2.0'

countryname(c("Sint Maarten", "Saint-Martin", "Saint Martin"))
#> [1] "Sint Maarten" "Sint Maarten" NA

Thank you for taking the time, once again. I'll close this.

vincentarelbundock commented 3 years ago

Good catch!

The issue with countryname is that we use a massive set of name variations to do automagic conversion. This usually gives good results, but can sometimes produce ambiguous ones, like in this case. It is not realistic to audit this massive set of variations manually, so countryname will always remain inherently more “dangerous” than countrycode. For instance:

library(countrycode)

countryname(c("Sint Maarten", "Saint-Martin", "Saint Martin"))
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Saint Martin
#> [1] "Sint Maarten" "Sint Maarten" NA

countrycode(c("Sint Maarten", "Saint-Martin", "Saint Martin"),
  origin = "country.name",
  destination = "country.name")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Saint-Martin, Saint Martin
#> [1] "Sint Maarten" NA             NA

I added a warning about this in the countryname documentation:

countryname
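Given that countryname is the more permissive of the two functions, one way to guard against surprises is to run both and inspect the inputs on which they disagree. The following is only a sketch of that idea, not a feature of the package; the names loose, strict, and disagree are hypothetical.

```r
library(countrycode)

x <- c("Sint Maarten", "Saint-Martin", "Saint Martin")

# Permissive conversion based on the large set of name variations.
loose <- suppressWarnings(countryname(x))

# Stricter regex-based conversion to English country names.
strict <- suppressWarnings(
  countrycode(x, origin = "country.name", destination = "country.name")
)

# Inputs where the two results differ (treating NA vs NA as agreement)
# are candidates for manual review or a custom_match entry.
agree <- mapply(identical, loose, strict, USE.NAMES = FALSE)
disagree <- x[!agree]
print(disagree)
```

Which inputs end up in disagree depends on the installed version of countrycode, since the dictionaries and regexes change between releases.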