vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Virgin islands (USA) country code missing (VIR) #333

Open nellybiondi opened 1 year ago

cjyetman commented 1 year ago

Which version are you using?

library(countrycode)
packageVersion("countrycode")
#> [1] '1.5.0'
countrycode::countrycode("VIR", "iso3c", "country.name")
#> [1] "U.S. Virgin Islands"

nellybiondi commented 1 year ago

Same version, 1.5.0.

summarise() has grouped output by 'Country', 'Code', 'Year', 'Sex'. You can override using the .groups argument.
Warning message:
There were 2 warnings in mutate().
The first warning was:
ℹ In argument: Code = countrycode(Country, origin = "country.name", destination = "iso3c").
Caused by warning:
! Some values were not matched unambiguously: Rodrigues, Virgin Islands (USA)
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.

cjyetman commented 1 year ago

"Virgin Islands (USA)" is a particularly hard string to match without also interfering with the matching of "USA" to the United States (USA).

These are the variations of names for Virgin Islands that will work and are tested: https://github.com/vincentarelbundock/countrycode/blob/7164698aca00c1192ac7cb9bb14f8098435fe023/tests/testthat/data-known-name-variations.R#L167-L179

For example:

library(countrycode)
packageVersion("countrycode")
#> [1] '1.5.0'

country_names <-
  c(
    "U.S. Virgin Islands",
    "United States Virgin Islands",
    "US Virgin Islands",
    "U.S. Virgin Islands",
    "Virgin Islands, US",
    "Virgin Islands, U.S.",
    "Virgin Islands, (U.S.)",
    "Virgin Islands, (US)",
    "Virgin Islands US",
    "Virgin Islands U.S.",
    "Virgin Islands (U.S.)",
    "Virgin Islands (US)"
  )

countrycode(country_names, origin = "country.name", destination = "iso3c")
#>  [1] "VIR" "VIR" "VIR" "VIR" "VIR" "VIR" "VIR" "VIR" "VIR" "VIR" "VIR" "VIR"

One way around this, if your source data contains "Virgin Islands (USA)", is to use the custom_match argument, like so. The result is correct, but countrycode still emits a spurious warning because "Virgin Islands (USA)" matches two different countries (VIR and USA):

library(countrycode)
countrycode(
  sourcevar = "Virgin Islands (USA)", 
  origin = "country.name", 
  destination = "iso3c",
  custom_match = c("Virgin Islands (USA)" = "VIR")
)
#> Warning: Some strings were matched more than once, and therefore set to <NA> in the result: Virgin Islands (USA),VIR,USA
#> [1] "VIR"
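
Since the warning earlier in the thread also lists "Rodrigues", the same custom_match vector can carry more than one override. The following is only a sketch: the `countries` and `overrides` names are made up for illustration, and mapping Rodrigues to Mauritius ("MUS") is an assumption about what the analysis needs, not something countrycode decides for you.

```r
library(countrycode)

# Names taken from the warning earlier in the thread, plus one ordinary name.
countries <- c("Virgin Islands (USA)", "Rodrigues", "Canada")

# Named vector: names are the source strings, values are the desired codes.
# Assumption: Rodrigues should be coded as Mauritius ("MUS"); change or drop
# this entry if your data treats it differently.
overrides <- c(
  "Virgin Islands (USA)" = "VIR",
  "Rodrigues" = "MUS"
)

codes <- countrycode(
  sourcevar = countries,
  origin = "country.name",
  destination = "iso3c",
  custom_match = overrides
)
```

The spurious "matched more than once" warning for "Virgin Islands (USA)" should still be expected here, as in the single-value example above.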

Otherwise, you could modify your source data first, with something like...

library(dplyr)
library(countrycode)

source_data <- data.frame(country = c("Virgin Islands (USA)", "Canada", "United States"))

source_data %>% 
  mutate(country = case_when(
    country == "Virgin Islands (USA)" ~ "U.S. Virgin Islands",
    .default = country
  )) %>% 
  mutate(iso3c = countrycode(country, "country.name", "iso3c"))
#>               country iso3c
#> 1 U.S. Virgin Islands   VIR
#> 2              Canada   CAN
#> 3       United States   USA
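
To find out which names in a column need this kind of treatment before recoding anything, one option is to run countrycode with warn = FALSE and look at what comes back as NA. This is a rough sketch; the `country` vector is invented sample data and `unmatched` is just an illustrative name.

```r
library(countrycode)

# Invented sample data, including the two names from the warning above.
country <- c("Virgin Islands (USA)", "Rodrigues", "Canada", "United States")

# warn = FALSE silences the matching warnings; failed or ambiguous
# matches are still returned as NA.
codes <- countrycode(country, "country.name", "iso3c", warn = FALSE)

# Names that came back as NA need a recode or a custom_match entry.
unmatched <- unique(country[is.na(codes)])
```

Anything left in `unmatched` can then be fixed in the source data or passed through custom_match.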