Open turbanisch opened 1 year ago
If you can come up with regexes, that would be lovely!
But have you tried the countryname function? If so, what problem did you run into?
Since {countrycode} does include CLDR names in Chinese, one could (rather easily) create a custom dictionary from the included codelist to achieve exact matching, though it will not do fancy regex matching, e.g.:
library(countrycode)
custom_dict <-
unique(countrycode::codelist[, c("cldr.short.zh", "country.name.en", "iso3c")])
countrycode(
"阿拉伯联合酋长国",
origin = "cldr.short.zh",
destination = "country.name.en",
custom_dict = custom_dict
)
#> [1] "United Arab Emirates"
countrycode(
"阿拉伯联合酋长国",
origin = "cldr.short.zh",
destination = "iso3c",
custom_dict = custom_dict
)
#> [1] "ARE"
But yes, fancy Chinese regexes would be very interesting!
Ah, good idea, @cjyetman! I think eventually it would be nice to have more lenient regex matching, because even data from Chinese authorities is messy: for some countries it may list the full country name, for others just an abbreviated one.
I had quickly discarded countryname, but apparently I just picked the wrong examples: neither Germany nor France has a Chinese version, and the one for China matches only the full country name (i.e. "People's Republic of China"):
library(tidyverse)
library(countrycode)
# no matches
countryname("德国")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 德国
#> [1] NA
countryname("中国")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 中国
#> [1] NA
# there are Chinese country names in the data but not for all countries
# some match the official country name and may be too strict
countrycode::countryname_dict %>%
as_tibble() %>%
filter(str_detect(country.name.alt, "\\p{script=Han}")) %>%
filter(country.name.en %in% c("Germany", "France", "China"))
#> # A tibble: 2 × 2
#> country.name.en country.name.alt
#> <chr> <chr>
#> 1 China 中华人民共和国
#> 2 China 中華人民共和國
Created on 2022-08-26 by the reprex package (v2.0.1)
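To illustrate what more lenient matching could mean here, a minimal base R sketch follows. The pattern is hypothetical and written only for illustration (it is not taken from countrycode or from any existing dictionary); it accepts both the abbreviated and the full simplified-Chinese name of China, which the exact lookup above does not.

```r
# Hypothetical pattern (an assumption for illustration, not part of countrycode):
# matches the short name and the full official name of China in simplified characters.
china_regex <- "^中(国|华人民共和国)$"

inputs <- c("中国", "中华人民共和国", "德国")

# grepl() returns one logical per input: TRUE where the pattern matches.
matches <- grepl(china_regex, inputs)
print(matches)
```

Only the two China variants should match; "德国" (Germany) would need its own pattern.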
I took a bit of a deep dive and came up with regular expressions for Chinese by scraping Wikipedia. I outlined the issue (in short: Chinese has many variants, not just simplified vs. traditional scripts but also regional differences) here and implemented a function as a proof of concept here.
Do let me know if you would like to incorporate the regular expressions into countrycode and I will try to come up with a PR! I assume this is how I would have to prepare the codes? https://github.com/vincentarelbundock/countrycode#adding-a-new-code
There is one issue though that you might want to handle differently than I did: my function converts the input into simplified characters before matching via regular expressions. I discuss alternative implementations in the README mentioned above (https://github.com/turbanisch/chinese-countryname-regex). In short, for maintenance reasons I would suggest adding only the regular expressions for simplified Chinese, and perhaps adding a note to the countrycode documentation telling users to apply a traditional-to-simplified conversion themselves as needed.
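For reference, a user-side conversion of this kind might look roughly like the sketch below. It assumes the stringi package and its ICU transform "Traditional-Simplified" (an assumption on my part, not necessarily what the proof-of-concept function uses), and it falls back on exact matching against the bundled CLDR names because the regexes are not part of countrycode yet.

```r
library(stringi)      # ICU-based transliteration (assumed to be installed)
library(countrycode)

# Exact-match dictionary built from the bundled codelist (simplified Chinese CLDR names).
zh_dict <- unique(countrycode::codelist[, c("cldr.short.zh", "iso3c")])
zh_dict <- zh_dict[!is.na(zh_dict$cldr.short.zh), ]

# Input in traditional characters, converted to simplified before matching.
traditional <- c("德國", "法國")
simplified <- stri_trans_general(traditional, "Traditional-Simplified")

countrycode(
  simplified,
  origin = "cldr.short.zh",
  destination = "iso3c",
  custom_dict = zh_dict
)
```

Note that a script conversion like this only handles character forms; regional naming differences would still need to be covered by the patterns themselves.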
@turbanisch this is really cool, thanks for all your effort so far
just a suggestion if/when you start building a PR... I would suggest following the model of adding known variants of each country to a data file for the tests, and then testing each one of them in a {testthat} test, as done here... https://github.com/vincentarelbundock/countrycode/blob/main/tests/testthat/data-known-name-variations.R https://github.com/vincentarelbundock/countrycode/blob/main/tests/testthat/test-known-name-variations.R
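A rough sketch of that kind of test is below. The variant list and the match_name() helper are hypothetical stand-ins (the real data file and matcher in the repository may be structured differently); the point is only the shape: one expectation per known variant.

```r
library(testthat)

# Hypothetical test data: each known variant mapped to the ISO3C code it should yield.
known_variants <- c(
  "中国" = "CHN",
  "中华人民共和国" = "CHN"
)

# Hypothetical stand-in for the real matcher, so the sketch runs on its own.
match_name <- function(x) {
  if (grepl("^中(国|华人民共和国)$", x)) "CHN" else NA_character_
}

test_that("known Chinese name variants are matched", {
  for (variant in names(known_variants)) {
    # `info` reports which variant failed if an expectation does not hold.
    expect_equal(match_name(variant), known_variants[[variant]], info = variant)
  }
})
```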
That all sounds great! Thanks for putting this together.
I won't be able to look at anything in the near term, but feel free to open a PR whenever you are ready, and I'll try to review when possible.
Also agree with the known variations tests. Those would be super useful.
Thanks a lot to both of you! I will see how far I get and prepare a PR in the coming days.
So far, there seems to be no way of converting country names from Chinese, not even by left-joining any of the dataframes that come with countrycode. I assume the reason for this is the lack of a corresponding set of regexes? If there is any interest, I would like to offer to come up with one!