Closed davidsjoberg closed 4 years ago
Very nice. Here's one way this could potentially be integrated into countrycode.
1) Add a new function, countryname, with default arguments origin='country.name.en.regex' and destination='cldr.short.en'.
2) countryname first tries to convert using the regexes, and then fills in missing values using the list of known alternatives. This could be a two-step process, or even a single call to countrycode using the custom_match argument with our new named vector of variants.
3) If destination is not cldr.short.en, then we first convert to "clean" country names that we know countrycode can handle (again using the list of known alternatives), and then we call countrycode a second time to convert to iso3c or whatever.
This could be quite nice, because it would reduce the need for regexes to catch every single weird known variant. It would also require almost no new coding effort, since all the building blocks are here already.
Not sure if @davidsjoberg gets notifications unless I ping them.
OK, so here's a proof of concept. I've considerably expanded the idea by including all the CLDR country names, which means that we can now translate country names from any language (as long as they are "official" well-formed names in the respective languages). I stored the dictionary build code and the function in a Gist, which I load using source.
https://gist.github.com/vincentarelbundock/2e00c19c1972e73d708c7cb496bd00d2
url <- 'https://gist.githubusercontent.com/vincentarelbundock/2e00c19c1972e73d708c7cb496bd00d2/raw/15b9cabe96511521a1b40bb9b6d6a0c97662eebe/countryname.R'
source(url)
weird <- c('ジンバブエ', 'Afeganistãu', 'Barbadas', 'Sverige', 'UK', 'il-Georgia tan-Nofsinhar u l-Gżejjer Sandwich tan-Nofsinhar')
countryname(weird)
#> [1] "Zimbabwe"
#> [2] "Afghanistan"
#> [3] "Barbados"
#> [4] "Sweden"
#> [5] "UK"
#> [6] "South Georgia & South Sandwich Islands"
This is rough, but what do you think? (Also curious about @cjyetman's and @NilsEnevoldsen's opinions on this.)
Mixed feelings. IMHO these alternative names fall into two categories: English names and Non-English names.
English names should just be added to country.name.en.regex. (Indeed many already have been.)
Non-English names I'm unsure about.
Quick thoughts before I fall asleep...
This List of alternative country names is a cool resource. We should use this to build out our known name variations in our test suite.
Personally, I think it would be super cool if our regexes matched country names in any/all languages, though I'm uncertain how much of a performance concern that would be. I can't imagine a scenario where a user would be upset that countrycode gave a valid result for a country name that was non-English even though they didn't explicitly ask to match non-English variations.
Ideally, I would think this should work without any multi-stage process. It should just work, as countrycode always has.
To integrate these variations (language or otherwise) into our existing regexes, I see two paths: 1) convert them all into regexes as we've done in the past, which would probably be a big undertaking, or 2) add them as alternations as I've suggested before in #197, which would likely be a lot easier, but ultimately somewhat less flexible in terms of catching unknown variations of these variations.
And one quick caution: that Wikipedia page includes a lot of things, some labeled as "ambiguous", some as "former" names, some as "initialism", etc. I'm not sure how to easily assess all of them and whether or not they should be included as valid variations, especially the ones in non-Latin scripts, whose validity I can't possibly judge.
All good points.
I agree that to the extent possible, known English variations should be caught by our regexes. We should also exploit that cool wikipedia list to build up our test suite. Those two things should be priority number 1.
That said, the possibility of converting names in many languages seems super cool to me. I was playing with some Eurobarometer data where countries are encoded in local languages, and the crappy function above did a nice job of converting most of them. That's quite nice to have out of the box.
I do agree with the caution raised above w.r.t. the wikipedia list. For instance, it's probably not desirable to automatically convert "Rhodesia" to "Zimbabwe" without any warning. I also empathize with the concern about our inability to check other languages ourselves. In that respect, the CLDR names seem less problematic, since presumably there is some quality assurance there.
With respect to proposal https://github.com/vincentarelbundock/countrycode/issues/197, I think we should probably go ahead for the most difficult cases. We know the countries which have been pain points since the start. It's probably better to be a bit conservative and more explicit with those.
FWIW I did a quick benchmark of the proposed "US Virgin Islands" regex against the one we currently use. The longer, more explicit version was about 5 times faster on a large dataset.
But maybe we don't have to go overboard and banish dots entirely, so that:
"United States Virgin Islands|US Virgin Islands|U.S. Virgin Islands|Virgin Islands US|Virgin Islands U.S.|Virgin Islands, US|Virgin Islands, U.S."
simplifies to:
"United States Virgin Islands|U.S. Virgin Islands|Virgin Islands. U.S."
Nice discussion!
I actually included all non-English names but didn't include them in my example since reprex can't handle weird characters. I think non-English names are a nice feature, but they should probably be dominated by English variations, i.e. if a name is ambiguous then the English version should be used. If it is unique it doesn't hurt to add that feature.
I also think the ambiguous alternative names should be excluded. Maybe even exclude all variations that you do not understand, the concern @cjyetman raised.
But I want to raise a few points. First, countrycode was actually more powerful than I knew. For example:
> countrycode::countrycode("gret britain", 'country.name', 'iso3c')
[1] "GBR"
> countrycode::countrycode("Uk.", 'country.name', 'iso3c')
[1] "GBR"
>
It correctly caught two weirdish country names, and actually solved my original problem. However, I think it was hard to understand this behaviour from the documentation. For example, country.name is not a column in countrycode::codelist. A suggestion would be to have country.name as the default in the countrycode function. That would help the user (like me) who just tries to join data frames with country names and doesn't know anything about ISO standards etc. It seems like too steep a learning curve for the issue. The wise who know what they have and what they want should be able to specify it.
Another issue is countrycode speed. It's probably the regex, which is very computationally heavy. For example, I made a microbenchmark with countrycode::countrycode and simplecountries::simple_country_name:
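library(microbenchmark)
library(simplecountries)  # provides simple_country_name()

# Benchmark input: the English country names shipped with countrycode
x <- countrycode::codelist$country.name.en
microbenchmark(
  countrycode::countrycode(x, 'country.name', 'country.name'),
  countrycode::countrycode(x, 'country.name.en', 'country.name.en'),
  simple_country_name(x),
  times = 10
)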
My function is about 2000 (!) times faster, even when the input, origin and destination are the same. I think a simple lookup table should be used and regex should only be used when absolutely necessary. I could be wrong here since I haven't read your source code.
To conclude, I think it would be awesome if countrycode had a "guess" as default, much like country.name in the example above. Also, I think you should use a banal lookup table in the backend and only turn to regex for non-matches to greatly improve efficiency. Also, add some of the alternative country names from Wikipedia that are used but might not be in any formal country name definitions.
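The "lookup table first, regex only for non-matches" idea can be sketched in base R without touching countrycode at all. Everything below is a toy assumption made up for illustration: the lookup vector, the two regexes and the function name lookup_then_regex are hypothetical, not the data or code of either package.

```r
# Hypothetical toy lookup table: exact variant -> clean name
lookup <- c(
  "Sverige"       = "Sweden",
  "UK"            = "United Kingdom",
  "Great Britain" = "United Kingdom"
)

# Hypothetical toy regexes, one per clean name
regexes <- c(
  "Sweden"         = "swed",
  "United Kingdom" = "united.?kingdom|brit"
)

lookup_then_regex <- function(x) {
  x <- as.character(x)
  # Pass 1: exact lookup; values without an entry come back as NA
  out <- unname(lookup[x])
  # Pass 2: try the regexes only on values that are still unmatched
  todo <- which(is.na(out) & !is.na(x))
  for (i in todo) {
    hits <- vapply(regexes, grepl, logical(1), x = x[i], ignore.case = TRUE)
    hit <- names(regexes)[hits]
    # Accept only an unambiguous match; otherwise leave NA
    if (length(hit) == 1) out[i] <- hit
  }
  out
}

result <- lookup_then_regex(c("Sverige", "gret britain", "UK", "Atlantis"))
result
#> [1] "Sweden"         "United Kingdom" "United Kingdom" NA
```

The exact lookup is a single vectorised index operation, so the comparatively expensive regex pass runs only on the leftovers ("gret britain" and "Atlantis" here).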
Generally, I like the idea of having origin = "country.name" as a default, but I think we have had that discussion before and decided against it. One unfortunate result of doing that would be that users would eventually want/expect to be able to do something like countrycode(df$country, "iso3c"), but this would not work unless they explicitly named the destination parameter (i.e. countrycode(df$country, destination = "iso3c")) because of the existing order of the parameters. The only way around that would be to reorder the parameters, which would easily break lots of old code, so it's a non-starter.
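The parameter-order problem follows from R's positional argument matching, and a toy function is enough to show it. The function convert below is a hypothetical stand-in that only mimics the order sourcevar, origin, destination and gives origin an assumed default; it is not countrycode itself and does no conversion.

```r
# Hypothetical stand-in with the same parameter order as countrycode();
# it only reports how its arguments were bound.
convert <- function(sourcevar, origin = "country.name", destination) {
  if (missing(destination)) {
    return(paste0("origin = '", origin, "', destination is missing"))
  }
  paste0("origin = '", origin, "', destination = '", destination, "'")
}

# The second positional argument is bound to origin, not destination
positional <- convert("Sweden", "iso3c")
positional
#> [1] "origin = 'iso3c', destination is missing"

# Naming the argument is the only way to skip over origin
named <- convert("Sweden", destination = "iso3c")
named
#> [1] "origin = 'country.name', destination = 'iso3c'"
```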
Yes, I see your point. It is probably better to have a new function, then, that removes the origin parameter so that you can do something like country_name(df$country, "iso3c"), like you said.
And the background for why country.name is not a column in countrycode::codelist...
There are currently two sets of regexes, one for English (country.name.en.regex) and one for German (country.name.de.regex). country.name is sort of an "alias" for country.name.en.regex to maintain backward compatibility. This infrastructure allows us to add regex sets for other languages etc. In fact, the use case being discussed here could become something like a column added to countrycode::codelist named country.name.alllanguages.regex, which would contain regexes that match country names in as many languages as possible, and then the user would still have the normal behavior, but they could use something like this to match country names in any language: countrycode(df$country, origin = "country.name.alllanguages.regex", destination = "iso3c")
Cool.
I'm not very enthusiastic about the prospect of changing default values of arguments or their order. But I think I see a consensus emerging: a new function, countryname, whose goal is to magically convert country names from any language to any code.
Does that work?
Yes. That sounds very nice! Maybe obvious, but countryname has a default destination of country.name?
Why limit to CLDR? You know more about the different origins/destinations, but the more the merrier, provided they are unambiguous?
Yes, "country.name" would be the default destination.
Oh sorry, all the other names we have in codelist would make it: ISO, CoW, UN, etc. I thought we had come to an agreement that Wikipedia should be excluded because it includes many problematic cases, and because we don't have the language skills to distinguish.
Also, even if "Republic of Albania" (in Wiki but not codelist) is missing from the lookup table, our regexes will catch it.
OK, sounds good. But the more you can add to the lookup table, the faster the function will be.
But it makes sense not to include all Wikipedia alternatives since many of them are weird. I hadn't heard all of the alternative names of Sweden...
I made an alternate version of countryname that does exact matching first using custom_dict and then runs a second pass through the default regex on any unmatched values. @davidsjoberg is correct that doing it that way has a significant speed advantage (not that I doubted that). Might have to do a third pass though, unfortunately, to enable a selectable destination code since the custom dictionary only has one destination code.
library(countrycode)
library(microbenchmark)
source('https://gist.githubusercontent.com/vincentarelbundock/2e00c19c1972e73d708c7cb496bd00d2/raw/15b9cabe96511521a1b40bb9b6d6a0c97662eebe/countryname.R')
countryname_cust_match <-
function(sourcevar,
origin = 'country.name.en',
destination = 'cldr.short.en') {
out <-
countrycode(sourcevar = sourcevar,
origin = origin,
destination = destination,
custom_match = alternatives,
warn = FALSE)
return(out)
}
alternatives_cust_dict <-
tibble(country.name = names(alternatives),
cldr.short.en = alternatives) %>%
filter(!duplicated(country.name)) # "لتونی"matches both Latvia and Lithuania!?!
countryname_cust_dict <-
function(sourcevar) {
out <-
countrycode(sourcevar = sourcevar,
origin = 'country.name',
destination = 'cldr.short.en',
custom_dict = alternatives_cust_dict,
warn = FALSE)
out[is.na(out)] <-
countrycode(sourcevar = sourcevar[is.na(out)],
origin = 'country.name.en',
destination = 'cldr.short.en',
warn = FALSE)
return(out)
}
weird <- c('ジンバブエ', 'Afeganistãu', 'Barbadas', 'Sverige', 'UK', 'il-Georgia tan-Nofsinhar u l-Gżejjer Sandwich tan-Nofsinhar')
countryname_cust_match(weird)
#> [1] "Zimbabwe"
#> [2] "Afghanistan"
#> [3] "Barbados"
#> [4] "Sweden"
#> [5] "UK"
#> [6] "South Georgia & South Sandwich Islands"
countryname_cust_dict(weird)
#> [1] "Zimbabwe"
#> [2] "Afghanistan"
#> [3] "Barbados"
#> [4] "Sweden"
#> [5] "UK"
#> [6] "South Georgia & South Sandwich Islands"
x <- countrycode::codelist$country.name.en
microbenchmark(
countryname_cust_match(x),
countryname_cust_dict(x),
times = 20
)
#> Unit: milliseconds
#> expr min lq mean median uq
#> countryname_cust_match(x) 888.86483 902.83125 920.65200 908.82574 923.27087
#> countryname_cust_dict(x) 70.64385 73.73317 76.73232 77.20547 79.36352
#> max neval
#> 1040.08927 20
#> 85.15852 20
identical(countryname_cust_match(x), countryname_cust_dict(x))
#> [1] TRUE
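On the duplicated key noted in the code comment above: filter(!duplicated(country.name)) keeps the first row for a repeated name, so which country wins depends on row order. A more conservative option, sketched here in base R, is to list the ambiguous keys and drop all of their rows. The vector alts and its entries are hypothetical stand-ins for the real alternatives vector (names are variants, values are clean names).

```r
# Hypothetical stand-in for the named vector of alternatives
alts <- c(
  "Sverige"   = "Sweden",
  "Schweden"  = "Sweden",
  "variant_x" = "Latvia",
  "variant_x" = "Lithuania"
)

# Variant spellings that appear more than once
dup_keys <- unique(names(alts)[duplicated(names(alts))])

# Show which clean names each ambiguous spelling points to
ambiguous <- alts[names(alts) %in% dup_keys]
split(unname(ambiguous), names(ambiguous))

# Drop every row for an ambiguous spelling instead of keeping the first
alts_clean <- alts[!names(alts) %in% dup_keys]
alts_clean
```

Inspecting the list of ambiguous spellings by hand before dropping them would also show whether the duplication is a data error or a genuine clash between languages.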
Actually, there's a "bug" in @vincentarelbundock's version of countryname... because the custom_match overrides every valid regex match with a value that doesn't fit the non-default destination code...
source('https://gist.githubusercontent.com/vincentarelbundock/2e00c19c1972e73d708c7cb496bd00d2/raw/15b9cabe96511521a1b40bb9b6d6a0c97662eebe/countryname.R')
weird <- c('ジンバブエ', 'Afeganistãu', 'Barbadas', 'Sverige', 'UK', 'il-Georgia tan-Nofsinhar u l-Gżejjer Sandwich tan-Nofsinhar')
countryname(weird)
#> [1] "Zimbabwe"
#> [2] "Afghanistan"
#> [3] "Barbados"
#> [4] "Sweden"
#> [5] "UK"
#> [6] "South Georgia & South Sandwich Islands"
countryname(weird, destination = 'iso3c')
#> [1] "Zimbabwe"
#> [2] "Afghanistan"
#> [3] "Barbados"
#> [4] "Sweden"
#> [5] "GBR"
#> [6] "South Georgia & South Sandwich Islands"
I would suggest something like this... the third pass through countrycode only occurs if a non-default destination code is chosen, and it doesn't have much impact on the speed because it also forces an exact matching...
source('https://gist.githubusercontent.com/vincentarelbundock/2e00c19c1972e73d708c7cb496bd00d2/raw/15b9cabe96511521a1b40bb9b6d6a0c97662eebe/countryname.R')
alternatives_cust_dict <-
tibble(country.name = names(alternatives),
cldr.short.en = alternatives) %>%
filter(!duplicated(country.name)) # "لتونی"matches both Latvia and Lithuania!?!
countryname <-
function(sourcevar, destination = 'cldr.short.en', warn = FALSE) {
out <-
countrycode(sourcevar = sourcevar,
origin = 'country.name',
destination = 'cldr.short.en',
custom_dict = alternatives_cust_dict,
warn = FALSE)
out[is.na(out)] <-
countrycode(sourcevar = sourcevar[is.na(out)],
origin = 'country.name.en',
destination = 'cldr.short.en',
warn = warn)
if (destination != 'cldr.short.en') {
out <-
countrycode(sourcevar = out,
origin = 'cldr.short.en',
destination = destination,
custom_dict = countrycode::codelist,
warn = warn)
}
return(out)
}
weird <- c('ジンバブエ', 'Afeganistãu', 'Barbadas', 'Sverige', 'UK', 'il-Georgia tan-Nofsinhar u l-Gżejjer Sandwich tan-Nofsinhar')
countryname(weird)
#> [1] "Zimbabwe"
#> [2] "Afghanistan"
#> [3] "Barbados"
#> [4] "Sweden"
#> [5] "UK"
#> [6] "South Georgia & South Sandwich Islands"
countryname(weird, destination = 'iso3c')
#> [1] "ZWE" "AFG" "BRB" "SWE" "GBR" "SGS"
library(microbenchmark)
x <- countrycode::codelist$country.name.en
microbenchmark(
countryname(x),
countryname(x, destination = 'iso3c'),
times = 20
)
#> Unit: milliseconds
#> expr min lq mean median
#> countryname(x) 70.35332 71.76669 75.99029 75.63757
#> countryname(x, destination = "iso3c") 70.05191 74.09016 77.45305 77.18706
#> uq max neval
#> 77.95850 90.27649 20
#> 80.20842 86.50942 20
head(countryname(x))
#> [1] "Afghanistan" "Åland Islands" "Albania" "Algeria"
#> [5] "American Samoa" "Andorra"
head(countryname(x, destination = 'iso3c'))
#> [1] "AFG" "ALA" "ALB" "DZA" "ASM" "AND"
Does that also work for factor or tibble sourcevar?
-- Vincent Arel-Bundock
It should work exactly how countrycode works: factors as sourcevar work fine; if a tibble is subset, it will fail with an error message; if a tibble column is properly extracted, it will work fine.
source('https://gist.githubusercontent.com/vincentarelbundock/2e00c19c1972e73d708c7cb496bd00d2/raw/15b9cabe96511521a1b40bb9b6d6a0c97662eebe/countryname.R')
alternatives_cust_dict <-
tibble(country.name = names(alternatives),
cldr.short.en = alternatives) %>%
filter(!duplicated(country.name)) # "لتونی"matches both Latvia and Lithuania!?!
countryname <-
function(sourcevar, destination = 'cldr.short.en', warn = FALSE) {
out <-
countrycode(sourcevar = sourcevar,
origin = 'country.name',
destination = 'cldr.short.en',
custom_dict = alternatives_cust_dict,
warn = FALSE)
out[is.na(out)] <-
countrycode(sourcevar = sourcevar[is.na(out)],
origin = 'country.name.en',
destination = 'cldr.short.en',
warn = warn)
if (destination != 'cldr.short.en') {
out <-
countrycode(sourcevar = out,
origin = 'cldr.short.en',
destination = destination,
custom_dict = countrycode::codelist,
warn = warn)
}
return(out)
}
weird <- c('ジンバブエ', 'Afeganistãu', 'Barbadas', 'Sverige', 'UK', 'il-Georgia tan-Nofsinhar u l-Gżejjer Sandwich tan-Nofsinhar')
weird_fctr <- as.factor(weird)
countryname(weird_fctr)
# works
countryname(weird_fctr, destination = 'iso3c')
# works
weird_tbl <- tibble(weird)
countryname(weird_tbl)
# fails
countryname(weird_tbl[1])
# fails
countryname(weird_tbl[, 1])
# fails
countryname(weird_tbl[[1]])
# works
countryname(weird_tbl$weird)
# works
If you wanted it to work with subset tibbles that have only 1 column, you could add this to the top of the function (but I thought you decided against doing any manipulation like this):
if (inherits(sourcevar, "tbl_df") && length(sourcevar) == 1) {
message("You passed a tibble with 1 column. You probably meant to *extract*
the column rather than subset the tibble. Since it's pretty easy to
guess what you want since there's only one column, we're gonna do
that for you automatically. If you don't want to see this message anymore
*extract* the column you want to use as a sourcevar before passing it to
countrycode (i.e. it must be a vector, not a one column tibble)")
sourcevar <- sourcevar[[1]]
}
The countryname function is now implemented in 749843e
FYI, @cjyetman I had to use country.name.en internally, because cldr.short.en does not cover all names (e.g., cow.name). There might be a performance hit that we want to investigate.
Thanks @davidsjoberg for the great suggestion, and thanks all for your insights!
Please let me know if it works on your end so I can close the issue.
This should work:
remotes::install_github('vincentarelbundock/countrycode')
library(countrycode)
weird <- c('ジンバブエ', 'Afeganistãu', 'Barbadas', 'Sverige', 'UK', 'il-Georgia tan-Nofsinhar u l-Gżejjer Sandwich tan-Nofsinhar')
countryname(weird)
countryname(weird, 'iso3c')
Works great! Thank you for fast development and responsiveness!
Well done! I'll update my repo readme of simplecountries to use countrycode::countryname instead :) You may close the issue. When do you think it will be on CRAN?
Cool. Not 100% sure. Hopefully next week.
In my immature package simplecountries I wanted to build a magic function that takes any character vector of country names and translates it to common English.
Reasons for a new package
I have used countrycode many times before but thought that I needed to know which definition of the country variation the sources were using, which was not possible or was time consuming at times. But since
To do this I scraped the Wikipedia page of alternative country names. Translated the Wikipedia common names to countrycode's country.name.en column (maybe there is a better one?). Then I made a huge lookup table and deleted non-unique combinations.
This created a magic (wildcard) function that accepts any variation of country names and translates it to a simple country name which can be used on two datasets to be able to make a join or to simplify country names in a plot.
The code for the lookup table that is used in simple_country_name can be found here.
If you want to include a magic function, or already have one that I missed, it would be great. It is much better suited to the already well written and mature countrycode.
My suggestion for API would be a variation of countrycode::countrycode that does NOT require origin nor destination. But destination could absolutely be an argument for those who want it. But there should be a decent default, like country.name.en. Also, it would be nice to include Wikipedia's alternative country names.
That's all. Thanks for a great package!