vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0
336 stars 84 forks source link

`TÜRKİYE` a valid spelling? #347

Open cjyetman opened 10 months ago

cjyetman commented 10 months ago

The list-one.xls that we use from https://www.six-group.com/en/products-services/financial-information/data-standards.html to get the ISO4217 codes in get_iso4217.R (or better, will use once the URL is updated in an PR #348) uses TÜRKİYE (note the İ). Is this a valid spelling we should support?

vincentarelbundock commented 10 months ago

Probably relevant: https://en.wikipedia.org/wiki/%C4%B0#:~:text=%C4%B0%2C%20or%20i%2C%20called%20dotted,Kazakh%2C%20Tatar%2C%20and%20Turkish.

NilsEnevoldsen commented 10 months ago

Do we not support that? We definitely should. It's just the capital form of "i" in Turkish.

NilsEnevoldsen commented 10 months ago

Does our regex engine have an equivalent of CultureInvariant, or is that a dotnet-exclusive nicety?

vincentarelbundock commented 10 months ago

Does our regex engine have an equivalent of CultureInvariant, or is that a dotnet-exclusive nicety?

Oh, that's neat. I had to google it because I had never heard of this flag. Also funny that the C# docs use exactly our problematic case as their main example.

Maybe I missed something, but I don't think grep(perl=TRUE) supports this out of the box.

Screenshot 2023-09-09 145903

cjyetman commented 10 months ago

We currently rely on ignore.case = TRUE in our grepl() command here: https://github.com/vincentarelbundock/countrycode/blob/848e9ce363026c98a38daac9777005925f18f582/R/countrycode.R#L280-L281

Should we consider casting all strings-to-match with tolower() before using grepl()? Probably a small performance hit, but maybe worth it. If we ensure that all of our regexes are lowercase, we could then maybe remove the ignore.case = TRUE, which would probably give a small performance boost.

turkey <- "TÜRKİYE"

countrycode::countrycode(turkey, "country.name", "country.name")
#> Warning: Some values were not matched unambiguously: TÜRKİYE
#> [1] NA

tolower(turkey)
#> [1] "türkiye"
countrycode::countrycode(tolower(turkey), "country.name", "country.name")
#> [1] "Turkey"

turkey.en.regex <- countrycode::codelist$country.name.en.regex[match("TUR", countrycode::codelist$iso3c)]
grepl(x = turkey, pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1] FALSE
grepl(x = tolower(turkey), pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE
grepl(x = tolower(turkey), pattern = turkey.en.regex, perl = TRUE, ignore.case = FALSE)
#> [1] TRUE
vincentarelbundock commented 10 months ago

FWIW, I don't have a view and not a real good sense of whether this can cause problems (probably not?).

cjyetman commented 10 months ago

It's not the most precise solution, but might capture similar issues with other extended characters, while seemingly not changing the logic at all.

A more precise solution could be to add İ to the regexes for Turkey...

turkey.en.regex <- countrycode::codelist$country.name.en.regex[match("TUR", countrycode::codelist$iso3c)]

grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1]  TRUE  TRUE  TRUE FALSE

turkey.en.regex
#> [1] "turkey|t(ü|u)rkiye"
turkey.en.regex_mod <- "turkey|t(ü|u)rk(i|İ)ye"

grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex_mod, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE TRUE TRUE TRUE

side note: I don't know why we use () (capture groups) instead of []

turkey.en.regex_mod <- "turkey|t[ü|u]rk[i|İ]ye"
grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex_mod, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE TRUE TRUE TRUE
vincentarelbundock commented 10 months ago

Let's just go with the more precise option, then, as there are fewer unknowns.

What's the benefit of ()? We don't want to capture anything...

cjyetman commented 10 months ago

Let's just go with the more precise option, then, as there are fewer unknowns.

👍🏻

What's the benefit of ()? We don't want to capture anything...

I have no idea. Honestly, I'm somewhat surprised that it works as expected using (). My understanding is that "character class" ([]) is the proper way of matching any one of a list of characters. At some point, a bunch of capture groups (()) got introduced into the regexes here.

vincentarelbundock commented 10 months ago

weird. I can see it for lookaheads and such... oh well!