Open cjyetman opened 10 months ago
Do we not support that? We definitely should. It's just the capital form of "i" in Turkish.
Does our regex engine have an equivalent of CultureInvariant
, or is that a dotnet-exclusive nicety?
Does our regex engine have an equivalent of
CultureInvariant
, or is that a dotnet-exclusive nicety?
Oh, that's neat. I had to google it because I had never heard of this flag. Also funny that the C# docs use exactly our problematic case as their main example.
Maybe I missed something, but I don't think grep(perl=TRUE)
supports this out of the box.
We currently rely on ignore.case = TRUE
in our grepl()
command here:
https://github.com/vincentarelbundock/countrycode/blob/848e9ce363026c98a38daac9777005925f18f582/R/countrycode.R#L280-L281
Should we consider casting all strings-to-match with tolower()
before using grepl()
? Probably a small performance hit, but maybe worth it. If we ensure that all of our regexes are lowercase, we could then maybe remove the ignore.case = TRUE
, which would probably give a small performance boost.
turkey <- "TÜRKİYE"
countrycode::countrycode(turkey, "country.name", "country.name")
#> Warning: Some values were not matched unambiguously: TÜRKİYE
#> [1] NA
tolower(turkey)
#> [1] "türkiye"
countrycode::countrycode(tolower(turkey), "country.name", "country.name")
#> [1] "Turkey"
turkey.en.regex <- countrycode::codelist$country.name.en.regex[match("TUR", countrycode::codelist$iso3c)]
grepl(x = turkey, pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1] FALSE
grepl(x = tolower(turkey), pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE
grepl(x = tolower(turkey), pattern = turkey.en.regex, perl = TRUE, ignore.case = FALSE)
#> [1] TRUE
FWIW, I don't have a view and not a real good sense of whether this can cause problems (probably not?).
It's not the most precise solution, but might capture similar issues with other extended characters, while seemingly not changing the logic at all.
A more precise solution could be to add İ
to the regexes for Turkey...
turkey.en.regex <- countrycode::codelist$country.name.en.regex[match("TUR", countrycode::codelist$iso3c)]
grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE TRUE TRUE FALSE
turkey.en.regex
#> [1] "turkey|t(ü|u)rkiye"
turkey.en.regex_mod <- "turkey|t(ü|u)rk(i|İ)ye"
grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex_mod, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE TRUE TRUE TRUE
side note: I don't know why we use ()
(capture groups) instead of []
turkey.en.regex_mod <- "turkey|t[ü|u]rk[i|İ]ye"
grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex_mod, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE TRUE TRUE TRUE
Let's just go with the more precise option, then, as there are fewer unknowns.
What's the benefit of ()
? We don't want to capture anything...
Let's just go with the more precise option, then, as there are fewer unknowns.
👍🏻
What's the benefit of
()
? We don't want to capture anything...
I have no idea. Honestly, I'm somewhat surprised that it works as expected using ()
. My understanding is that "character class" ([]
) is the proper way of matching any one of a list of characters. At some point, a bunch of capture groups (()
) got introduced into the regexes here.
weird. I can see it for lookaheads and such... oh well!
The list-one.xls that we use from https://www.six-group.com/en/products-services/financial-information/data-standards.html to get the ISO4217 codes in get_iso4217.R (or better, will use once the URL is updated in an PR #348) uses
TÜRKİYE
(note theİ
). Is this a valid spelling we should support?