Not sure. Honestly, I'm afraid this would end up being more work than it's worth, and that we would catch fewer cases. Take this:
> library(countrycode)
> example = c(
+ "Democratic Republic of the Congo",
+ "Democ. Repub. of the Congo",
+ "Dem. Rep. of the Congo",
+ "Dem. Rep. of Congo",
+ "Dem Rep of Congo",
+ "D.R. of the Congo",
+ "D.R. of Congo",
+ "D.R. Congo",
+ "DRCongo",
+ "Zaire")
> countrycode(example, 'country.name', 'iso3c')
[1] "COD" "COD" "COD" "COD" "COD" "COD" "COD" "COD" "COD" "COD"
Our current regex is complicated, true, but it works, and it could potentially catch many more variations than the ones I included here.
Looking at the static dict, I don't see that many "complicated" ones. Most are super straightforward. The complicated ones handle the weirder country names. Like you, I'm kind of afraid to touch those. But that's why we have tests, no?
The "ugly" ones are mostly lookbehinds. Maybe there's a cleaner way to handle that issue.
Yeah, I get the appeal of regex, and it can catch lots of variations... but I worry about it catching things that were unexpected or unanticipated.
Like our regex for "Democratic Republic of the Congo", which is
"\\bdem.*congo|congo.*\\bdem|congo.*\\bd[\\.]?r|\\bd[\\.]?r.*congo|belgian.?congo|congo.?free.?state|kinshasa|zaire|l.opoldville|drc|droc|rdc"
just the first bit of that can catch things that were unintended (and honestly, I don't totally understand what all of those regex variations are intended to catch), like..
grepl("\\bdem.*congo", "demolished the republic of congo")
# [1] TRUE
whereas this is maybe not as fancy, but doesn't have the same problem...
example <- c(
"Democratic Republic of the Congo",
"Democ. Repub. of the Congo",
"Dem. Rep. of the Congo",
"Dem. Rep. of Congo",
"Dem Rep of Congo",
"D.R. of the Congo",
"D.R. of Congo",
"D.R. Congo",
"DRCongo",
"Zaire")
drcongo_simple_regex <- paste(example, collapse = "|")
drcongo_simple_regex
# [1] "Democratic Republic of the Congo|Democ. Repub. of the Congo|Dem. Rep. of the Congo|Dem. Rep. of Congo|Dem Rep of Congo|D.R. of the Congo|D.R. of Congo|D.R. Congo|DRCongo|Zaire"
grepl(drcongo_simple_regex, "demolished the republic of congo")
# [1] FALSE
I'm not really arguing one way or another, just trying to consider options.
It would also be feasible to...
"\\bcentral.african.republic"
-> "\\bcentral.african.republic|Centr\\. African Rep\\."
(note to self: we'd have to be careful with regex special characters like ".". Also note: GitHub's auto-formatting messes with escape characters if you don't wrap them in back-ticks.)
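Something like this could automate that step (escape_regex() and add_variant() are hypothetical helpers, just a sketch of the idea):
escape_regex <- function(x) {
  # prefix every regex metacharacter with a backslash
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", x)
}
add_variant <- function(regex, variant) {
  # append a literal known variation as a new alternation
  paste(regex, escape_regex(variant), sep = "|")
}
add_variant("\\bcentral.african.republic", "Centr. African Rep.")
# [1] "\\bcentral.african.republic|Centr\\. African Rep\\."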
A perhaps more difficult problem is exclusions, like #170's "Northern Ireland", "N. Ireland", "N Ireland", which I guess need to be added to every alternation in the existing regex (in that case, there's luckily only one).
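A rough sketch of the lookbehind approach for that case (the pattern below is purely illustrative, not the package's actual Ireland regex):
ireland_regex <- "(?<!northern\\s)(?<!n\\.\\s)(?<!n\\s)ireland"
grepl(ireland_regex, c("Ireland", "Northern Ireland", "N. Ireland", "N Ireland"),
      perl = TRUE, ignore.case = TRUE)
# [1]  TRUE FALSE FALSE FALSE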
Just thinking out loud about how to avoid having issues like that sitting around for so long.
One reason why I haven't been that worried about false positives is the intended use I had in mind originally: converting a column of country names.
countrycode, as currently designed, is not intended to distinguish a country name from ANY other string (e.g. "Demolished"). The only thing that matters is to not confuse DRC with other country names. (I could see an argument for relaxing this constraint; I'm not convinced that extracting names from any sentence is a common use case, but I could be convinced eventually. This is just the origin story.)
I agree that some of those are overly aggressive, and I'm in favor of using more pipes if that makes things more transparent.
But I would not want to lose the power of regexes to (a) automatically detect unexpected variations, and (b) not have to update the package every time a user finds a dataset with two (or three) blank spaces between words instead of one, or a different abbreviation of "Republic".
I've encountered so many different versions of the DRC in real life data that I'm really afraid of the maintainability implications of the proposed strategy, and of the number of false negatives (but maybe those are less bad.)
All this to say that I'm in favor of reducing complexity using pipes, but I would prefer to settle on a regex syntax that remains flexible enough to catch unexpected variations.
I agree with you about the intended design, and about not matching within sentences... but it has been mentioned before, e.g. #162. Personally, I never liked that idea much because I think it would encourage bad practice. I could imagine other cases, though, that are not sentences but are at least plausible to see, like "demographic - Republic of the Congo", but that's not the point anyway.
I'm not in favor of replacing the current regexes with something else, nor of implementing a hard rule like "no more regex". Just feeling out whether it makes sense to allow incremental additions that are less fancy and more brute-force (they would still have to be valid regex, including escaping special characters like ".", and probably ideally use stuff like \\s+ for one or more whitespace characters between words), and maybe incremental simplifications of the more difficult current regex patterns if and when they become a problem.
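For example, a brute-but-flexible pattern could be generated from known variations like this (flexify() is a hypothetical helper, using a shortened version of the list from above):
flexify <- function(x) {
  x <- gsub(".", "\\.", x, fixed = TRUE)  # escape literal dots
  gsub("\\s+", "\\\\s+", x)               # tolerate variable whitespace
}
paste(flexify(c("Dem. Rep. of the Congo", "D.R. Congo", "DRCongo", "Zaire")),
      collapse = "|")
# [1] "Dem\\.\\s+Rep\\.\\s+of\\s+the\\s+Congo|D\\.R\\.\\s+Congo|DRCongo|Zaire"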
Continually improving the testing regime is really probably the most beneficial thing to do though. I think we're on the same page with that.
Great. Agree with everything you wrote in that last post.
In time, we really need to scrap the most dangerous constructs like .*
One more related/underlying question... do we have a preference for avoiding false positives or false negatives? Seems like avoiding false positives fits in with the philosophy elsewhere (like preferring exact, current country codes versus convenient ones that are not technically correct). Though the preference here does not necessarily have to be the same.
I'm worried we're heading down a ROC curve rabbit hole ;)
I'm not sure there can even be an actual answer on this one. User feedback seems quite positive, and my dogfooding too, which suggests we may be close to the sweet spot.
I'd definitely want to remove the crazy .* stuff, but lean toward the status quo otherwise. Again, I'm very willing to be convinced otherwise, and if we move in one direction, it'll probably have to be toward a more conservative, NA-heavy path.
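For illustration only, here's what a bounded gap instead of the unbounded .* could look like (the tightened pattern is a sketch with an arbitrary cap, not a proposed replacement):
# the current unbounded gap matches across unrelated words...
grepl("\\bdem.*congo", "demolished the republic of congo")
# [1] TRUE
# ...while a bounded gap still catches real variations:
tight <- "\\bdem\\w*\\.?\\s+rep.{0,25}congo"
grepl(tight, "demolished the republic of congo", ignore.case = TRUE)
# [1] FALSE
grepl(tight, c("Dem. Rep. of the Congo", "Dem Rep of Congo"), ignore.case = TRUE)
# [1] TRUE TRUE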
In any case, it might be good to document the possibility of false positives when countrycode is used in unexpected contexts.
As the original author of many of these absurd patterns, I just want to say that indeed, the intent was to catch as many undiscovered variations as possible. There's a circularity at work as well: the more common variations we find, the more we want to hedge against other undiscovered ones.
Thanks for this comment @NilsEnevoldsen . I agree.
Recently, I modified the regex for Vietnam along the lines suggested by @cjyetman because it was overaggressive and was mixing up the different variations of "Vietnam", "Republic of Vietnam", and others. When there is a high risk of confusion, I think we should be conservative. But that only touches a few cases.
For most of the other countries, experience suggests that it's perfectly fine to be aggressive and use those "absurd" regex patterns.
Should this be closed since it's not exactly an open issue, though the conversation above is certainly important and relevant going forward?
Linked in the main regex improvement issue. Closing since there's no immediate action to take on this.
Is there any advantage to having complicated regexes versus a list of alternations of known variations, other than brevity?
For example, U.S. Virgin Islands... "^(?=.*\\bu\\.?\\s?s).*virgin|^(?=.*states).*virgin" versus "United States Virgin Islands|US Virgin Islands|U.S. Virgin Islands|Virgin Islands US|Virgin Islands U.S.|Virgin Islands, US|Virgin Islands, U.S."
which could easily be generated using a list such as our country_name_known_variations.json
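Roughly like this (the file name and its structure are assumed here, and jsonlite is just one way to read it):
library(jsonlite)
# assumed structure: {"VIR": ["United States Virgin Islands", "US Virgin Islands", ...], ...}
variations <- fromJSON("country_name_known_variations.json")
escape <- function(x) gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", x)
regexes <- vapply(variations, function(v) paste(escape(v), collapse = "|"),
                  character(1))
regexes["VIR"]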
I feel like it would be easier to maintain, easier to add new variations, and less error prone, and honestly I would be less hesitant to mess with them. I'd even be motivated to build a list of all known variations for all countries based on all of our current code sets, plus a bunch of commonly used country-level datasets, which would likely cover a very high percentage of possibilities.
The only minor difficulty I see would be adding exclusions, like preventing "Germany" from matching "East Germany", but that seems easy enough compared to managing complicated regexes for everything.
@vincentarelbundock do you see any major problems with that (other than the fact that we currently use regex as the unique key for everything)? Any thoughts?