vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Regex: simplification #197

Closed cjyetman closed 2 years ago

cjyetman commented 6 years ago

Is there any advantage to having complicated regexes versus a list of alternations of known variations, other than brevity?

For example, U.S. Virgin Islands... "^(?=.*\bu\.?\s?s).*virgin|^(?=.*states).*virgin" versus "United States Virgin Islands|US Virgin Islands|U.S. Virgin Islands|Virgin Islands US|Virgin Islands U.S.|Virgin Islands, US|Virgin Islands, U.S."

which could easily be generated using a list such as our country_name_known_variations.json

library(jsonlite)
library(countrycode)
country_list <- fromJSON("tests/testthat/country_name_known_variations.json")
paste(country_list$`U.S. Virgin Islands`, collapse = "|")
# [1] "United States Virgin Islands|US Virgin Islands|U.S. Virgin Islands|Virgin Islands US|Virgin Islands U.S.|Virgin Islands, US|Virgin Islands, U.S."

I feel like it would be easier to maintain, easier to add new variations, less error prone, and honestly I would be less hesitant to mess with them. I'd even be motivated to build a list of all known variations for all countries based on all of our current code sets, plus a bunch of commonly used country level datasets, which would likely cover a very high percentage of possibilities.

The only minor difficulty I see would be adding exclusions, like preventing "Germany" from matching "East Germany", but that seems easy enough compared to managing complicated regexes for everything.
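
For illustration, one way an exclusion could work is a negative lookahead (a hypothetical sketch, not an actual dictionary entry):

# hypothetical: match "germany" only if "east" appears nowhere in the string
grepl("^(?!.*east).*germany", c("Germany", "West Germany", "East Germany"),
      perl = TRUE, ignore.case = TRUE)
# [1]  TRUE  TRUE FALSE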

@vincentarelbundock do you see any major problems with that (other than the fact that we currently use regex as the unique key for everything)? any thoughts on that?

vincentarelbundock commented 6 years ago

Not sure. Honestly, I'm afraid this would end up being more work than it's worth, and that we would catch fewer cases. Take this:

> library(countrycode)
> example = c( 
+ "Democratic Republic of the Congo",
+ "Democ. Repub. of the Congo",
+ "Dem. Rep. of the Congo",
+ "Dem. Rep. of Congo",
+ "Dem Rep of Congo",
+ "D.R. of the Congo",
+ "D.R. of Congo",
+ "D.R. Congo",
+ "DRCongo",
+ "Zaire")
> countrycode(example, 'country.name', 'iso3c')
 [1] "COD" "COD" "COD" "COD" "COD" "COD" "COD" "COD" "COD" "COD"

Our current regex is complicated, true, but it works, and it could potentially catch many more variations than the ones I included here.

Looking at the static dict, I don't see that many "complicated" ones. Most are super straightforward. The complicated ones handle the weirder country names. Like you, I'm kind of afraid to touch those. But that's why we have tests, no?

vincentarelbundock commented 6 years ago

The "ugly" ones are mostly lookbehinds. Maybe there's a cleaner way to handle that issue.

cjyetman commented 6 years ago

Yeah, I get the appeal of regex, and it can catch lots of variations... but I worry about the unexpected or unanticipated things it can catch.

Like our regex for "Democratic Republic of the Congo", which is "\\bdem.*congo|congo.*\\bdem|congo.*\\bd[\\.]?r|\\bd[\\.]?r.*congo|belgian.?congo|congo.?free.?state|kinshasa|zaire|l.opoldville|drc|droc|rdc"

Just the first bit of that can catch things that were unintended (and honestly, I don't totally understand what all those regex variations are intended to catch), like...

grepl("\\bdem.*congo", "demolished the republic of congo")
# [1] TRUE

whereas this is maybe not as fancy, but doesn't have the same problem...

example <- c( 
  "Democratic Republic of the Congo",
  "Democ. Repub. of the Congo",
  "Dem. Rep. of the Congo",
  "Dem. Rep. of Congo",
  "Dem Rep of Congo",
  "D.R. of the Congo",
  "D.R. of Congo",
  "D.R. Congo",
  "DRCongo",
  "Zaire")

drcongo_simple_regex <- paste(example, collapse = "|")

drcongo_simple_regex
# [1] "Democratic Republic of the Congo|Democ. Repub. of the Congo|Dem. Rep. of the Congo|Dem. Rep. of Congo|Dem Rep of Congo|D.R. of the Congo|D.R. of Congo|D.R. Congo|DRCongo|Zaire"

grepl(drcongo_simple_regex, "demolished the republic of congo")
# [1] FALSE

I'm not really arguing one way or another, just trying to consider options.

It would also be feasible to...

  1. update country_name_known_variations.json to include all country name variations found in the code sets we use so that the testing is more robust
  2. in the future, if we find a new country name variation that we're not catching, simply tack it on to the existing regex as an alternation, rather than letting it sit around for ages because no one's bold enough to modify the existing regex (e.g. "\\bcentral.african.republic" -> "\\bcentral.african.republic|Centr\\. African Rep\\.")

Note to self: we'd have to be careful with regex special characters like ".". Also note: GitHub's auto-formatting messes with escape characters if you don't wrap them in back-ticks.
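
For example, an unescaped "." quietly acts as a wildcard:

# "." matches any character, so a literal-looking pattern over-matches
grepl("U.S. Virgin", "UXSY Virgin")
# [1] TRUE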

A maybe more difficult problem is the exclusions like #170 "Northern Ireland", "N. Ireland", "N Ireland", which I guess need to be added to every alternation in the existing regex (in that case, there's luckily only one).

Just thinking out loud about how to avoid having issues like that sitting around for so long.

vincentarelbundock commented 6 years ago

One reason why I haven't been that worried about false positives is the intended use I had in mind originally: converting a column of country names.

countrycode, as currently designed, is not intended to distinguish a country name from ANY other string (e.g. "Demolished"). The only thing that matters is to not confuse DRC with other country names. (I could see an argument for relaxing this constraint; I'm not convinced that extracting names from arbitrary sentences is a common use case, but I could be convinced eventually. This is just the origin story.)

I agree that some of those are overly aggressive, and I'm in favor of using more pipes if that makes things more transparent.

But I would not want to lose the power of regexes to (a) automatically detect unexpected variations, and (b) not have to update the package every time a user finds a dataset with two (or three) blank spaces between words instead of one, or a different abbreviation of "Republic".
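
To make that concrete, the first alternative of the current DRC pattern already tolerates extra spaces and different abbreviations without any update:

grepl("\\bdem.*congo", c("Dem  Rep of Congo", "Democ. Repub. of the Congo"),
      ignore.case = TRUE)
# [1] TRUE TRUE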

I've encountered so many different versions of the DRC in real-life data that I'm really afraid of the maintainability implications of the proposed strategy, and of the number of false negatives (but maybe those are less bad).

vincentarelbundock commented 6 years ago

All this to say that I'm in favor of reducing complexity using pipes, but I would prefer to settle on a regex syntax that remains flexible enough to catch unexpected variations.

cjyetman commented 6 years ago

I agree with you about the intended design, and about not matching within sentences... but it has been mentioned before, e.g. #162. Personally, I never liked that idea much because I think it would encourage bad practice. I could imagine other cases, though, that are not sentences yet are plausible to see, like "demographic - Republic of the Congo", but that's not the point anyway.

I'm not in favor of replacing the current regexes with something else, nor of implementing a hard rule like "no more regex". I'm just feeling out whether it makes sense to allow incremental additions that are less fancy and more brute force (they would still have to be valid regex, including escaping special characters like ".", and should ideally use things like \\s+ for one or more whitespace characters between words), and maybe incremental simplifications of the more difficult current regex patterns if and when they become a problem.
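
Something like this hypothetical helper is roughly what I have in mind (a sketch only, not tested against the dictionary):

# hypothetical: escape metacharacters so names match literally,
# then loosen the whitespace between words
to_pattern <- function(x) {
  x <- gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
  gsub("\\s+", "\\\\s+", x)
}
grepl(to_pattern("D.R. of the Congo"), "D.R.  of  the Congo", perl = TRUE)
# [1] TRUE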

Continually improving the testing regime is probably the most beneficial thing we can do, though. I think we're on the same page with that.

vincentarelbundock commented 6 years ago

Great. Agree with everything you wrote in that last post.

In time, we really need to scrap the most dangerous patterns like .*

cjyetman commented 6 years ago

One more related/underlying question... do we have a preference for avoiding false positives or false negatives? Seems like avoiding false positives fits in with the philosophy elsewhere (like preferring exact, current country codes versus convenient ones that are not technically correct). Though the preference here does not necessarily have to be the same.

vincentarelbundock commented 6 years ago

I'm worried we're heading down a ROC curve rabbit hole ;)

I'm not sure there can even be an actual answer on this one. User feedback seems quite positive, and so does my own dogfooding, which suggests we may be close to the sweet spot.

I'd definitely want to remove the crazy .* stuff, but I lean toward the status quo otherwise. Again, I'm very willing to be convinced otherwise, and if we move in one direction, it'll probably have to be down a more conservative, NA-heavy path.

In any case, it might be good to document the possibility of false positives when countrycode is used in unexpected contexts.

NilsEnevoldsen commented 5 years ago

As the original author of many of these absurd patterns, I just want to say that indeed, the intent was to catch as many undiscovered variations as possible. There's a circularity here as well: the more common variations we see, the more we want to hedge against other undiscovered ones.

vincentarelbundock commented 5 years ago

Thanks for this comment, @NilsEnevoldsen. I agree.

Recently, I modified the regex for Vietnam along the lines suggested by @cjyetman because it was overaggressive and was mixing up the different variations of "Vietnam" and "Republic of Vietnam" and others. When there is a high risk of confusion, I think we should be conservative. But that only touches a few cases.

For most of the other countries, experience suggests that it's perfectly fine to be aggressive and use those "absurd" regex patterns.

cjyetman commented 4 years ago

Should this be closed since it's not exactly an open issue, though the conversation above is certainly important and relevant going forward?

vincentarelbundock commented 2 years ago

Linked in the main regex improvement issue. Closing since there's no immediate action to take on this.