Regex: common abbreviations

Olucik commented 7 years ago

countrycode("Centr. African Rep.", "country.name", "country.name") should result in "Central African Republic"
countrycode("Dominic. Republic.", "country.name", "country.name") should result in "Dominican Republic"
countrycode("Kuweit", "country.name", "country.name") should result in "Kuwait"
countrycode("Timor", "country.name", "country.name") should result in "Timor Leste"
countrycode("Korea, Democratic Republic of", "country.name", "country.name") should result in "Democratic People's Republic of Korea"

vincentarelbundock commented 7 years ago

Thanks for the report, but I'm not sure about some of these.

"Korea, Democratic Republic of" is picked up properly by the latest version of the package (install from GitHub using devtools).
Kuweit is a misspelling (or different language). I don't think countrycode should be expected to fix all possible misspellings. That's a very deep rabbit hole.
Same thing with abbreviations. I'm not completely opposed to supporting some very widespread abbreviations (e.g., USA), but "Dominic" instead of "Dominican" seems like a dataset-specific issue more than a common case that countrycode ought to support. There's a downside to including too many wildcards in regexes, since it increases the chance that we'll pick up false positives.
Timor is not the same as Timor-Leste, and we probably want to pick up only those names that match the corresponding ISO codes in the countrycode conversion dictionary.
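
To make the false-positive trade-off concrete, here is a minimal base R sketch. Both patterns are made up for illustration and are not taken from countrycode's regex dictionary. The point is only that a loose pattern which catches "Dominic." also catches Dominica, a different country, while a pattern that spells out the allowed abbreviations does not.

```r
# Hypothetical patterns for illustration only; NOT countrycode's dictionary entries.
inputs <- c("Dominican Republic", "Dominic. Republic.", "Dominican Rep.", "Dominica")

# Too permissive: any string containing "dominic" matches.
too_loose <- "dominic"

# Spells out the tolerated abbreviations: "Dominic." / "Dominican" followed by
# "Rep." / "Republic".
abbrev_aware <- "\\bdominic(an|\\.)?\\s*rep(ublic|\\.)?"

loose_hits <- grepl(too_loose, inputs, ignore.case = TRUE, perl = TRUE)
aware_hits <- grepl(abbrev_aware, inputs, ignore.case = TRUE, perl = TRUE)

# Side-by-side view: "Dominica" is a hit only for the loose pattern.
print(data.frame(input = inputs, too_loose = loose_hits, abbrev_aware = aware_hits))
```

The stricter pattern is longer and harder to read, which is part of the maintenance cost of supporting abbreviations in the regexes.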

I'm not completely closed to any of these ideas, but an actual argument should be made before effort is expended.

vincentarelbundock commented 7 years ago

Also, you might want to look into the custom_match argument.
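
For dataset-specific spellings like the ones reported above, a sketch of how custom_match could be used, assuming the countrycode package is installed. The mapping below is an example built from the names in this thread, not something shipped with the package. custom_match takes a named vector whose names are the origin values and whose values are the desired destination values.

```r
library(countrycode)

# Example mapping for spellings specific to one dataset (an assumption for
# illustration): names are the raw source values, values are the desired output.
custom <- c(
  "Kuweit"              = "Kuwait",
  "Centr. African Rep." = "Central African Republic"
)

raw <- c("Kuweit", "Centr. African Rep.", "France")

# Values listed in custom_match override the default result; everything else
# goes through the regular country.name regex matching.
result <- countrycode(raw, "country.name", "country.name", custom_match = custom)
print(result)
```

This keeps one-off fixes in the user's own script instead of widening the shared regexes.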

Olucik commented 7 years ago

Thank you. "Korea, Democratic People's Republic of" works, but "Korea, Democratic Republic of" does not work with the current dev version.

vincentarelbundock commented 7 years ago

I see. Is that a common form? And how would you modify the regex?

Olucik commented 7 years ago

It's how the Social Progress Index names it (for all years). I haven't seen it anywhere else yet. Maybe it still falls under the "custom" category.

vincentarelbundock commented 7 years ago

Good to know. I just tried googling the expression in quotes and can't really find other instances. I think I'll leave this issue open for future consideration, but not do anything for now.

Sorry if that seems unresponsive. I'm just not 100% convinced by the use case, and I'm super busy at work these days (and trying to minimize non-essential tasks).

Olucik commented 7 years ago

thank you!

cjyetman commented 6 years ago

Just hit this one, "Central African Rep.", in CEPII IPD 2012...

> countrycode("Central African Rep.", "country.name", "country.name")
[1] NA
Warning message:
In countrycode("Central African Rep.", "country.name", "country.name") :
  Some values were not matched unambiguously: Central African Rep.

vincentarelbundock commented 6 years ago

What's your view on abbreviations? Should those be countrycode's responsibility? I suppose we could make it countrycode's responsibility, but slippery slope, etc.

cjyetman commented 6 years ago

I think we're already on that slope since we currently do: "Korea, Rep. of", "Rep. of Korea", "U.S.A.", "D.P.R. Korea", "D.R. Congo", "U.S. Virgin Islands", etc. Since there's a finite number of country names, and an even smaller finite list of words/names within that which can be reasonably abbreviated, that slope may be slippery, but not so long.

And since the regex codes are really the core value of the package, then maybe they should be as adaptable as possible? Of course, what does/should trump all of that is who/when/how does it get done, does it negatively affect anything else, etc., which are completely valid concerns also.
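
As a rough illustration of how short that list of abbreviations can be, here is a base R sketch that expands a small, hand-written table of abbreviations before any matching is done. This is a user-side alternative, not how countrycode handles abbreviations; the helper name expand_abbreviations and the table contents are hypothetical.

```r
# Hypothetical abbreviation table: names are regex fragments, values are the
# expansions. Only abbreviations mentioned in this thread are included.
abbreviations <- c(
  "Rep\\."   = "Republic",
  "Centr\\." = "Central"
)

# Hypothetical helper: replace each known abbreviation at a word boundary.
expand_abbreviations <- function(x, table = abbreviations) {
  for (pattern in names(table)) {
    x <- gsub(paste0("\\b", pattern), table[[pattern]], x, perl = TRUE)
  }
  x
}

raw <- c("Central African Rep.", "Centr. African Rep.", "Korea, Rep. of")
expanded <- expand_abbreviations(raw)
print(expanded)
```

The expanded strings can then be passed to countrycode() as usual. Whether the same tolerance belongs inside the package regexes is the question this thread is about.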

vincentarelbundock commented 6 years ago

I'm convinced.

cjyetman commented 6 years ago

Changed the title because this thread eventually led to an agreement that some abbreviations should be considered for addition to the regexes.

vincentarelbundock commented 2 years ago

Thanks again for opening this issue. If someone has very specific suggestions for changes to the regular expressions, I encourage them to create a Pull Request by modifying the dictionary/data_regex.csv file.

For now, we'll close this issue to tidy up the repo.

vincentarelbundock / countrycode

Regex: common abbreviations #164