"Republic of China" (Taiwan's official name) matches to China

vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.

https://vincentarelbundock.github.io/countrycode

GNU General Public License v3.0

346 stars 84 forks source link

"Republic of China" (Taiwan's official name) matches to China #43

Closed rfilmyer closed 8 years ago

rfilmyer commented 9 years ago

While the country name recognition is great, I started playing around it and noticed a geopolitically relevant flaw in a specific edge case: The territory officially named the "Republic of China" is also unofficially known as Taiwan, but countrycode() matches the string to the (People's Republic of) China instead.

Example:

> countrycode("Republic of China", "country.name", "iso2c")
[1] "CN"
> countrycode("Republic of China", "country.name", "country.name")
[1] "China"
> countrycode("TW", "iso2c", "country.name")
[1] "Taiwan, Province of China"
> countrycode("Taiwan", "country.name", "country.name")
[1] "Taiwan, Province of China"

Expected behavior would be to map all of those strings to TW or Taiwan, Province of China, as appropriate.

I am using version 0.18, the latest (I believe) version from CRAN.

NilsEnevoldsen commented 9 years ago

This should absolutely be fixed, and appropriate unit tests should be added to test-regex-special.R. I am surprised that this has not been reported previously.

Note: Beware matching People's Republic of China

Note: ROC should return NA rather than Taiwan, as ROC may also mean the Republic of Congo.

vincentarelbundock commented 8 years ago

Do you have a regex to suggest? Current for China is:

^(?!.\bmac)(?!.\bhong)(?!._\btai)._china

NilsEnevoldsen commented 8 years ago

I haven't tested this, so it probably contains errors:

China: ^(?!.\bmac)(?!.\bhong)(?!.\btai)(?!.\brep).china|^(?=.*peo)(?=.*rep).*china

In other words: does not contain “ Rep…” unless it also contains “ Peo…”

Taiwan: taiwan|taipei|formosa|^(?!.*peo)(?=.*rep).*china

In other words: may contain “ Rep…” but only if it doesn't contain “ Peo…”

Similar to Yemen madness:

    expect_that(cowc_of('republic of yemen'), equals('YEM'))
    expect_that(cowc_of('yemen arab republic'), equals('YAR'))
    expect_that(cowc_of('people\'s democratic republic of yemen'), equals('YPR'))

vincentarelbundock commented 8 years ago

This still produces warnings, but the output seems correct (because the Taiwan observation is lower in the countrycode_data than the China observation, so China gets overwritten). Will have to do for now. Thanks!

https://github.com/vincentarelbundock/countrycode/commit/cb7ced3ea1ca1e09f8e0403af70c4dabaafbfe4a

3: In countrycode(name, "country.name", "iso3c", warn = TRUE) : Some strings were matched more than once: Taiwan, Province of China,CHN,TWN; United States Minor Outlying Islands,USA,UMI; Virgin Islands, U.S.,USA,VIR

4: In countrycode(name, "country.name", "iso3c", warn = TRUE) : Some strings were matched more than once: Republic of China,CHN,TWN

NilsEnevoldsen commented 8 years ago

Try ^(?!.*\bmac)(?!.*\bhong)(?!.*\btai)(?!.*\brep).*china|^(?=.*peo)(?=.*rep).*china. I'm not sure why all the .* became . in China's regex. I thought I copied straight from the CSV. Odd.

vincentarelbundock commented 8 years ago

Sorry, that was me. Ran into a problem where it wouldn't recognize China at the start of a string ^China

NilsEnevoldsen commented 8 years ago

Oh, that's puzzling. Did you add a unit test for that case?

vincentarelbundock commented 8 years ago

just did

NilsEnevoldsen commented 8 years ago

And did ^(?!.*\bmac)(?!.*\bhong)(?!.*\btai)(?!.*\brep).*china|^(?=.*peo)(?=.*rep).*china fix it? Seems like it should.

vincentarelbundock commented 8 years ago

Yes it did. Thanks you so much! I really appreciate it.

vincentarelbundock commented 8 years ago

https://github.com/vincentarelbundock/countrycode/commit/ef1ead964be6215885bcc9125f8f9fc2a0a8c094