Closed rfilmyer closed 8 years ago
This should absolutely be fixed, and appropriate unit tests should be added to test-regex-special.R. I am surprised that this has not been reported previously.
Note: Beware matching People's Republic of China
Note: ROC should return NA rather than Taiwan, as ROC may also mean the Republic of Congo.
Do you have a regex to suggest? Current for China is:
^(?!.\bmac)(?!.\bhong)(?!._\btai)._china
I haven't tested this, so it probably contains errors:
China: ^(?!.\bmac)(?!.\bhong)(?!.\btai)(?!.\brep).china|^(?=.*peo)(?=.*rep).*china
In other words: does not contain “ Rep…” unless it also contains “ Peo…”
Taiwan: taiwan|taipei|formosa|^(?!.*peo)(?=.*rep).*china
In other words: may contain “ Rep…” but only if it doesn't contain “ Peo…”
Similar to Yemen madness:
expect_that(cowc_of('republic of yemen'), equals('YEM'))
expect_that(cowc_of('yemen arab republic'), equals('YAR'))
expect_that(cowc_of('people\'s democratic republic of yemen'), equals('YPR'))
This still produces warnings, but the output seems correct (because the Taiwan observation is lower in the countrycode_data than the China observation, so China gets overwritten). Will have to do for now. Thanks!
https://github.com/vincentarelbundock/countrycode/commit/cb7ced3ea1ca1e09f8e0403af70c4dabaafbfe4a
3: In countrycode(name, "country.name", "iso3c", warn = TRUE) : Some strings were matched more than once: Taiwan, Province of China,CHN,TWN; United States Minor Outlying Islands,USA,UMI; Virgin Islands, U.S.,USA,VIR
4: In countrycode(name, "country.name", "iso3c", warn = TRUE) : Some strings were matched more than once: Republic of China,CHN,TWN
Try ^(?!.*\bmac)(?!.*\bhong)(?!.*\btai)(?!.*\brep).*china|^(?=.*peo)(?=.*rep).*china
. I'm not sure why all the .*
became .
in China's regex. I thought I copied straight from the CSV. Odd.
Sorry, that was me. Ran into a problem where it wouldn't recognize China at the start of a string ^China
Oh, that's puzzling. Did you add a unit test for that case?
just did
And did ^(?!.*\bmac)(?!.*\bhong)(?!.*\btai)(?!.*\brep).*china|^(?=.*peo)(?=.*rep).*china
fix it? Seems like it should.
Yes it did. Thanks you so much! I really appreciate it.
While the country name recognition is great, I started playing around it and noticed a geopolitically relevant flaw in a specific edge case: The territory officially named the "Republic of China" is also unofficially known as Taiwan, but
countrycode()
matches the string to the (People's Republic of) China instead.Example:
Expected behavior would be to map all of those strings to TW or Taiwan, Province of China, as appropriate.
I am using version 0.18, the latest (I believe) version from CRAN.