vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

UN vote code not consistent #276

Closed georgeyean closed 3 years ago

georgeyean commented 3 years ago

Hi Vincent,

It looks like the UN Vote country name is not consistent with "un".

For example, Venezuela, Bolivarian Republic of is in: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/LEJUQZ/DYNZPA&version=27.0

but your un csv has: Venezuela, (Bolivarian Republic of)

Do you want to create one for un vote version? I can help to create one and commit the code.

Thanks, George

vincentarelbundock commented 3 years ago

Hi @georgeyean, thanks a lot for bringing this to our attention. I appreciate it!

Our UN codes are the official M49 codes. You can find the original source from the UN website that we use here:

https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/get_un.R

It's quite possible that the country names are slightly different in the Voeten datasets, but I think it's fair to say that the M49 spreadsheet from the UN website that we use is more "authoritative" in that regard than a Dataverse link, in terms of representing UN terminology.

Moreover, I have to say that I'm not very keen on adding new country names to the dataset. countrycode already includes hundreds of different versions of country names (via CLDR and others). Most importantly, I don't see much value in adding new country name variations because, in my opinion, it is extremely bad practice to merge datasets based on country names. There are so many variations of those that getting a "standardized" version is near impossible.

A much better approach is to always convert your country names into a recognized, standardized code (numeric or alpha). Then, use that code to merge data. Finally, convert back to country names using a known, standardized output like the CLDR Unicode short names (or another scheme).
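
A minimal, language-agnostic sketch of that workflow follows, written in Python with only the standard library. The lookup tables, the helper names (`to_iso3`, `merge_on_code`), and the vote value are all hypothetical and made up for illustration; they are not part of countrycode. In R, the package's `countrycode()` function would play the role of the lookup step.

```python
# Hypothetical lookup table: several spellings of the same country map to one
# ISO 3166-1 alpha-3 code. A real project would rely on a maintained
# dictionary instead of a hand-written dict like this one.
NAME_TO_ISO3 = {
    "venezuela, bolivarian republic of": "VEN",    # spelling in the voting data
    "venezuela, (bolivarian republic of)": "VEN",  # spelling quoted from the un csv
    "venezuela": "VEN",
}

# Hypothetical table of standardized display names, keyed by code.
ISO3_TO_DISPLAY = {"VEN": "Venezuela"}


def to_iso3(name):
    """Step 1: convert a country name to a code (None if unknown)."""
    key = " ".join(name.lower().split())  # lowercase, collapse whitespace
    return NAME_TO_ISO3.get(key)


def merge_on_code(left, right):
    """Step 2: inner-join two lists of (country_name, value) pairs on the code.

    Step 3: attach one standardized display name to every merged row.
    """
    right_by_code = {}
    for name, value in right:
        code = to_iso3(name)
        if code is not None:
            right_by_code[code] = value

    merged = []
    for name, value in left:
        code = to_iso3(name)
        if code is not None and code in right_by_code:
            merged.append({
                "iso3": code,
                "country": ISO3_TO_DISPLAY[code],
                "left": value,
                "right": right_by_code[code],
            })
    return merged


# The two sources spell the name differently, so a join on the raw strings
# would silently drop the row. The 0.5 is a made-up placeholder value;
# 862 is the M49 numeric code for Venezuela.
votes = [("Venezuela, Bolivarian Republic of", 0.5)]
m49 = [("Venezuela, (Bolivarian Republic of)", 862)]

merged = merge_on_code(votes, m49)
print(merged)
```

The raw names never serve as the join key here: both spellings resolve to the same code, and the name shown at the end comes from a single standardized table.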

All this to say that I'm not convinced we need to add a new column to our dictionary to accommodate the idiosyncrasies of the UN voting replication dataset from Dataverse. (That dataset is super cool, by the way; I've used it before for research!)

georgeyean commented 3 years ago

Thanks, Vincent! I very much agree that merging data should use codes, not names. I just thought that, since your cool library also converts names (for display, etc.) as a utility, it might be worth adding this well-known UN vote version. But anyway, just a thought. Thanks!

vincentarelbundock commented 3 years ago

Yeah, yeah, totally understand the impulse. I've just been maintaining this for a while, and am weary of changing stuff to match commas and parentheses :)