vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Separate Americas into North and South America for continent? #288

Closed huizezhang-sherry closed 2 years ago

huizezhang-sherry commented 2 years ago

In the codelist, possible values of the continent column include Asia, Europe, Africa, Oceania, Americas, and NA, while it could be more useful to separate "Americas" into North and South America.

Data sources that map countries to continents include here, here and here. In these sources, continents are divided into


I'm happy to make a PR if the maintainer likes this idea.

vincentarelbundock commented 2 years ago

Thanks a lot for the report!

Continents are tricky because, as the Wikipedia entry about them notes, they are defined by convention rather than by authority. As such, different sources use different groupings. There are two separate issues here:

  1. NA values in the current continent code.
  2. Should we replace our current version or add new variations?

Issue 1

Absolutely! We should definitely fill in those missing values. This should be done by adding new rows to this file: https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/data_regions.csv

The missing entries are:

cl <- countrycode::codelist
# Index the column as a plain vector so paste() collapses the names
# instead of deparsing a one-column data frame
cl$country.name.en[is.na(cl$continent)] |> paste(collapse = ", ")
# [1] "Antarctica, Austria-Hungary, Baden, Bavaria, Bouvet Island, British Indian Ocean Territory, Brunswick, Channel Islands, Cocos (Keeling) Islands, Czechoslovakia, French Southern Territories, German Democratic Republic, Hamburg, Hanover, Heard & McDonald Islands, Hesse Electoral, Hesse Grand Ducal, Hesse-Darmstadt, Hesse-Kassel, Kosovo, Mecklenburg Schwerin, Modena, Nassau, Oldenburg, Orange Free State, Parma, Piedmont-Sardinia, Prussia, Sardinia, Saxe-Weimar-Eisenach, Saxony, Serbia and Montenegro, South Georgia & South Sandwich Islands, Tuscany, Two Sicilies, United Province CA, United States Minor Outlying Islands (the), Wuerttemburg, Würtemberg, Yemen Arab Republic, Yemen People's Republic, Yugoslavia, Zanzibar"
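
Until those rows are added to the dictionary, the gaps can be patched on the user side. A minimal sketch, assuming the `custom_match` argument of `countrycode()` (a named vector whose names are origin values and whose values are the desired destination values). The continent assignments in `continent_patch` are illustrative choices, not values taken from the package:

```r
library(countrycode)

# Illustrative overrides (an assumption for this sketch, not package data):
# names are the origin values exactly as they appear in the input vector
continent_patch <- c(
  "Kosovo"     = "Europe",
  "Yugoslavia" = "Europe",
  "Zanzibar"   = "Africa"
)

x <- c("Kosovo", "Yugoslavia", "Zanzibar", "Canada")

# Entries in custom_match take precedence over the built-in dictionary;
# everything else ("Canada" here) is converted as usual
patched <- countrycode(x, "country.name", "continent",
                       custom_match = continent_patch)
patched
```

Because the names in `custom_match` are compared against the input as written, spelling variants of the same country would each need their own entry.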

Issue 2

A few things to keep in mind here:

  1. It is very important to maintain backward compatibility. Unless it is absolutely necessary, I don’t want to break the old code of anyone who relied on the current definition of continent.
  2. I want to avoid the proliferation of codes. As noted above, continents are a convention, so there are several variations. Likewise, “regions” can be defined in many ways. We include a few in countrycode, but wanting to cover them all is a deeeep rabbit hole.
  3. To the extent possible, we prefer to add new codes from official sources, where “official” is loosely defined as coming from a government, international organization, or research group.

In sum, to make a change we would need to make sure that it either fixes an uncontroversial error in the current codes, or that it adds significant value.
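
One way to respect these constraints is to keep an alternative scheme outside the package. A minimal sketch, assuming the `custom_dict` argument of `countrycode()`. The data frame `my_dict`, its `continent.split` column, and its three rows are hypothetical and only serve as an illustration:

```r
library(countrycode)

# Hypothetical user-side dictionary (not part of countrycode).
# The origin column ("iso3c" here) must not contain duplicates.
my_dict <- data.frame(
  iso3c = c("CAN", "BRA", "FRA"),
  continent.split = c("North America", "South America", "Europe"),
  stringsAsFactors = FALSE
)

# Both origin and destination refer to columns of the custom dictionary
split_continent <- countrycode(c("BRA", "CAN", "FRA"),
                               origin = "iso3c",
                               destination = "continent.split",
                               custom_dict = my_dict)
split_continent
```

This leaves the built-in `continent` code untouched, so existing code keeps its current behavior.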

Note that countrycode already includes several other region groupings, which would make it very easy for you to split North/South America. For example, check out the un.regionsub.name code:

countrycode::countrycode("Canada", "country.name", "un.regionsub.name")
# [1] "Northern America"
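
Building on that, a North/South split can be derived from the existing codes without touching `continent`. A rough sketch, assuming the `un.regionintermediate.name` code is available in the installed version. The helper `split_americas()` is hypothetical, and grouping Central America and the Caribbean with North America is just one convention among several:

```r
library(countrycode)

# Hypothetical helper (not part of countrycode): relabel "Americas" using the
# UN intermediate region. Assumption: anything in the Americas that is not in
# the "South America" intermediate region is labeled "North America".
split_americas <- function(x, origin = "country.name") {
  continent <- countrycode(x, origin, "continent")
  # Northern America has no intermediate region, so NA values are expected
  # here; warn = FALSE silences the corresponding warning
  intermediate <- countrycode(x, origin, "un.regionintermediate.name",
                              warn = FALSE)
  is_americas <- continent %in% "Americas"
  is_south <- intermediate %in% "South America"
  continent[is_americas & is_south] <- "South America"
  continent[is_americas & !is_south] <- "North America"
  continent
}

split_result <- split_americas(c("Canada", "Mexico", "Brazil", "France"))
split_result
```

Countries outside the Americas keep their current `continent` value, so the helper can be applied to a full country column.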

vincentarelbundock commented 2 years ago

Since we already have several region/continent codes, and since I am unlikely to work on this myself, I will close the issue to clean up the repo. But anyone should feel free to re-open or keep commenting if they intend to prepare and submit code via a pull request.