vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Many-to-One mappings #186

Open vincentarelbundock opened 6 years ago

vincentarelbundock commented 6 years ago

One nagging problem with countrycode (e.g., https://github.com/vincentarelbundock/countrycode/issues/182 and https://github.com/vincentarelbundock/countrycode/issues/180) is that the current approach to codelist strictly requires bidirectional one-to-one mappings.

This is problematic in cases where we want:

Russia -> RUS (iso)
USSR -> RUS (iso)
RUS -> Russia

I have been trying to find a solution forever without much result. Today, I pushed a (nearly working) branch with a potential path forward: https://github.com/vincentarelbundock/countrycode/tree/manytoone

The concept:

  1. A unique regex identifies every single geographic unit covered by any of the schemes in countrycode. This means, for example, that we need different regexes for Russia and USSR because Correlates of War treats them separately.
  2. Each destination code must be associated with one and only one regex: many-to-one
  3. Origin codes can be associated with more than one regex: many-to-one
  4. This requires that we keep separate lists of origin and destination codes. The differences between origin and destination codes are handled explicitly in a centralized location: dictionary/merge.R
  5. Instead of using codelist internally, we use codelist_map, which is a list of lists of data.frames. For example, if we want to convert from cowc to iso3c, we use codelist_map$cowc$iso3c, which is a data.frame with only two columns.
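
To make the nested structure in point 5 concrete, here is a minimal sketch of how a codelist_map-style object could be built from separate origin and destination lists. This is not the code on the manytoone branch: the `unit` column (a plain string standing in for each unit's unique regex), the `build_map()` and `convert()` helpers, and the toy data are all hypothetical, and name matching by regex is left out entirely.

```r
# Toy data only: "unit" stands in for the unique regex of each geographic
# unit, and the scheme contents are illustrative, not the real dictionary.
origin <- list(
  name  = data.frame(unit = c("russia", "ussr"), code = c("Russia", "USSR"),
                     stringsAsFactors = FALSE),
  iso3c = data.frame(unit = "russia", code = "RUS", stringsAsFactors = FALSE)
)
destination <- list(
  name  = data.frame(unit = c("russia", "ussr"), code = c("Russia", "USSR"),
                     stringsAsFactors = FALSE),
  iso3c = data.frame(unit = c("russia", "ussr"), code = c("RUS", "RUS"),
                     stringsAsFactors = FALSE)
)

# Hypothetical helper: one two-column data.frame per origin/destination pair.
build_map <- function(origin, destination) {
  lapply(setNames(names(origin), names(origin)), function(o) {
    targets <- setdiff(names(destination), o)
    lapply(setNames(targets, targets), function(d) {
      # Inner join on the unit key; the key itself is dropped afterwards.
      m <- merge(origin[[o]], destination[[d]], by = "unit",
                 suffixes = c(".o", ".d"))
      setNames(m[, c("code.o", "code.d")], c(o, d))
    })
  })
}

codelist_map <- build_map(origin, destination)

# Hypothetical lookup against a single unidirectional table.
convert <- function(x, origin, destination, map = codelist_map) {
  tab <- map[[origin]][[destination]]
  tab[[destination]][match(x, tab[[origin]])]
}

convert(c("Russia", "USSR"), "name", "iso3c")
convert("RUS", "iso3c", "name")
```

With this toy data, both "Russia" and "USSR" convert to "RUS", while "RUS" converts back to "Russia" only, because the origin and destination sides of iso3c are declared separately.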

One key point, for me, is number 4 above. Right now, too much still happens in the get_* functions. The get_* functions should just be scrapers, and users should have access to a well-documented script that shows how we reconciled origin vs. destination codes.

Curious what @cjyetman thinks of this.

cjyetman commented 6 years ago

Not sure when I'll have time to review this in depth, but...

I was just pondering something like this recently... having a separate lookup table for each possible origin-destination pair. Seems a bit complicated, but it should be manageable to create in the dictionary creation code, with precise, traceable code for any specific decisions about matches that need to be made. I'm starting to think this is a better idea than what I was proposing here. My main fear would be how much this increases the size of the package due to heavy duplication of data, especially now that we have all of these cldr language variations.

This will probably cause problems with, or at least force changes to, the custom dictionary feature.

This could be problematic for other packages that are using codelist directly, though that's never how it was meant to be used anyway (afaik). In the same vein, we no longer include a large CSV lookup table in the repo, which I think some people were pulling for use in their own projects (hopefully with attribution).

vincentarelbundock commented 6 years ago

No need to review this in depth. The code is nowhere near ready, so I'd rather you save your reviewing energy for later. At this stage, I'm more interested in high-level design input.

FWIW, the compressed binary with every single uni-directional map weighs 912K. I'm not sure if that's a big problem or not (I wouldn't mind, but I live somewhere with reasonably fast internet).

An alternative would be to do the merge on the fly, which would cut down on package size but impose a small compute penalty every time countrycode is invoked. Maybe there's a way to cache it.

I think it would be trivial to host a CSV file on GitHub and to keep a codelist in the main package for convenience.

vincentarelbundock commented 6 years ago

A clean solution might be to hold each code in a separate data frame with three columns: country.name.en.regex, code, unique_target. Then, we merge those dictionaries on the fly, and use the memoise package to speed up repeated invocations.

Here, memoise could be listed under Suggests rather than Depends.
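
A rough sketch of what that could look like, under stated assumptions: the thread does not define unique_target, so here it is assumed to be a logical flag marking the single regex a code resolves to when that code is used as the origin. The `dict` object, the `toy` scheme, and `merge_pair()` are hypothetical, and the regex column holds plain strings rather than real regexes. memoise is only used if it happens to be installed, as a Suggests dependency would require.

```r
# Toy dictionaries with the three proposed columns. ASSUMPTION: unique_target
# is TRUE for the one row a code maps to when the code is the origin.
dict <- list(
  iso3c = data.frame(
    country.name.en.regex = c("russia", "ussr"),
    code = c("RUS", "RUS"),
    unique_target = c(TRUE, FALSE),
    stringsAsFactors = FALSE
  ),
  toy = data.frame(  # made-up scheme that keeps the two units apart
    country.name.en.regex = c("russia", "ussr"),
    code = c("R1", "U1"),
    unique_target = c(TRUE, TRUE),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: build the two-column table for one pair on demand.
merge_pair <- function(origin, destination) {
  o <- dict[[origin]]
  # Origin side: keep only the rows flagged as the unique target.
  o <- o[o$unique_target, c("country.name.en.regex", "code")]
  # Destination side: every regex may point at the code.
  d <- dict[[destination]][, c("country.name.en.regex", "code")]
  m <- merge(o, d, by = "country.name.en.regex", suffixes = c(".o", ".d"))
  setNames(m[, c("code.o", "code.d")], c(origin, destination))
}

# Cache repeated calls only when the optional package is available.
if (requireNamespace("memoise", quietly = TRUE)) {
  merge_pair <- memoise::memoise(merge_pair)
}

merge_pair("toy", "iso3c")
merge_pair("iso3c", "toy")
```

In this toy setup, converting from toy to iso3c sends both R1 and U1 to RUS, while converting from iso3c to toy sends RUS to R1 only. The result is the same with or without memoise; the cache only saves the repeated merge.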

vincentarelbundock commented 4 years ago

Merging issues. See discussion here: https://github.com/vincentarelbundock/countrycode/issues/180