Open vincentarelbundock opened 6 years ago
Not sure when I'll have time to review this in depth, but...
I was just pondering something like this recently... having a separate lookup table for each possible origin-destination pair. Seems a bit complicated, but it should be manageable to create in the dictionary creation code, with precise, traceable code for any specific decisions about matches that need to be made. I'm starting to think this is a better idea than what I was proposing here. My main fear would be how much this increases the size of the package due to heavy duplication of data, especially now that we have all of these cldr language variations.
This probably will cause problems, or a least force changes with the custom dictionary feature.
This could be problematic for other packages that are using codelist
directly, though that's never how it was meant to be used anyway (afaik). In the same vein, we no longer include a large CSV lookup table in the repo, which I think some people were pulling for use in their own projects (hopefully with attribution).
No need to review this in-depth. The code is nowhere near ready, so I'd rather you preserve whatever reviewing energy for later. At this stage, I'm more interested in high-level design input.
FWIW, the compressed binary with every single uni-directional map weighs 912K. I'm not sure if that's a big problem or not (I wouldn't mind, but I live somewhere with reasonably fast internet).
An alternative would be to do merge on the fly, which would cut down on package size but impose a small compute penalty everytime countrycode
is invoked. Maybe there's a way to cache it.
I think it would be trivial to host a CSV file on github and a codelist in the main package for convenience.
A clean solution might be to hold each code in a separate data frame with three columns: country.name.en.regex, code, unique_target
. Then, we merge those dictionaries on the fly, and use the memoise
package to speed-up repeated invocations.
Here, memoise
could be a suggests rather than a depends.
Merging issues. See discussion here: https://github.com/vincentarelbundock/countrycode/issues/180
One nagging problem with
countrycode
(e.g., https://github.com/vincentarelbundock/countrycode/issues/182 https://github.com/vincentarelbundock/countrycode/issues/180 ) is that the current approach tocodelist
strictly requires bidirectional one-to-one mappings.This is problematic in cases where we want:
Russia -> RUS (iso) USSR -> RUS (iso) RUS -> Russia
I have been trying to find a solution forever without much result. Today, I pushed a (nearly working) branch with a potential path forward: https://github.com/vincentarelbundock/countrycode/tree/manytoone
The concept:
countrycode
. This means, for example, that we need a different regexes for Russia and USSR because Correlates of War treat them separately.dictionary/merge.R
codelist
internally, we usecodelist_map
, which is a list of lists of data.frames. For example, if we want to convert from cowc to iso3c, we usecodelist_map$cowc$iso3c
, which is a data.frame with only two columns.One key, for me is number 4 above, and right now too much still happens in the
get_*
functions. Theget
functions should just be scrapers, and users should have access to a well-document script to see how we reconciled origin vs. destination.Curious what @cjyetman thinks of this.