vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0
346 stars 84 forks source link

Use dynamic `origin` and `destination` arg matching in `countrycode()` #270

Closed salim-b closed 1 year ago

salim-b commented 3 years ago

This gives way better error messages, cf. rlang::arg_match().

Plus some cosmetic changes.

vincentarelbundock commented 3 years ago

Thanks!

Going to bed now and the rest of the week is super busy, but I'll take a look as soon as I can.

Initial reaction: I use checkmate for a different package and I like it (very thin), but I also really liked the fact that countrycode has ZERO dependencies beyond base R. I'm not totally closed to the idea, but I'm also not 100% sure this is a merge right now.

Will resume discussion as soon as possible.

salim-b commented 3 years ago

Ok, looking forward to your review.

Regarding the dependencies:

cjyetman commented 3 years ago
  1. I would definitely prefer to not go from 0 dependencies to 4.
  2. I think this might conflict with custom dictionaries where we won't know the names of the columns, and hence the possible values for origin or destination.
vincentarelbundock commented 3 years ago
1. I would definitely prefer to **not** go from 0 dependencies to 4.

2. I think this might conflict with custom dictionaries where we won't know the names of the columns, and hence the possible values for `origin` or `destination`.

To be fair, utils and stats are Base R packages. Still, I agree with the general sentiment and I woke up even more firmly in the "No Dependencies" camp. So if @salim-b doesn't mind, I'd like to revert that aspect of the PR.

And I also agree that if we are going to use some kind of match.call, then this Issue needs to be resolved with a test as part of this PR: https://github.com/vincentarelbundock/countrycode/issues/271

In the README, there is an example with US States codes: the data is hosted on our github repo, and the example would be pretty much plug-and-play for test.

vincentarelbundock commented 3 years ago

Also, if we're not going to onboard the rlang dependency, it's not clear to me at all that there's a benefit of match.call over our reasonably informative stop message.

salim-b commented 3 years ago

I think this might conflict with custom dictionaries where we won't know the names of the columns, and hence the possible values for origin or destination.

Nope, this also works with custom_dict thanks to this code:

https://github.com/vincentarelbundock/countrycode/blob/4c70abadada57f1c7e805f00cecdbcfe3b693cf5/R/countrycode.R#L141-L148

This is also the reason why I've added the checkmate::assert_data_frame() check: If custom_dict is set to something else than a data frame (or to one without colnames), colnames(dictionary) evaluates to NULL resulting in rlang::arg_match() deriving the possible values from the caller frame – which would obviously be wrong when custom_dict is set.

Still, I agree with the general sentiment and I woke up even more firmly in the "No Dependencies" camp. So if @salim-b doesn't mind, I'd like to revert that aspect of the PR.

Ok, so you want me to remove both checkmate and rlang, right?

vincentarelbundock commented 3 years ago

Yeah, but just to be clear, I still want an informative error like we had before, so a match.call won’t do. We need something at least as informative as our old stop calls.

And if we remove checkmate and rlang, im not sure what the purpose of the PR really is. I haven’t reviewed it properly, but I wouldn’t want you to do a bunch more work if it ends up being about trailing white space.

But again, I haven’t reviewed, so I’m probably missing some other benefits...

On Wed, Mar 24, 2021, at 16:39, Salim B wrote:

I think this might conflict with custom dictionaries where we won't know the names of the columns, and hence the possible values for origin or destination.

Nope, this also works with custom_dict thanks to this code:

https://github.com/vincentarelbundock/countrycode/blob/4c70abadada57f1c7e805f00cecdbcfe3b693cf5/R/countrycode.R#L141-L148

This is also the reason why I've added the checkmate::assert_data_frame() check: If custom_dict is set to something else than a data frame (or to one without colnames), colnames(dictionary) evaluates to NULL resulting in rlang::arg_match() deriving the possible values from the caller frame – which would obviously be wrong when custom_dict is set.

Still, I agree with the general sentiment and I woke up even more firmly in the "No Dependencies" camp. So if @salim-b https://github.com/salim-b doesn't mind, I'd like to revert that aspect of the PR.

Ok, so you want me to remove both checkmate and rlang, right?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vincentarelbundock/countrycode/pull/270#issuecomment-806171248, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHQ7MKQYQH2CLKAQ73DFSLTFJEWLANCNFSM4ZWN3CWQ.

-- Vincent Arel-Bundock

Livre en libre accĂšs: Analyse Causale et MĂ©thodes Quantitatives http://arelbundock.com/acmq.html

Professeur agrégé / Associate professor Université de Montréal, Science politique 3150 rue Jean-Brillant, Pav. Lionel-Groulx, C-4020 Montréal, Québec, Canada, H3T 1N8 http://arelbundock.com

salim-b commented 3 years ago

I still want an informative error like we had before

With this PR, the error for a call like countrycode(c('USA', 'DZA'), origin = 'iso3cc', destination = 'cown') is e.g.:

Error: `origin` must be one of "cctld", "country.name", "country.name.de", "cowc", "cown", "dhs", "ecb", "eurostat", "fao", "fips", "gaul", "genc2c", "genc3c", "genc3n", "gwc", "gwn", "imf", "ioc", "iso2c", "iso3c", "iso3n", "p4c", "p4n", "un", "un_m49", "unicode.symbol", "unpd", "vdem", "wb", "wb_api2c", "wb_api3c", "wvs", "country.name.en.regex", or "country.name.de.regex".
Did you mean "iso3c"?
Run `rlang::last_error()` to see where the error occurred. 

With base::match.arg() it would be:

Error in match.arg(origin, formals()$origin) : 
  'arg' should be one of “c”, “cctld”, “country.name”, “country.name.de”, “cowc”, “cown”, “dhs”, “ecb”, “eurostat”, “fao”, “fips”, “gaul”, “genc2c”, “genc3c”, “genc3n”, “gwc”, “gwn”, “imf”, “ioc”, “iso2c”, “iso3c”, “iso3n”, “p4c”, “p4n”, “un”, “un_m49”, “unicode.symbol”, “unpd”, “vdem”, “wb”, “wb_api2c”, “wb_api3c”, “wvs”, “country.name.en.regex”, “country.name.de.regex” 

I think the latter is clearly less elegant but would still serve the main purpose of this PR. Let me know what you think.

vincentarelbundock commented 3 years ago

Got it, thanks for the clarification!

In that case, wouldn't it be better to just improve our stop call with something like this?

if (!is.character(origin) ||
    length(origin) != 1 ||
    !origin %in% valid_origin) {
  stop(sprintf("The `origin` argument must be a character of length 1 equal to one of these values: %s.",
                paste(valid_origin, collapse = ", ")))
}

That would give us both the explicit assertion for input checking and a pretty/informative error message.

vincentarelbundock commented 3 years ago

Also, FYI, all of this may change very soon because of this work in progress PR that is going to affect a lot of things:

https://github.com/vincentarelbundock/countrycode/pull/267/files

salim-b commented 3 years ago

Ok, then let's wait for the merge of #267 first?

vincentarelbundock commented 3 years ago

I just merged the branch, but I'm not planning to release it to CRAN any time soon. I need to play with the design of countrycode_factory quite a bit more before I do that.

vincentarelbundock commented 3 years ago

Feel free to take a look a the current state of affairs. But maybe open an issue with a proposal before diving too deep. I hate to see it when people put in efforts like checkmate implementation and we end up turning it down. Such a waste!

Thanks again for your interest and efforts!

cjyetman commented 3 years ago

I had to jump through hoops to get around the merge conflicts, but in my quick test of this PR, countrycode(c('USA', 'DZA')) returns a vector with NAs and warnings, which I think is undesirable...

> countrycode(c('USA', 'DZA'))
[1] NA NA
Warning messages:
1: In if (origin %in% c("country.name.en", "country.name.de")) { :
  the condition has length > 1 and only the first element will be used
2: In if (destination == "country.name") { :
  the condition has length > 1 and only the first element will be used
3: In countrycode(c("USA", "DZA")) :
  Some values were not matched unambiguously: DZA, USA

current release and dev versions return an error, which I believe is the correct behavior...

> countrycode(c('USA', 'DZA'))
Error in countrycode(c("USA", "DZA")) : 
  argument "origin" is missing, with no default
vincentarelbundock commented 3 years ago

Thanks a lot for looking into this @cjyetman . I pushed some changes last night to master that I think make much of this PR obsolete, because it reinforces argument checking and gives an informative message for valid origins.

Note that the current main branch also includes a new countrycode_factory function with which I have not played much, so I don't intend to publish to CRAN until I (and others) have had some time to test things out and improve on the current bare-bones functionality.