vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0
342 stars 84 forks

"Some strings were matched more than once, and therefore set to <NA>" - Add option to keep first value #334

Closed rempsyc closed 1 year ago

rempsyc commented 1 year ago

Hi Vincent,

Let me give you a bit of context about my current project, in which I use countrycode. I fetch academic paper metadata through the PubMed API to see the percentage of first authors from different countries and continents. PubMed only gives me a long address string, so I have to parse it to extract the university, then match the university to a country through a separate database, and finally map the country to a continent through countrycode. The data dashboard can be viewed here:

https://remi-theriault.com/dashboards/neglected_95

This approach misses a lot of countries, so in a second step, for missing values, I reparse the address to see whether any country is mentioned, using a reverse regex search (does any of the country names listed in countryname_dict$country.name.en appear in the address string x?).
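For readers unfamiliar with the idea, a reverse search of this kind might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the code actually used in the project: `country_names` is a hypothetical three-entry stand-in for `countryname_dict$country.name.en`, and `find_country_fixed` is a made-up helper name. Note that "first" here means first in the order of the name vector, not first in the address string.

```r
# Hypothetical stand-in for the names in countryname_dict$country.name.en
country_names <- c("China", "Mongolia", "Canada")

# For each address, return the first listed country name that appears
# verbatim in it, or NA when none does. Matching is case-sensitive.
find_country_fixed <- function(addresses, names = country_names) {
  vapply(addresses, function(address) {
    hits <- names[vapply(names, grepl, logical(1), x = address, fixed = TRUE)]
    if (length(hits) == 0) NA_character_ else hits[1]
  }, character(1), USE.NAMES = FALSE)
}

find_country_fixed(c("University, China, province", "Unknown Institute"))
#> [1] "China" NA
```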

However, I realized that countrycode::countrycode can already do that:

x <- "University, China, province"
countrycode::countrycode(x, "country.name", "country.name")
#> [1] "China"

x <- "University, China, Mongolia"
countrycode::countrycode(x, "country.name", "country.name")
#> Warning: Some values were not matched unambiguously: University, China, Mongolia
#> Warning: Some strings were matched more than once, and therefore set to <NA> in the result: University, China, 
#> Mongolia,China,Mongolia
#> [1] NA

Created on 2023-07-08 with reprex v2.0.2

However, the address sometimes contains two countries (e.g., two affiliations for the same author), and I would like to be able to keep only the first one, in order to reduce the number of missing values in my data set. Making mistakes here is not as problematic as the high number of missing values, so I'm OK with some errors. Therefore, I wish countrycode::countrycode had an additional argument that lets users decide what to do when there are several matches, which would default to NA, but which could also be set to keep the first (or the last) match.

Would something like this be possible? Correcting the address strings at the source is not realistic given the high volume of data, so the only other alternative would be for me to write an equivalent regex that can do the same.

Note: This is related to #94

vincentarelbundock commented 1 year ago

@rempsyc this sounds like a very cool project.

Your suggestion is interesting, but I'll have to think about it because I'm a bit afraid it would encourage bad practices. (Not in your case, but in more "standard" contexts.)

In any case, you should know that the real value of countrycode is its dictionary. The code is trivial.

You can almost certainly achieve what you want with a 4-line function. Just loop over the elements of codelist$country.name.en.regex and call grep(perl=TRUE) on each of them.

You'll get the same result as with the option you suggest.
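A minimal sketch of that loop, under stated assumptions, could look like the following. To stay self-contained it uses a hypothetical two-row `dict` instead of the real dictionary; with countrycode installed, the same columns could be taken from `codelist$country.name.en` and `codelist$country.name.en.regex`. The function name `match_country` and its `multiple` argument are illustrative only and are not part of countrycode(). Here "first" and "last" refer to the order of the dictionary rows, not to the position of the country in the string.

```r
# Hypothetical two-row stand-in for countrycode::codelist
dict <- data.frame(
  name  = c("China", "Mongolia"),
  regex = c("china", "mongolia"),
  stringsAsFactors = FALSE
)

# Match a single string against every regex and resolve multiple hits.
# `multiple` is a hypothetical argument, not an option of countrycode().
match_country <- function(x, dict, multiple = c("na", "first", "last")) {
  multiple <- match.arg(multiple)
  hit <- vapply(dict$regex, grepl, logical(1), x = x,
                perl = TRUE, ignore.case = TRUE)
  found <- dict$name[hit]
  if (length(found) == 0) return(NA_character_)
  if (length(found) == 1) return(found)
  switch(multiple,
         na    = NA_character_,
         first = found[1],
         last  = found[length(found)])
}

x <- "University, China, Mongolia"
match_country(x, dict)                      # NA, since two countries match
match_country(x, dict, multiple = "first")  # "China"
```

To process a whole column of addresses, the function can be applied element by element, for example with `vapply(addresses, match_country, character(1), dict = dict, multiple = "first")`.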

rempsyc commented 1 year ago

I can understand your concern about encouraging bad practices, so I think we can close this issue, actually. The workaround I've been using is Ecfun::rgrep (essentially the same as what you describe), and it works pretty well, although it is slow on my big database (even with fixed = TRUE). Thanks, and happy to have been able to share my project with you so you can see how countrycode is being used :P