vincentarelbundock / countrycode

R package: Convert country names and country codes. Assigns region descriptors.
https://vincentarelbundock.github.io/countrycode
GNU General Public License v3.0

Namibia is missing in some coding schemes #261

Closed dieghernan closed 3 years ago

dieghernan commented 3 years ago

Namibia again. I have realised that it is missing in four coding schemes (eurostat, genc2c, wb_api2c, ecb), since in all of them the 2-letter code is "NA". See sources:

Reprex with the latest CRAN release

``` r
library(countrycode)

# Test
countrycode("NAM", "iso3c", "iso2c")
#> [1] "NA"
countrycode("NAM", "iso3c", "eurostat")
#> Warning in countrycode("NAM", "iso3c", "eurostat"): Some values were not matched unambiguously: NAM
#> [1] NA

# Analyze
df <- codelist

# Filter Namibia
check <- df[df$country.name.en == "Namibia",]

# Check NA cols
NAscol <- colnames(check)[is.na(check[1, ])]

# Drop the cldr fields
NAscol <- NAscol[-grep("cldr", NAscol)]
NAscol
#> [1] "ecb"      "eu28"     "eurostat" "genc2c"   "wb_api2c"

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252
#> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
#> [5] LC_TIME=Spanish_Spain.1252
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] countrycode_1.2.0
#>
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
#>  [5] yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6   rmarkdown_2.6
#>  [9] highr_0.8       knitr_1.31      stringr_1.4.0   xfun_0.19
#> [13] digest_0.6.25   rlang_0.4.10    evaluate_0.14
```

Created on 2021-02-10 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)

Reprex after PR

``` r
library(countrycode)

# Test
countrycode("NAM", "iso3c", "iso2c")
#> [1] "NA"
countrycode("NAM", "iso3c", "eurostat")
#> [1] "NA"

# Analyze
df <- codelist

# Filter Namibia
check <- df[df$country.name.en == "Namibia",]

# Check NA cols
NAscol <- colnames(check)[is.na(check[1, ])]

# Drop the cldr fields
NAscol <- NAscol[-grep("cldr", NAscol)]
NAscol
#> [1] "eu28"

sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.7 LTS
#>
#> Matrix products: default
#> BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
#> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
#>
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] countrycode_1.2.0
#>
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.27     assertthat_0.2.1  magrittr_2.0.1    reprex_1.0.0
#>  [5] evaluate_0.14     highr_0.8         stringi_1.5.3     rlang_0.4.10
#>  [9] cli_2.3.0         rstudioapi_0.13   fs_1.5.0          rmarkdown_2.6
#> [13] tools_4.0.3       stringr_1.4.0     glue_1.4.2        xfun_0.20
#> [17] yaml_2.2.1        compiler_4.0.3    htmltools_0.5.1.1 knitr_1.31
```

Created on 2021-02-10 by the [reprex package](https://reprex.tidyverse.org) (v1.0.0)

Now only eu28 is missing, which is fine (I left the cldr* fields out of the exercise for clarity).

I have prepared a PR that hopefully fixes this issue.

Regards

cjyetman commented 3 years ago

Thanks! and, confirmed...

``` r
library(countrycode)
is.na(countrycode("NAM", "iso3c", "eurostat"))
#> Warning in countrycode("NAM", "iso3c", "eurostat"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "ecb"))
#> Warning in countrycode("NAM", "iso3c", "ecb"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "eu28"))
#> Warning in countrycode("NAM", "iso3c", "eu28"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "genc2c"))
#> Warning in countrycode("NAM", "iso3c", "genc2c"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "wb_api2c"))
#> Warning in countrycode("NAM", "iso3c", "wb_api2c"): Some values were not matched unambiguously: NAM
#> [1] TRUE
```

cjyetman commented 3 years ago

I would say that this issue is not fully "fixed" until the scrapers for each of these codes have been fixed. Maybe each should be split into its own issue so that they can be addressed separately?

cjyetman commented 3 years ago

Also of note... now that this tidyverse/rvest/issues/107 has finally been resolved, we can probably make dictionary/get_ecb.R work better.

cjyetman commented 3 years ago

on the other hand, a similar issue in jsonlite is still unresolved, so still requires workarounds... jeroen/jsonlite/issues/98 jeroen/jsonlite/issues/314

vincentarelbundock commented 3 years ago

I'm not sure the problem is (entirely) related to our scrapers. It seems reader-related to me. For instance, the "NA" string in data_genc.csv is correctly double-quoted in the raw CSV:

https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/data_genc.csv

The data.table::fread function reads it correctly, but read.csv and read_csv do not:

``` r
setwd("~/repos/countrycode")
library(readr)
library(data.table)

# Base R
x = read.csv("dictionary/data_genc.csv")
"NA" %in% x$genc2c
#> [1] FALSE

# tidyverse
y = read_csv("dictionary/data_genc.csv")
"NA" %in% y$genc2c
#> [1] FALSE

# data.table
z = fread("dictionary/data_genc.csv")
"NA" %in% z$genc2c
#> [1] TRUE
```

vincentarelbundock commented 3 years ago

I made a minor commit with:

  1. New tests for Namibia
  2. A fix to the genc scraper with a Namibia-specific assertion

Obviously, if the saved data is not properly double-quoted, we should fix the scraper, but I'd like to get to the bottom of the read_csv issue instead because that feels like the more general solution.

https://github.com/vincentarelbundock/countrycode/commit/8f0ff1e6a792348673e12908cc765dcaccc1a1e1

vincentarelbundock commented 3 years ago

An even more minimal example:

``` r
library(readr)
library(data.table)

csv <- 'x,y
"1","NA"
"NA","2"'

str(read_csv(csv))
#> spec_tbl_df [2 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ x: num [1:2] 1 NA
#>  $ y: num [1:2] NA 2
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   x = col_double(),
#>   ..   y = col_double()
#>   .. )

str(fread(csv))
#> Classes 'data.table' and 'data.frame':   2 obs. of  2 variables:
#>  $ x: chr  "1" "NA"
#>  $ y: chr  "NA" "2"
#>  - attr(*, ".internal.selfref")=<externalptr>
```

vincentarelbundock commented 3 years ago

Maybe we just set na to "" in read_csv (the equivalent of na.strings in read.csv).
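
For base R readers, the argument that controls this is `na.strings`. A minimal sketch of the idea on a toy CSV; the data and the use of `colClasses = "character"` are assumptions for illustration, not what the package's build scripts do:

``` r
# Toy CSV: a quoted "NA" that is a real value (Namibia's 2-letter code),
# plus one genuinely empty field in the last row.
csv <- 'code,name\n"NA","Namibia"\n"DE","Germany"\n,"Unknown"\n'

# na.strings = "" means only empty fields become missing values.
# colClasses = "character" is an assumption made here to avoid type guessing.
codes <- read.csv(text = csv, na.strings = "", colClasses = "character")

has_namibia_code <- "NA" %in% codes$code   # is the string "NA" still present?
n_missing <- sum(is.na(codes$code))        # how many real missing values?
```

With this setting the string "NA" should survive as data, and only the empty field should come back as a missing value.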

vincentarelbundock commented 3 years ago

Sorry for the multiple comments, but I pushed a change to add a bunch of na="" everywhere. This seems to fix everything, and my new tests now pass.

I think we're good to close, but it would be great if either of you could make sure the github version works locally.

dieghernan commented 3 years ago

Hi! So now it seems that some real NA are treated as "NA" (see https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/codelist_without_cldr.csv col eu28).

I was not able to check it locally yet, but as per my limited knowledge of the package, I guess that if checks are passed it’s because those new "NA" are on destination only fields (not really sure about this...)

I wonder if it would be possible to add an extra sanity check in dictionary/build.R that converts "NA" back to NA in destination-only fields, to avoid confusion.

Does it make any sense?
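
A rough sketch of what such a check could look like. The helper name `na_string_to_missing`, the toy data, and the choice of columns are all hypothetical; nothing like this is confirmed to exist in dictionary/build.R:

``` r
# Hypothetical helper: turn the literal string "NA" back into a missing value,
# but only in the columns named in `cols`.
na_string_to_missing <- function(df, cols) {
  for (col in cols) {
    is_na_string <- !is.na(df[[col]]) & df[[col]] == "NA"
    df[[col]][is_na_string] <- NA
  }
  df
}

# Illustrative data, not taken from the real codelist.
toy <- data.frame(
  iso2c = c("NA", "DE"),  # "NA" is Namibia's real code: leave it alone
  eu28  = c("NA", "EU"),  # here "NA" stands for "no value"
  stringsAsFactors = FALSE
)

fixed <- na_string_to_missing(toy, cols = "eu28")
```

Only the columns passed in `cols` are touched, so a code column where "NA" is a legitimate value keeps it.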

vincentarelbundock commented 3 years ago

That would be more explicit, but countrycode uses a strict one-to-one mapping between codes. So in principle if it works in one direction it will work in the other. Here, for example, we have:

``` r
library(countrycode)

countrycode("NA", "genc2c", "country.name")
#> [1] "Namibia"

countrycode(NA, "genc2c", "country.name")
#> Error in countrycode(NA, "genc2c", "country.name"): sourcevar must be a character or numeric vector. This error often arises when users pass a tibble (e.g., from dplyr) instead of a column vector from a data.frame (i.e., my_tbl[, 2] vs. my_df[, 2] vs. my_tbl[[2]])
```

The error is not super informative, I'll admit that ;)
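
The difference between the string "NA" and a missing value can also be seen without the package. This is a toy lookup using base R's `match()`, with made-up data; it is not countrycode's actual implementation:

``` r
# Toy dictionary; not the real codelist.
dict <- data.frame(
  genc2c  = c("NA", "DE"),
  country = c("Namibia", "Germany"),
  stringsAsFactors = FALSE
)

# The string "NA" is an ordinary key and finds its row.
hit <- dict$country[match("NA", dict$genc2c)]

# A missing key does not match the string "NA".
miss <- dict$country[match(NA_character_, dict$genc2c)]
```

Here `hit` is "Namibia" while `miss` is a missing value, so a lookup keyed on the string works in both directions as long as the string itself is preserved when the dictionary is read.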

cjyetman commented 3 years ago

The "proper" way to deal with this in readr is to set the na argument, which by default is na = c("", "NA")

``` r
readr::read_csv('x,y\n"US","NA"\n"NA","DE"')
#> # A tibble: 2 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 US    <NA> 
#> 2 <NA>  DE
readr::read_csv('x,y\n"US","NA"\n"NA","DE"', na = "")
#> # A tibble: 2 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 US    NA   
#> 2 NA    DE
```

vincentarelbundock commented 3 years ago

@cjyetman this is exactly what I did everywhere in my new commit.

dieghernan commented 3 years ago

One thing I maybe didn't explain well is that the only scraper that was not working properly was eurostat, at least among the four coding schemes I mentioned. The other three had the value in the CSV as "NA".

cjyetman commented 3 years ago

Also pay attention to where the CSVs are being written (readr::format_csv is equivalent to readr::write_csv except that it returns the string rather than writing it to a file)...

``` r
readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,')
#> # A tibble: 3 x 2
#>   x     y    
#>   <lgl> <lgl>
#> 1 NA    NA   
#> 2 NA    NA   
#> 3 NA    NA
# all are converted to <NA>s

readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,', na = "")
#> # A tibble: 3 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 NA    NA   
#> 2 NA    NA   
#> 3 <NA>  <NA>
# only the last row is converted to <NA>s

data <- readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,', na = "")

readr::format_csv(data)
#> [1] "x,y\n\"NA\",\"NA\"\n\"NA\",\"NA\"\nNA,NA\n"

readr::format_csv(data, na = "")
#> [1] "x,y\nNA,NA\nNA,NA\n,\n"
```

Technically, a string should not be quoted unless it's necessary:

``` r
string <- 'x,y\n"NA","NA"\nNA,NA\n,'

readr::format_csv(readr::read_csv(string))
#> [1] "x,y\nNA,NA\nNA,NA\nNA,NA\n"

readr::format_csv(readr::read_csv(string, na = ""))
#> [1] "x,y\n\"NA\",\"NA\"\n\"NA\",\"NA\"\nNA,NA\n"

readr::format_csv(readr::read_csv(string, na = ""), na = "")
#> [1] "x,y\nNA,NA\nNA,NA\n,\n"
```

cjyetman commented 3 years ago

> @cjyetman this is exactly what I did everywhere in my new commit.

I think that's the best thing to do... but again, just be careful that if any CSVs are written, they don't write <NA> as NA (without the quotes), which some CSV writers will do by default (for instance readr: readr::format_csv(data.frame(x = NA)) returns "x\nNA\n").

vincentarelbundock commented 3 years ago

Yes, I added na="" to write calls too.

I think this is fixed. Feel free to reopen or comment if it still fails after reinstalling from GitHub.

cjyetman commented 3 years ago

better example of why you need to be careful of both ends of the round trip...

``` r
my_csv <- readr::format_csv(data.frame(x = c("A", NA), y = c(NA, "B")))
readr::read_csv(my_csv, na = "")[[1]]
#> [1] "A"  "NA"
# BAD

my_csv <- readr::format_csv(data.frame(x = c("A", NA), y = c(NA, "B")), na = "")
readr::read_csv(my_csv, na = "")[[1]]
#> [1] "A" NA
# GOOD
```

if na = "" is used everywhere for readr::read_csv, the same has to be used everywhere for readr::write_csv
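
The same round-trip concern applies if base R is used instead of readr. A minimal sketch with `write.csv` and `read.csv`, on made-up data and with `colClasses = "character"` assumed for simplicity:

``` r
# Illustrative data: one real "NA" code and one genuinely missing code.
codes <- data.frame(
  code = c("NA", NA),
  name = c("Namibia", "Unknown"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")

# Write missing values as empty fields, so they cannot be confused
# with the string "NA".
write.csv(codes, path, row.names = FALSE, na = "")

# Read back with only empty fields treated as missing.
back <- read.csv(path, na.strings = "", colClasses = "character")

round_trip_ok <- identical(back$code, codes$code)
```

If either side of the round trip falls back to its default handling of NA, the string and the missing value can no longer be told apart.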