Closed dieghernan closed 3 years ago
Thanks! and, confirmed...
library(countrycode)
is.na(countrycode("NAM", "iso3c", "eurostat"))
#> Warning in countrycode("NAM", "iso3c", "eurostat"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "ecb"))
#> Warning in countrycode("NAM", "iso3c", "ecb"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "eu28"))
#> Warning in countrycode("NAM", "iso3c", "eu28"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "genc2c"))
#> Warning in countrycode("NAM", "iso3c", "genc2c"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "wb_api2c"))
#> Warning in countrycode("NAM", "iso3c", "wb_api2c"): Some values were not matched unambiguously: NAM
#> [1] TRUE
I would say that this issue is not fully "fixed" until the scrapers for each of these codes have been fixed. Maybe each should be split into its own issue so that they can be addressed separately?
Also of note... now that tidyverse/rvest/issues/107 has finally been resolved, we can probably make dictionary/get_ecb.R work better.
On the other hand, a similar issue in jsonlite is still unresolved, so it still requires workarounds... jeroen/jsonlite/issues/98 jeroen/jsonlite/issues/314
I'm not sure the problem is (entirely) related to our scrapers. It seems reader-related to me. For instance, the "NA" string in data_genc.csv is correctly double-quoted in the raw CSV:
https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/data_genc.csv
data.table::fread does a good job of reading it, but read.csv and read_csv do not:
setwd("~/repos/countrycode")
library(readr)
library(data.table)
# Base R
x = read.csv("dictionary/data_genc.csv")
"NA" %in% x$genc2c
#> [1] FALSE
# tidyverse
y = read_csv("dictionary/data_genc.csv")
"NA" %in% y$genc2c
#> [1] FALSE
# data.table
z = fread("dictionary/data_genc.csv")
"NA" %in% z$genc2c
#> [1] TRUE
I made a minor commit with: genc scraper with a Namibia-specific assertion.
Obviously, if the saved data is not properly double-quoted, we should fix the scraper, but I'd like to get to the bottom of the read_csv issue instead because that feels like the more general solution.
https://github.com/vincentarelbundock/countrycode/commit/8f0ff1e6a792348673e12908cc765dcaccc1a1e1
An even more minimal example:
library(readr)
library(data.table)
csv <- 'x,y
"1","NA"
"NA","2"'
str(read_csv(csv))
#> spec_tbl_df [2 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ x: num [1:2] 1 NA
#> $ y: num [1:2] NA 2
#> - attr(*, "spec")=
#> .. cols(
#> .. x = col_double(),
#> .. y = col_double()
#> .. )
str(fread(csv))
#> Classes 'data.table' and 'data.frame': 2 obs. of 2 variables:
#> $ x: chr "1" "NA"
#> $ y: chr "NA" "2"
#> - attr(*, ".internal.selfref")=<externalptr>
Maybe we just set na = "" in read_csv (the readr equivalent of na.strings).
Sorry for the multiple comments, but I pushed a change to add a bunch of na="" everywhere. This seems to fix everything, and my new tests now pass.
I think we're good to close, but it would be great if either of you could make sure the github version works locally.
Hi! So now it seems that some real NA are treated as "NA" (see https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/codelist_without_cldr.csv col eu28).
I was not able to check it locally yet, but as per my limited knowledge of the package, I guess that if checks are passed it’s because those new "NA" are on destination-only fields (not really sure about this...)
I wonder if it would be possible to add an extra sanity check in dictionary/build.R that converts "NA" back to NA on destination-only fields, to avoid confusion.
Does that make any sense?
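A minimal sketch of what such a sanity check could look like, in base R. The helper name restore_missing, the toy data frame, and the choice of eu28 as the destination-only column are assumptions for illustration only; this is not the code in dictionary/build.R.

```r
# Hypothetical helper: turn literal "NA" strings back into real missing values,
# but only in the columns listed as destination-only.
restore_missing <- function(dictionary, destination_only) {
  for (col in intersect(destination_only, names(dictionary))) {
    values <- dictionary[[col]]
    if (is.character(values)) {
      values[!is.na(values) & values == "NA"] <- NA_character_
      dictionary[[col]] <- values
    }
  }
  dictionary
}

# Toy data, not the real codelist. Assumption: genc2c can be used as an
# origin (so "NA" must stay a string for Namibia), eu28 is destination-only.
toy <- data.frame(
  iso3c  = c("NAM", "FRA"),
  genc2c = c("NA", "FR"),
  eu28   = c("NA", "EU"),
  stringsAsFactors = FALSE
)
cleaned <- restore_missing(toy, destination_only = "eu28")
```

Columns that can serve as an origin are left untouched, so the Namibia code in genc2c survives while the spurious "NA" string in eu28 becomes a real missing value again.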
That would be more explicit, but countrycode uses a strict one-to-one mapping between codes. So in principle if it works in one direction it will work in the other. Here, for example, we have:
library(countrycode)
countrycode("NA", "genc2c", "country.name")
#> [1] "Namibia"
countrycode(NA, "genc2c", "country.name")
#> Error in countrycode(NA, "genc2c", "country.name") :
#> sourcevar must be a character or numeric vector. This error often
#> arises when users pass a tibble (e.g., from dplyr) instead of a
#> column vector from a data.frame (i.e., my_tbl[, 2] vs. my_df[, 2]
#> vs. my_tbl[[2]])
The error is not super informative, I'll admit that ;)
The "proper" way to deal with this in readr is to set the na argument, which by default is na = c("", "NA"):
readr::read_csv('x,y\n"US","NA"\n"NA","DE"')
#> # A tibble: 2 x 2
#> x y
#> <chr> <chr>
#> 1 US <NA>
#> 2 <NA> DE
readr::read_csv('x,y\n"US","NA"\n"NA","DE"', na = "")
#> # A tibble: 2 x 2
#> x y
#> <chr> <chr>
#> 1 US NA
#> 2 NA DE
@cjyetman this is exactly what I did everywhere in my new commit.
One thing maybe I didn’t explain well is that the only scraper that was not working properly was eurostat, at least for the four coding schemes I mentioned. The other three displayed the value in the CSV as "NA".
Also pay attention to where the CSVs are being written (readr::format_csv is equivalent to readr::write_csv except that it returns the string rather than writing it to a file)...
readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,')
#> # A tibble: 3 x 2
#> x y
#> <lgl> <lgl>
#> 1 NA NA
#> 2 NA NA
#> 3 NA NA
# all are converted to <NA>s
readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,', na = "")
#> # A tibble: 3 x 2
#> x y
#> <chr> <chr>
#> 1 NA NA
#> 2 NA NA
#> 3 <NA> <NA>
# only the last row is converted to <NA>s
data <- readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,', na = "")
readr::format_csv(data)
#> [1] "x,y\n\"NA\",\"NA\"\n\"NA\",\"NA\"\nNA,NA\n"
readr::format_csv(data, na = "")
#> [1] "x,y\nNA,NA\nNA,NA\n,\n"
technically, a string should not be quoted unless it's necessary
string <- 'x,y\n"NA","NA"\nNA,NA\n,'
readr::format_csv(readr::read_csv(string))
#> [1] "x,y\nNA,NA\nNA,NA\nNA,NA\n"
readr::format_csv(readr::read_csv(string, na = ""))
#> [1] "x,y\n\"NA\",\"NA\"\n\"NA\",\"NA\"\nNA,NA\n"
readr::format_csv(readr::read_csv(string, na = ""), na = "")
#> [1] "x,y\nNA,NA\nNA,NA\n,\n"
> @cjyetman this is exactly what I did everywhere in my new commit.
I think that's the best thing to do... but again, just be careful that if any CSVs are written, they don't write <NA> as NA (without the quotes), which some CSV writers will do by default (for instance readr, e.g. readr::format_csv(data.frame(x = NA)) # [1] "x\nNA\n").
Yes, I added na="" to write calls too.
I think this is fixed. Feel free to reopen or comment if it still fails after reinstall from GH
better example of why you need to be careful of both ends of the round trip...
my_csv <- readr::format_csv(data.frame(x = c("A", NA), y = c(NA, "B")))
readr::read_csv(my_csv, na = "")[[1]]
#> [1] "A" "NA"
# BAD
my_csv <- readr::format_csv(data.frame(x = c("A", NA), y = c(NA, "B")), na = "")
readr::read_csv(my_csv, na = "")[[1]]
#> [1] "A" NA
# GOOD
if na = "" is used everywhere for readr::read_csv, the same has to be used everywhere for readr::write_csv
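One way to keep both ends in sync is to route every read and write through a pair of small wrappers that share a single missing-value token. The sketch below uses base R's write.csv and read.csv rather than readr, and the names NA_TOKEN, write_dictionary and read_dictionary are hypothetical; it only illustrates the idea and is not the package's actual code.

```r
# Hypothetical wrappers: one shared token for missing values on both ends.
NA_TOKEN <- ""

write_dictionary <- function(x, path) {
  # Real missing values are written as empty fields; the string "NA" is quoted.
  write.csv(x, path, row.names = FALSE, na = NA_TOKEN)
}

read_dictionary <- function(path) {
  # Only empty fields are read back as missing; the string "NA" is kept.
  read.csv(path, na.strings = NA_TOKEN, stringsAsFactors = FALSE)
}

# Toy data: "NA" is Namibia's two-letter code, the second row is truly missing.
toy <- data.frame(
  iso3c  = c("NAM", "XXX"),
  genc2c = c("NA", NA),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")
write_dictionary(toy, path)
back <- read_dictionary(path)
```

Because the token is defined once, a change to how missing values are encoded cannot be applied to the reader and forgotten in the writer.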
Again Namibia. I have realised that in four coding schemes (eurostat, genc2c, wb_api2c, ecb) Namibia is missing, since in all of them the 2-letter code is NA. See sources:
get_eurostat: https://github.com/vincentarelbundock/countrycode/blob/75e3263b8e53372a84e0a5d6bd3da2f408a44807/dictionary/get_eurostat.R#L4
Reprex with the latest CRAN release
Reprex after PR
Now only eu28 is missing, which is OK (I leave the cldr* fields out of the exercise for clarity).
I have prepared a PR that hopefully fixes this issue.
Regards