Open cbizzo opened 4 years ago
Thank you for reporting and proposing a solution.
These connection issues are really tricky. This could also create problems, as in some systems file()
could be using a curl-library, in which redirections seems to be a problem, as in #181. I resolved that (i think) by moving away from download.file
to directly using readr::read_tsv, but could it then have these proxy problems?
Do someone have idea, what could be most coherent solution?
Also, have you tried to set proxy options in get_eurostat
? like config = httr::use_proxy(url, port, username, password)
.
I have a lot of sympathy for anyone who has to deal with connection issues! Thanks for taking the time to respond to me :)
I had a look through #181 and also through the read_tsv calls. I think in the two cases that count wrapping the variable tname
and url
inside a url()
call, is enough to invoke the system preferred library:
price_url<-eurostat:::eurostat_json_url("prc_hicp_midx",
filters = list(geo="EA",
unit="I96",
coicop= c('CP00', 'CP011', 'CP02', 'CP03')),
lang ='en')
options('url.method'= 'default')
url(price_url)
# A connection with
# description "http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&unit=I96&coicop=CP00&coicop=CP011&coicop=CP02&coicop=CP03"
# class "url-wininet"
# mode "r"
# text "text"
# opened "closed"
# can read "yes"
# can write "no"
options('url.method'= 'libcurl')
url(price_url)
# A connection with
# description "http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&unit=I96&coicop=CP00&coicop=CP011&coicop=CP02&coicop=CP03"
# class "url-libcurl"
# mode "r"
# text "text"
# opened "closed"
# can read "yes"
# can write "no"
options('url.method'= 'internal')
url(price_url)
# A connection with
# description "http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&unit=I96&coicop=CP00&coicop=CP011&coicop=CP02&coicop=CP03"
# class "url"
# mode "r"
# text "text"
# opened "closed"
# can read "yes"
# can write "no"
I think without wrapping a the URL value with the url()
function, you would create problems for users who are working behind a proxy.
Nevertheless, I looked into the internals of readr
, specifically read_tsv
and have copied out the part which handles the file download and pasted it into the version of the function I shared above.
get_eurostat_json_2<-function(id, filters = NULL, type = c("code", "label",
"both"), lang = c("en", "fr", "de"),
stringsAsFactors = default.stringsAsFactors(), ...){
json_url <- eurostat_json_url(id = id, filters = filters,
lang = lang)
file<-url(json_url)
name<- readr:::source_name(file)
file<- readr:::standardise_path(file)
if (readr:::is.connection(file)) {
data <- readr:::datasource_connection(file, skip=F, skip_empty_rows=F)
} else{
stop("file is not a valid connection")
}
content <- file(data[[1]])
json <- readLines(content, warn = F)
close(content)
jdat<-jsonlite::fromJSON(json)
type <- 'code'
dims <- jdat$dimension
ids <- jdat$id
dims_list <- lapply(dims[rev(ids)], function(x) {
y <- x$category$label
if (type[1] == "label") {
y <- unlist(y, use.names = FALSE)
}
else if (type[1] == "code") {
y <- names(unlist(y))
}
else if (type[1] == "both") {
y <- unlist(y)
}
else {
stop("Invalid type ", type)
}
})
variables <- expand.grid(dims_list, KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = stringsAsFactors)
dat <- as.data.frame(variables[rev(names(variables))])
vals <- unlist(jdat$value, use.names = FALSE)
dat$values <- rep(NA, nrow(dat))
inds <- 1 + as.numeric(names(jdat$value))
if (!length(vals) == length(inds)) {
stop("Complex indexing not implemented.")
}
dat$values[inds] <- vals
tibble::as_tibble(dat)
}
One last thought though, on how to merge this version with the existing version, if you wrap the URL generated in a url()
function, you can conditionally use WinINet if it is the default option set, otherwise, it defaults to whatever httr
does. :
get_eurostat_json<-function (id, filters = NULL, type = c("code", "label",
"both"), lang = c("en", "fr", "de"),
stringsAsFactors = default.stringsAsFactors(), ...)
{
if (!check_access_to_data()) {
message("You have no access to ec.europe.eu. \n Please check your connection and/or review your proxy settings")
}
else {
url <- eurostat_json_url(id = id, filters = filters,
lang = lang)
file<- url(url)
if(summary(file)$class == 'url-wininet'){
name<- readr:::source_name(file)
file<- readr:::standardise_path(file)
if (readr:::is.connection(file)) {
data <- readr:::datasource_connection(file, skip=F, skip_empty_rows=F)
} else{
stop("file is not a valid connection")
}
content <- file(data[[1]])
json <- readLines(content, warn = F)
close(content)
jdat<-jsonlite::fromJSON(json)
} else{
resp <- httr::GET(url)
if (httr::http_error(resp)) {
stop(paste("The requested url cannot be found within the get_eurostat_json function:",
url))
}
status <- httr::status_code(resp)
msg <- ". Some datasets are not accessible via the eurostat\n interface. You can try to search the data manually from the comext\n \t database at http://epp.eurostat.ec.europa.eu/newxtweb/ or bulk\n \t download facility at\n \t http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing\n \t or annual Excel files\n \t http://ec.europa.eu/eurostat/web/prodcom/data/excel-files-nace-rev.2"
if (status == 200) {
jdat <- jsonlite::fromJSON(url)
}
else if (status == 400) {
stop("Failure to get data. Probably invalid dataset id. Status code: ",
status, msg)
}
else if (status == 500) {
stop("Failure to get data. Probably filters did not return any data \n or data exceeded query size limitation. Status code: ",
status, msg)
}
else {
stop("Failure to get data. Status code: ",
status, msg)
}
}
dims <- jdat$dimension
ids <- jdat$id
dims_list <- lapply(dims[rev(ids)], function(x) {
y <- x$category$label
if (type[1] == "label") {
y <- unlist(y, use.names = FALSE)
}
else if (type[1] == "code") {
y <- names(unlist(y))
}
else if (type[1] == "both") {
y <- unlist(y)
}
else {
stop("Invalid type ", type)
}
})
variables <- expand.grid(dims_list, KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = stringsAsFactors)
dat <- as.data.frame(variables[rev(names(variables))])
vals <- unlist(jdat$value, use.names = FALSE)
dat$values <- rep(NA, nrow(dat))
inds <- 1 + as.numeric(names(jdat$value))
if (!length(vals) == length(inds)) {
stop("Complex indexing not implemented.")
}
dat$values[inds] <- vals
tibble::as_tibble(dat)
}
}
The problem for setting the proxy on my side is not one for me, but for new users internally, for whom any speak of networking is already 'very deep' technical knowledge :) .
If this can be solved, it would be great and highly relevant. The only thing I am concerned about is that we might like to be careful with how much extra overheads and instability the possible solution would introduce in the code. This is not the only package with this problem, and in general institutions may need to develop more generic solutions for these anyway if there is a need to use these workflows. Although, making it easier for some packages might help to advance those efforts..
Certainly. If this can't be incorporated because of complexity/stability risks, perhaps it is sufficient to leave this issue here. Others may copy the code snippet above if they face similar problems.
Thanks @cbizzo for working on this issue. I am also bit hesitant to use non-exported functions as solution. However, your first solution using file()
could be fine, but we should be wary of the redirection issue as well.
My understandin is that majorty of users do not have these issues, but
httr::GET()
might be blocked by proxy.
In some systems url
(and file
I guess) uses curl-library in which case redirections are not working without -L
.
Question is how to solve them both with a robust solution.
Hi, I noticed that the
filter
option ofget_eurostat
breaks when working on Windows and behind a corporate firewall.However if I don't specify the filters this works fine:
Digging deeper, I can see it is caused by the
httr::GET()
command inget_eurostat_json
, which is invoked when the user specifies a filter value. The difference is, when no filter option is calledfile.download
is used to get the data, which may use WinINet, or libcurl depending on the users installation, whereas if the filter option is used the user is forced to use libcurl (via httr) which may require proxy settings.Could you add an option or a separate function which allows the user to download using
file()
instead of httr?file()
will use the system default (libcurl on UNIX/Linux, WinINet on Windows).The below code replicates your excellent work but has some changes that allow the download to work behind a corporate firewall, without need to revert to the proxy:
I think this issue is related (at least thematically) to an issue I raised before https://github.com/rOpenGov/eurostat/issues/150