rOpenGov / eurostat

R tools for Eurostat data
http://ropengov.github.io/eurostat
Other
235 stars 46 forks source link

filter option in get_eurostat() doesn't work behind firewall #185

Open cbizzo opened 4 years ago

cbizzo commented 4 years ago

Hi, I noticed that the filter option of get_eurostat breaks when working on Windows and behind a corporate firewall.

price<-eurostat::get_eurostat("prc_hicp_midx", filters = list(geo="EA", unit="I96", coicop= c('CP00', 'CP011', 'CP02', 'CP03')), type = 'label')
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Could not resolve host: ec.europa.eu

However if I don't specify the filters this works fine:

price<-eurostat::get_eurostat("prc_hicp_midx", type = 'label')
trying URL 'https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fprc_hicp_midx.tsv.gz'
Content type 'application/octet-stream;charset=UTF-8' length 11874905 bytes (11.3 MB)
downloaded 11.3 MB

Digging deeper, I can see it is caused by the httr::GET() command in get_eurostat_json, which is invoked when the user specifies a filter value. The difference is, when no filter option is called file.download is used to get the data, which may use WinINet, or libcurl depending on the users installation, whereas if the filter option is used the user is forced to use libcurl (via httr) which may require proxy settings.

Could you add an option or a separate function which allows the user to download using file() instead of httr? file() will use the system default (libcurl on UNIX/Linux, WinINet on Windows).

The below code replicates your excellent work but has some changes that allow the download to work behind a corporate firewall, without need to revert to the proxy:

get_eurostat_json_2<-function(id, filters = NULL, type = c("code", "label", 
                                                           "both"), lang = c("en", "fr", "de"), 
                              stringsAsFactors = default.stringsAsFactors(), ...){

  url <- eurostat_json_url(id = id, filters = filters, 
                           lang = lang)
  content <- file(url)
  json <- readLines(content, warn = F)
  close(content)
  jdat<-jsonlite::fromJSON(json)  
  type <- 'code' 
  dims <- jdat$dimension
  ids <- jdat$id
  dims_list <- lapply(dims[rev(ids)], function(x) {
    y <- x$category$label
   if (type[1] == "label") {
     y <- unlist(y, use.names = FALSE)
   }
   else if (type[1] == "code") {
     y <- names(unlist(y))
    }
  else if (type[1] == "both") {
    y <- unlist(y)
    }
    else {
     stop("Invalid type ", type)
   }
  })
  variables <- expand.grid(dims_list, KEEP.OUT.ATTRS = FALSE, 
                          stringsAsFactors = stringsAsFactors)
  dat <- as.data.frame(variables[rev(names(variables))])
  vals <- unlist(jdat$value, use.names = FALSE)
  dat$values <- rep(NA, nrow(dat))
  inds <- 1 + as.numeric(names(jdat$value))
  if (!length(vals) == length(inds)) {
   stop("Complex indexing not implemented.")
  }
  dat$values[inds] <- vals
  tibble::as_tibble(dat)
}

I think this issue is related (at least thematically) to an issue I raised before https://github.com/rOpenGov/eurostat/issues/150

jhuovari commented 4 years ago

Thank you for reporting and proposing a solution.

These connection issues are really tricky. This could also create problems, as in some systems file()could be using a curl-library, in which redirections seems to be a problem, as in #181. I resolved that (i think) by moving away from download.file to directly using readr::read_tsv, but could it then have these proxy problems?

Do someone have idea, what could be most coherent solution?

Also, have you tried to set proxy options in get_eurostat? like config = httr::use_proxy(url, port, username, password).

cbizzo commented 4 years ago

I have a lot of sympathy for anyone who has to deal with connection issues! Thanks for taking the time to respond to me :)

I had a look through #181 and also through the read_tsv calls. I think in the two cases that count wrapping the variable tname and url inside a url() call, is enough to invoke the system preferred library:

 price_url<-eurostat:::eurostat_json_url("prc_hicp_midx", 
                                         filters = list(geo="EA", 
                                                        unit="I96",
                                                        coicop= c('CP00', 'CP011', 'CP02', 'CP03')),
                                         lang  ='en')

options('url.method'= 'default')
url(price_url)
# A connection with                                                                                                                                                     
# description "http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&unit=I96&coicop=CP00&coicop=CP011&coicop=CP02&coicop=CP03"
# class       "url-wininet"                                                                                                                            
# mode        "r"                                                                                                                                      
# text        "text"                                                                                                                                   
# opened      "closed"                                                                                                                                 
# can read    "yes"                                                                                                                                    
# can write   "no"                                                                                                                                     

options('url.method'= 'libcurl')
url(price_url)
# A connection with                                                                                                                                                     
# description "http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&unit=I96&coicop=CP00&coicop=CP011&coicop=CP02&coicop=CP03"
# class       "url-libcurl"                                                                                                                            
# mode        "r"                                                                                                                                      
# text        "text"                                                                                                                                   
# opened      "closed"                                                                                                                                 
# can read    "yes"                                                                                                                                    
# can write   "no"                                                                                                                                     

options('url.method'= 'internal')
url(price_url)
# A connection with                                                                                                                                                     
# description "http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_midx?geo=EA&unit=I96&coicop=CP00&coicop=CP011&coicop=CP02&coicop=CP03"
# class       "url"                                                                                                                                    
# mode        "r"                                                                                                                                      
# text        "text"                                                                                                                                   
# opened      "closed"                                                                                                                                 
# can read    "yes"                                                                                                                                    
# can write   "no"                                                                      

I think without wrapping a the URL value with the url() function, you would create problems for users who are working behind a proxy.

Nevertheless, I looked into the internals of readr, specifically read_tsv and have copied out the part which handles the file download and pasted it into the version of the function I shared above.

get_eurostat_json_2<-function(id, filters = NULL, type = c("code", "label", 
                                                           "both"), lang = c("en", "fr", "de"), 
                              stringsAsFactors = default.stringsAsFactors(), ...){

  json_url <- eurostat_json_url(id = id, filters = filters, 
                           lang = lang)

  file<-url(json_url)
  name<- readr:::source_name(file)
  file<- readr:::standardise_path(file)
  if (readr:::is.connection(file)) {
    data <- readr:::datasource_connection(file, skip=F, skip_empty_rows=F)
  } else{
    stop("file is not a valid connection")
  }
  content <- file(data[[1]])
  json <- readLines(content, warn = F)
  close(content)
  jdat<-jsonlite::fromJSON(json)  
  type <- 'code' 
  dims <- jdat$dimension
  ids <- jdat$id
  dims_list <- lapply(dims[rev(ids)], function(x) {
    y <- x$category$label
    if (type[1] == "label") {
      y <- unlist(y, use.names = FALSE)
    }
    else if (type[1] == "code") {
      y <- names(unlist(y))
    }
    else if (type[1] == "both") {
      y <- unlist(y)
    }
    else {
      stop("Invalid type ", type)
    }
  })
  variables <- expand.grid(dims_list, KEEP.OUT.ATTRS = FALSE, 
                           stringsAsFactors = stringsAsFactors)
  dat <- as.data.frame(variables[rev(names(variables))])
  vals <- unlist(jdat$value, use.names = FALSE)
  dat$values <- rep(NA, nrow(dat))
  inds <- 1 + as.numeric(names(jdat$value))
  if (!length(vals) == length(inds)) {
    stop("Complex indexing not implemented.")
  }
  dat$values[inds] <- vals
  tibble::as_tibble(dat)
}

One last thought though, on how to merge this version with the existing version, if you wrap the URL generated in a url() function, you can conditionally use WinINet if it is the default option set, otherwise, it defaults to whatever httr does. :

get_eurostat_json<-function (id, filters = NULL, type = c("code", "label", 
                                       "both"), lang = c("en", "fr", "de"), 
          stringsAsFactors = default.stringsAsFactors(), ...) 
{
  if (!check_access_to_data()) {
    message("You have no access to ec.europe.eu. \n      Please check your connection and/or review your proxy settings")
  }
  else {
    url <- eurostat_json_url(id = id, filters = filters, 
                             lang = lang)
    file<- url(url)
    if(summary(file)$class == 'url-wininet'){
      name<- readr:::source_name(file)
      file<- readr:::standardise_path(file)
      if (readr:::is.connection(file)) {
        data <- readr:::datasource_connection(file, skip=F, skip_empty_rows=F)
      } else{
        stop("file is not a valid connection")
      }
      content <- file(data[[1]])
      json <- readLines(content, warn = F)
      close(content)
      jdat<-jsonlite::fromJSON(json)  
    } else{

      resp <- httr::GET(url)
      if (httr::http_error(resp)) {
        stop(paste("The requested url cannot be found within the get_eurostat_json function:", 
                   url))
      }
      status <- httr::status_code(resp)
      msg <- ". Some datasets are not accessible via the eurostat\n          interface. You can try to search the data manually from the comext\n  \t  database at http://epp.eurostat.ec.europa.eu/newxtweb/ or bulk\n  \t  download facility at\n  \t  http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing\n  \t  or annual Excel files\n  \t  http://ec.europa.eu/eurostat/web/prodcom/data/excel-files-nace-rev.2"
      if (status == 200) {
        jdat <- jsonlite::fromJSON(url)
      }
      else if (status == 400) {
        stop("Failure to get data. Probably invalid dataset id. Status code: ", 
             status, msg)
      }
      else if (status == 500) {
        stop("Failure to get data. Probably filters did not return any data \n         or data exceeded query size limitation. Status code: ", 
             status, msg)
      }
      else {
        stop("Failure to get data. Status code: ", 
             status, msg)
      }
    }
    dims <- jdat$dimension
    ids <- jdat$id
    dims_list <- lapply(dims[rev(ids)], function(x) {
      y <- x$category$label
      if (type[1] == "label") {
        y <- unlist(y, use.names = FALSE)
      }
      else if (type[1] == "code") {
        y <- names(unlist(y))
      }
      else if (type[1] == "both") {
        y <- unlist(y)
      }
      else {
        stop("Invalid type ", type)
      }
    })
    variables <- expand.grid(dims_list, KEEP.OUT.ATTRS = FALSE, 
                             stringsAsFactors = stringsAsFactors)
    dat <- as.data.frame(variables[rev(names(variables))])
    vals <- unlist(jdat$value, use.names = FALSE)
    dat$values <- rep(NA, nrow(dat))
    inds <- 1 + as.numeric(names(jdat$value))
    if (!length(vals) == length(inds)) {
      stop("Complex indexing not implemented.")
    }
    dat$values[inds] <- vals
    tibble::as_tibble(dat)
  }
}

The problem for setting the proxy on my side is not one for me, but for new users internally, for whom any speak of networking is already 'very deep' technical knowledge :) .

antagomir commented 4 years ago

If this can be solved, it would be great and highly relevant. The only thing I am concerned about is that we might like to be careful with how much extra overheads and instability the possible solution would introduce in the code. This is not the only package with this problem, and in general institutions may need to develop more generic solutions for these anyway if there is a need to use these workflows. Although, making it easier for some packages might help to advance those efforts..

cbizzo commented 4 years ago

Certainly. If this can't be incorporated because of complexity/stability risks, perhaps it is sufficient to leave this issue here. Others may copy the code snippet above if they face similar problems.

jhuovari commented 4 years ago

Thanks @cbizzo for working on this issue. I am also bit hesitant to use non-exported functions as solution. However, your first solution using file() could be fine, but we should be wary of the redirection issue as well.

My understandin is that majorty of users do not have these issues, but

  1. httr::GET() might be blocked by proxy.

  2. In some systems url (and file I guess) uses curl-library in which case redirections are not working without -L.

Question is how to solve them both with a robust solution.