rOpenGov / eurostat

R tools for Eurostat data
http://ropengov.github.io/eurostat
Other
235 stars 46 forks source link

curl::has_internet() breaks for those who work behind a proxy #150

Closed cbizzo closed 4 years ago

cbizzo commented 4 years ago

I left a comment on the commit also: https://github.com/rOpenGov/eurostat/commit/19a22f3883baccdf78f4507be3ea5f5e8dbb2ec4#commitcomment-36824920

Perhaps first check if has_internet and if not, if http_proxy is defined?

muuankarski commented 4 years ago

Thanks, for bringing this up. Do you think it is still the case after this fix jeroen/curl@6fc0305 in curl::has_internet()

JensXII commented 4 years ago

If the fix is has_internet <- function(){ !is.null(nslookup("google.com", error = FALSE)) } then it works fine. When and where will this fix be avaiable?

cbizzo commented 4 years ago

Hi I was away from the corporate proxy so couldn't text these over the weekend. Unfortunately both the update to curl and the original leave the same outcome: FALSE. I have time to test some other approaches today. If I think of something I will inform you.

cbizzo commented 4 years ago

OK, I've done some digging. If you like I can submit a pull request with the changes. I describe the process below.

First, after looking deeper into the issue, curl_fetch_memory also does not work inside my work, nor do most curl functions. That means eurostat::get_eurostat_json() and similar functions which make such a call would not work for me also before this change.

My usage of eurostat is typically of the form eurostat::get_eurostat(id, filters). Looking under the hood, I've realised that this in turn will call get_eurostat_raw() which relies on download.file().

On Windows, unless otherwise configured download.file() will use wininet.dll to download a file over http and https. This is the similar to using Internet Explorer to download a file.

From what I can tell, that my workplace allows Internet Explorer to access the internet, but does not allow RStudio to, without recourse to proxy information.

My solution is to remove the curl::has_internet() call from get_eurostat() , but to leave it in any function that directly uses curl. This way when get_eurostat_raw() is called, which does not currently have a has_internet() call inside the function will proceed to download.

If a function is needed to check if the internet is available via wininet (or more specifically, download.file() you might consider pingr::is_online() or just to use download.file on something you always expect to be up, e.g. "http://captive.apple.com/hotspot-detect.html". Like so:

# heavily influenced by https://github.com/r-lib/pingr/blob/master/R/http.R

has_internet_download.file <- function(){ 
  temp <- tempfile()
     suppressWarnings(try(download.file("http://captive.apple.com/html", temp, quiet= TRUE), silent = TRUE))
     if (is.na(file.info(temp)$size)){
           FALSE
    }
     else{
         data<- readChar(temp, file.info(temp)$size)
         grepl("Success", readChar(temp, file.info(temp)$size), data)
    }
}
antagomir commented 4 years ago

Doing some testing now. If @muuankarski has any comments that might help.

muuankarski commented 4 years ago

Ok, had a look at this using curl version 4.3 in Windows behind a proxy. As pointed out by @cbizzo eurostat uses download.file() for data downloads that do work in Windows behind a proxy with no need for http_proxy/https_proxy- environment variables. Therefore eurostat has worked for people using it in such settings.

However, we need to comply with CRAN policy...

Packages which use Internet resources should fail gracefully with an informative message if the resource is not available (and not give a check warning nor error)

...and need a way to prevent error when resource is not available,

I was not able to produce FALSE with curl::has_internet() when proxy-variables were set empty. curl::curl_download() only works with proper proxy-settings. (In linux, behind a proxy, you need to define the proxy-settings even with download.file().) This is how it was checked.

# EU country level geojson
url <- "https://ec.europa.eu/eurostat/cache/GISCO/distribution/v2/nuts/geojson/NUTS_RG_60M_2006_4326_LEVL_0.geojson"
tmpfile <- tempfile()

# Empty proxy settings
Sys.setenv(http_proxy = "")
Sys.setenv(https_proxy = "")

curl::has_internet()
# [1] TRUE
download.file(url, tmpfile, mode = "wb")
# Content type 'text/plain' length 141040 bytes (137 KB)
# downloaded 137 KB

curl::curl_download(url, tmpfile)
# Error in curl::curl_download(url, tmpfile) : 
#   Failed to connect to ec.europa.eu port 443: Timed out

# Corporate proxy settings
Sys.setenv(http_proxy = "put your proxy settings here")
Sys.setenv(https_proxy = "put your proxy settings here")

curl::has_internet()
# [1] TRUE
download.file(url, tmpfile, mode = "wb")
# Content type 'text/plain' length 141040 bytes (137 KB)
# downloaded 137 KB

curl::curl_download(url, tmpfile)
# success

Although I can't reproduce the behaviour, I agree that we should find an replacement for has_internet(). However, there are users with offline environments that can only access to few url's in internet, http://captive.apple.com not being one of them. For them it would always produce an error although they do have access to ec.europa.eu. Therefore, would something like this do the trick?

has_internet_download.file <- function(){ 
  temp <- tempfile()
  suppressWarnings(try(download.file("http://ec.europa.eu/eurostat/cache/GISCO/distribution/v2/nuts/geojson/NUTS_RG_60M_2006_4326_LEVL_0.geojson", temp, quiet= TRUE), silent = TRUE))
  if (is.na(file.info(temp)$size)){
    FALSE
  }
  else{
    # data<- readChar(con = temp, nchars = file.info(temp)$size)
    # grepl("Success", readChar(temp, file.info(temp)$size), data)
    TRUE
  }
}

Any ideas?

antagomir commented 4 years ago

My impression is that this would do the trick, and would be ready to proceed with it. Eagerly looking fwd to comments from @cbizzo and @JensXII

JensXII commented 4 years ago

I am a ignorant on network setup. Hence, I cannot comment on, if I think it will work. I have two computers. One where curl::has_internet() TRUE and one with FALSE. I have looked at the system evironments Sys.getenv() on both of them. None of them have anything with http_proxy or anything with proxy. They are actually alike. On the other hand, having these two computers, I might be able to test. But you have to tell me exactly what to do.

muuankarski commented 4 years ago

download.file() hangs the system in unix if there is no access to resource. I implemented a slightly improved solution in branch https://github.com/rOpenGov/eurostat/tree/feature/check_internet

@JensXII perhaps you can run the lines below and report if it works for you?

remotes::install_github("ropengov/eurostat", ref = "feature/check_internet")
library(eurostat)
packageVersion("eurostat") # Should be ‘3.4.20006’

# Set proxies empty
Sys.setenv(http_proxy = "")
Sys.setenv(https_proxy = "")
# check they really are empty
Sys.getenv("http_proxy")
Sys.getenv("https_proxy")

eurostat::clean_eurostat_cache()
eurostat::search_eurostat(pattern = "Fertility")
eurostat::get_eurostat(id = "demo_r_frate2")

In unix, behind a proxy with no proxy defined, this should gracefully fail and produce a message

You have no access to ec.europe.eu. 
             Please check your connection and/or review your proxy settings
cbizzo commented 4 years ago

Hi I finally got around to looking. This fix fixes behind the firewall for our uses. Thanks a lot for taking care.

cbizzo commented 4 years ago

If it's pulled into master we can close the issue. Thanks.

JensXII commented 4 years ago

I have done the testing suggested by muuankarski, but changed to get_eurostat to what caused me problem in the first place. As you can see below and cbizzo already has concluded - it work.

Hence, I agree, get the cran-package updated, and close the issue.

Thank you, for the quick solving of the problem :-)

  1. on a computer with curl::has_internet() FALSE:

curl::has_internet() [1] FALSE

remotes::install_github("ropengov/eurostat", ref = "feature/check_internet") Skipping install of 'eurostat' from a github remote, the SHA1 (06c1fc09) has not changed since last install. Use force = TRUE to force installation library(eurostat) packageVersion("eurostat") # Should be ‘3.4.20006’ [1] ‘3.4.20006’

Set proxies empty

Sys.setenv(http_proxy = "") Sys.setenv(https_proxy = "")

check they really are empty

Sys.getenv("http_proxy") [1] "" Sys.getenv("https_proxy") [1] ""

eurostat::clean_eurostat_cache() Deleted .rds files from C:\Users\NLS\AppData\Local\Temp\4\RtmpA3HIko/eurostat eurostat::search_eurostat(pattern = "Fertility")

A tibble: 10 x 8

title code type last update of data last table structure change data start data end values

1 Fertility rates by age and NUTS 2 region demo_r_frate2 dataset 13.03.2019 10.05.2019 1990 2017 NA 2 Fertility indicators by NUTS 2 region demo_r_find2 dataset 15.03.2019 10.05.2019 1990 2017 NA 3 Fertility indicators by NUTS 3 region demo_r_find3 dataset 17.10.2019 28.02.2019 2013 2017 NA 4 Fertility and mortality - cities and greater cities urb_cfermor dataset 21.01.2020 21.01.2020 1990 2019 NA 5 Fertility and mortality - functional urban areas urb_lfermor dataset 21.01.2020 21.01.2020 1990 2019 NA 6 Fertility indicators demo_find dataset 06.11.2019 15.08.2019 1960 2017 NA 7 Fertility rates by age demo_frate dataset 06.11.2019 06.11.2019 1960 2017 NA 8 Fertility rates by age and NUTS 2 region demo_r_frate2 dataset 13.03.2019 10.05.2019 1990 2017 NA 9 Fertility indicators by NUTS 2 region demo_r_find2 dataset 15.03.2019 10.05.2019 1990 2017 NA 10 Fertility indicators by NUTS 3 region demo_r_find3 dataset 17.10.2019 28.02.2019 2013 2017 NA # eurostat::get_eurostat(id = "demo_r_frate2") pop_data <- subset(eurostat::get_eurostat("demo_r_pjangrp3", time_format = "num"), + (age == "TOTAL") & (sex == "T") & (nchar(trimws(geo)) == 5))[, c("time","geo","values")] trying URL 'https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fdemo_r_pjangrp3.tsv.gz' Content type 'application/octet-stream;charset=UTF-8' length 2294016 bytes (2.2 MB) downloaded 2.2 MB

Table demo_r_pjangrp3 cached at C:\Users\NLS\AppData\Local\Temp\4\RtmpA3HIko/eurostat/demo_r_pjangrp3_num_code_TF.rds

  1. on a computer with curl::has_internet() TRUE:

curl::has_internet() [1] TRUE remotes::install_github("ropengov/eurostat", ref = "feature/check_internet") Downloading GitHub repo ropengov/eurostat@feature/check_internet These packages have more recent versions available. Which would you like to update?

1: All
2: CRAN packages only
3: None
4: tidyr (0.8.3 -> 1.0.0) [CRAN] 5: hms (0.5.0 -> 0.5.3) [CRAN]

Enter one or more numbers, or an empty line to skip updates: library(eurostat) Enter one or more numbers, or an empty line to skip updates: packageVersion("eurostat") # Should be ‘3.4.20006’ Enter one or more numbers, or an empty line to skip updates: 2 tidyr (0.8.3 -> 1.0.0) [CRAN] hms (0.5.0 -> 0.5.3) [CRAN] Installing 2 packages: tidyr, hms trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/tidyr_1.0.0.zip' Content type 'application/zip' length 1293390 bytes (1.2 MB) downloaded 1.2 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/hms_0.5.3.zip' Content type 'application/zip' length 107433 bytes (104 KB) downloaded 104 KB

package ‘tidyr’ successfully unpacked and MD5 sums checked package ‘hms’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in C:\Users\nls\AppData\Local\Temp\RtmpcbV1oF\downloaded_packages √ checking for file 'C:\Users\nls\AppData\Local\Temp\RtmpcbV1oF\remotes1bd47e63168d\rOpenGov-eurostat-06c1fc0/DESCRIPTION' (1.9s)

Table demo_r_frate2 cached at C:\Users\nls\AppData\Local\Temp\RtmpcbV1oF/eurostat/demo_r_frate2_date_code_TF.rds

A tibble: 431,470 x 5

unit age geo time values

1 NR TOTAL AL0 2017-01-01 1.47 2 NR TOTAL AT 2017-01-01 1.52 3 NR TOTAL AT1 2017-01-01 1.46 4 NR TOTAL AT11 2017-01-01 1.36 5 NR TOTAL AT12 2017-01-01 1.57 6 NR TOTAL AT13 2017-01-01 1.41 7 NR TOTAL AT2 2017-01-01 1.48 8 NR TOTAL AT21 2017-01-01 1.51 9 NR TOTAL AT22 2017-01-01 1.46 10 NR TOTAL AT3 2017-01-01 1.62 # ... with 431,460 more rows > pop_data <- subset(eurostat::get_eurostat("demo_r_pjangrp3", time_format = "num"), + (age == "TOTAL") & (sex == "T") & (nchar(trimws(geo)) == 5))[, c("time","geo","values")] trying URL 'https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fdemo_r_pjangrp3.tsv.gz' Content type 'application/octet-stream;charset=UTF-8' length 2294016 bytes (2.2 MB) downloaded 2.2 MB Table demo_r_pjangrp3 cached at C:\Users\nls\AppData\Local\Temp\RtmpcbV1oF/eurostat/demo_r_pjangrp3_num_code_TF.rds >
antagomir commented 4 years ago

@muuankarski when you are ready, can you merge to devel branch; I can take care of the rest of testing, CRAN, and master then.

muuankarski commented 4 years ago

merged in to devel

antagomir commented 4 years ago

Fixed in master now & on the way to CRAN.

antagomir commented 4 years ago

In CRAN now: https://cran.r-project.org/web/packages/eurostat/index.html