jeroen / curl

A Modern and Flexible Web Client for R
https://jeroen.r-universe.dev/curl

curl seems to ignore http_proxy environment variable on R 4.2.0 #268

Open ysaidani opened 2 years ago

ysaidani commented 2 years ago

On R 4.1.3, setting the proxy via the environment variables http_proxy and https_proxy works flawlessly:

# RGui 4.1.3, 64-bit
library(curl)
# Using libcurl 7.64.1 with Schannel
packageVersion("curl")
# [1] ‘4.3.2’
Sys.setenv("http_proxy" = curl::ie_get_proxy_for_url())
Sys.setenv("https_proxy" = curl::ie_get_proxy_for_url())
my_handle = new_handle()
handle_setopt(my_handle, .list = list(proxyuserpwd = ":")) # just ignore, required behind corporate Kerberos proxy
invisible(file.remove("test.html"))
curl_download("http://www.google.de", "test.html", handle = my_handle)
file.exists("test.html") # works
# [1] TRUE

But on R 4.2.0, curl does not seem to pick up the environment variables correctly:

# RGui 4.2.0, 64-bit - does not work
library(curl)
# Using libcurl 7.64.1 with Schannel
packageVersion("curl")
# [1] ‘4.3.2’
Sys.setenv("http_proxy" = curl::ie_get_proxy_for_url())
Sys.setenv("https_proxy" = curl::ie_get_proxy_for_url())
my_handle = new_handle()
handle_setopt(my_handle, .list = list(proxyuserpwd = ":")) # just ignore, required behind corporate Kerberos proxy
invisible(file.remove("test.html"))
curl_download("http://www.google.de", "test.html", handle = my_handle)
# Error in curl_download("http://www.google.de", "test.html", handle = my_handle) : 
#  Timeout was reached: [] Connection timed out after 10001 milliseconds

However, setting the proxy using handle_setopt works without problem:

# RGui 4.2.0, 64-bit - same session as above - works
handle_setopt(my_handle, .list = list(proxy = curl::ie_get_proxy_for_url()))
invisible(file.remove("test.html"))
curl_download("http://www.google.de", "test.html", handle = my_handle)
file.exists("test.html")
# [1] TRUE

# Confirm that it is indeed handle_setopt that makes it work, not the environment variables:
Sys.unsetenv(c("http_proxy", "https_proxy"))
handle_setopt(my_handle, .list = list(proxy = curl::ie_get_proxy_for_url()))
invisible(file.remove("test.html"))
curl_download("http://www.google.de", "test.html", handle = my_handle)
file.exists("test.html")
# [1] TRUE

I am unsure if this is an issue with R 4.2.0 or with curl, and whether it is reproducible on other systems with proxies.
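
To narrow this down, it may help to record what each R session actually sees before the download is attempted. The following is a minimal sketch, not part of the original report: `proxy_diagnostics` is a hypothetical helper (it does not exist in the curl package) that collects the proxy-related environment variables and, if the curl package is installed, the libcurl build in use.

```r
# Hypothetical helper (not part of the curl package): gather the settings
# worth comparing between the R 4.1.3 and R 4.2.0 sessions.
proxy_diagnostics <- function() {
  vars <- c("http_proxy", "https_proxy", "HTTPS_PROXY",
            "ALL_PROXY", "no_proxy", "NO_PROXY")
  env <- Sys.getenv(vars, unset = NA)
  out <- list(
    r_version = as.character(getRversion()),
    env = env[!is.na(env)]  # keep only variables that are actually set
  )
  if (requireNamespace("curl", quietly = TRUE)) {
    v <- curl::curl_version()
    out$libcurl <- v$version
    out$ssl_backend <- v$ssl_version
    if (.Platform$OS.type == "windows") {
      # ie_get_proxy_for_url() is only meaningful on Windows
      out$ie_proxy <- curl::ie_get_proxy_for_url("http://www.google.de")
    }
  }
  out
}

str(proxy_diagnostics())
```

Running this in both R versions and comparing the output should show whether the environment variables, the libcurl build, or the detected proxy differ between the two sessions.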

jeroen commented 2 years ago

Hmm, I cannot reproduce this, at least not with a simple HTTP proxy. I don't have a Kerberos proxy, but what I did was install Fiddler and then run:

library(curl)
Sys.setenv("http_proxy" = 'http://localhost:8888')
Sys.setenv("https_proxy" = 'http://localhost:8888')
my_handle <- new_handle(proxyuserpwd = ":")
curl_download("http://www.google.de", "test.html", handle = my_handle)

And I can see the request being routed through Fiddler in both R-4.1 and R-4.2. Are you sure your proxy is the same in both cases? What do you get for curl::ie_get_proxy_for_url()?
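
One way to compare the two sessions is to switch on libcurl's verbose log, which normally reports the host the connection is made to. The sketch below is an illustration only: `explicit_proxy_handle` is a hypothetical helper that copies the proxy from an environment variable onto the handle, so the handle does not rely on libcurl reading the variable itself.

```r
library(curl)

# Hypothetical helper: build a handle with verbose logging enabled and,
# if the given environment variable is set, the proxy set explicitly.
explicit_proxy_handle <- function(var = "http_proxy") {
  proxy <- Sys.getenv(var, unset = "")
  h <- new_handle(verbose = TRUE, proxyuserpwd = ":")
  if (nzchar(proxy)) {
    handle_setopt(h, proxy = proxy)
  }
  h
}

my_handle <- explicit_proxy_handle()
```

Passing this handle to curl_download in both R versions should make any difference in how the connection is routed visible in the verbose output.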

eitsupi commented 2 years ago

I believe this is the effect of download.file(method = "wininet") being changed to download.file(method = "libcurl") in R 4.2.0 on Windows.

In other words, until now functions such as download.file authenticated with the proxy via wininet, which allowed other tools that cannot authenticate on their own to pass through the proxy server. Now that download.file no longer authenticates with the proxy, those tools can no longer get through. It was possible to pass through the proxy without setting a user name and password only because another tool had already authenticated.

I use R and other tools in an NTLM-authenticated proxy environment, and the most reliable way to get through the proxy is to have a local proxy server such as Cntlm or px perform the NTLM (or other) authentication for you. Alternatively, since the curl CLI can handle NTLM authentication via the --proxy-ntlm option, you can run curl in a loop in a terminal to keep the proxy authentication alive while other tools pass through the proxy server in the meantime.

My recommendation on Windows is to use px. px uses the full Windows configuration, so there is no need to store authentication information anywhere, and it does not require administrative privileges to install.
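
With a local authenticating proxy in place, R only needs to be pointed at it. A minimal sketch follows, assuming px (or Cntlm) is already running and listening on 127.0.0.1:3128. The host, the port, and the helper name `use_local_proxy` are assumptions for illustration and must match your own configuration.

```r
# Assumption: px (or Cntlm) is already running locally on 127.0.0.1:3128.
# use_local_proxy is a hypothetical helper, not part of any package.
use_local_proxy <- function(host = "127.0.0.1", port = 3128L) {
  proxy <- sprintf("http://%s:%d", host, port)
  # Both variables point at the same local proxy, which forwards
  # requests upstream and handles the authentication.
  Sys.setenv(http_proxy = proxy, https_proxy = proxy)
  invisible(proxy)
}

use_local_proxy()
Sys.getenv(c("http_proxy", "https_proxy"))
```

Since the local proxy does the authentication, no credentials have to be stored in the R session.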

Sorry that it is in Japanese, but I wrote the following article about this earlier: https://qiita.com/eitsupi/items/226b65d54e207a7c5fe7

jeroen commented 2 years ago

I believe this is the effect of download.file(method = "wininet") being changed to download.file(method = "libcurl") in R 4.2.0 on Windows.

I thought we were talking about the implementation in the curl package, e.g. curl_download? That should not have changed.

Indeed the base-R download.file has changed defaults in R-4.2 but this is unrelated to the curl R package. To complain about base-R download.file changes you need to post in: https://bugs.r-project.org/

eitsupi commented 2 years ago

I thought we were talking about the implementation in the curl package, e.g. curl_download? That should not have changed.

Sorry for not explaining this clearly. My point is that the curl package may never have been able to authenticate with this proxy. I suspect the connection worked despite the lack of authentication because another tool (in this case, R's download.file function) had performed proxy authentication immediately before curl_download was called, which made further authentication unnecessary.

atocharnaud commented 2 years ago

I confirm we have a similar issue with R 4.2.0 on Windows: libcurl does not respect CURL_CA_BUNDLE, which we use for our self-signed certificates.
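
For the curl package, one thing that could be tried is passing the bundle to the handle explicitly instead of relying on the environment variable. This is a sketch under assumptions: `handle_with_ca_bundle` is a hypothetical helper, and whether libcurl's `cainfo` option takes effect depends on the libcurl version and TLS backend (Schannel builds in particular may behave differently).

```r
library(curl)

# Hypothetical helper: read the bundle path from CURL_CA_BUNDLE (or take it
# as an argument) and set it on the handle via libcurl's cainfo option.
handle_with_ca_bundle <- function(bundle = Sys.getenv("CURL_CA_BUNDLE", unset = "")) {
  h <- new_handle()
  if (nzchar(bundle)) {
    if (!file.exists(bundle)) {
      stop("CA bundle not found: ", bundle)
    }
    handle_setopt(h, cainfo = normalizePath(bundle))
  }
  h
}
```

The returned handle can then be passed to curl_download or curl_fetch_memory in the usual way.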