Open OscarLane opened 3 years ago
Have you read this vignette? https://cran.r-project.org/web/packages/curl/vignettes/windows.html
Also, I think your proxyuserpwd value of ":" is likely an error. Maybe just don't set any proxyuserpwd if you don't want to auth.
Hi @jeroen, thanks for following up.
On ":", what you say is what I would have expected too. However, I get different results when using different handles (this is all assuming I have not set ALL_PROXY to anything).
For example, when I run
> h <- curl::new_handle(proxy = "http://corp_webproxy:8080")
> curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt", handle = h)
Error in curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt", :
Received HTTP code 407 from proxy after CONNECT
>
> h <- curl::new_handle(proxy = "http://corp_webproxy:8080", proxyuserpwd = ":")
> curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt", handle = h)
the second request downloads successfully.
Though interestingly, after I have run these two curl requests, if I again run
> h <- curl::new_handle(proxy = "http://corp_webproxy:8080")
> curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt", handle = h)
it completes successfully. It appears curl is remembering something from the request/handle with ":" specified for subsequent requests.
The main aim for me, though, is to have these proxy settings automatically recognised when I start a new session. I have read the vignette you suggest a few times, and it appears to suggest the solution should be to set ALL_PROXY to my proxy, as I mentioned in my initial post. However, when I do this (in a new session) I get the following:
> Sys.setenv(
+ ALL_PROXY = "http://corp_webproxy:8080"
+ )
> curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt")
Error in curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt") :
Received HTTP code 407 from proxy after CONNECT
Even when I try to include ":" to assist in authentication, I get the following result:
> Sys.setenv(
+ ALL_PROXY = "http://corp_webproxy:8080",
+ proxyuserpwd = ":"
+ )
> curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt")
Error in curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt") :
Received HTTP code 407 from proxy after CONNECT
This suggests to me that ALL_PROXY and proxyuserpwd are not being passed successfully to subsequent curl requests, or at least not in the same way as when I include an explicit handle.
I don't know if this helps at all, but in httr I can run
library(httr)
set_config(
use_proxy(url = "http://corp_webproxy", port = 8080, auth = "ntlm", username = "", password = "")
)
and run GET requests just fine. I wonder if there is a way to pass the auth = "ntlm" argument to curl through the environment settings? I have tried using proxyauth = 8 in Sys.setenv, but it doesn't appear to help.
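For comparison, the httr configuration above can be approximated on a plain curl handle. This is only a sketch: it assumes that auth = "ntlm" corresponds to proxyauth = 8 (the value tried above) and that the empty username and password correspond to proxyuserpwd = ":". The host corp_webproxy is the placeholder used in this thread.

```r
library(curl)

# Assumption: 8 is the numeric value of the NTLM auth flag, as tried above.
ntlm_flag <- 8

# Build a handle that carries the proxy settings explicitly.
# Creating the handle does not send any request.
h <- new_handle(
  proxy = "http://corp_webproxy:8080",  # placeholder proxy from this thread
  proxyuserpwd = ":",                   # empty user name and password
  proxyauth = ntlm_flag
)

# The handle is then passed to each request, for example:
# curl_download("https://www.abs.gov.au/robots.txt", "robots.txt", handle = h)
```

These options live on the handle, not in the environment, which is why putting proxyuserpwd or proxyauth in Sys.setenv has no effect.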
You can see the full list of supported environment variables here: https://curl.se/libcurl/c/libcurl-env.html
To see the full list of supported curl options (which are the same as in httr) use: curl::curl_options('proxy')
Sometimes you can add the user/pass in the URL, like https://username:passwd@corp_webproxy, but I don't think there is a way to set curlopt_proxyauth in the environment variables. However, note that by default we set it to CURLAUTH_ANY:
https://github.com/jeroen/curl/blob/4634c1c004f15897b93faa428e136a350c9422cf/src/handle.c#L180
So I am a bit surprised it doesn't work by default.
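As a small illustration of the curl_options() call mentioned above, the following sketch lists the proxy-related options and checks that a given name is recognised before it is used on a handle. The variable name proxy_opts is just for illustration.

```r
library(curl)

# Named vector of libcurl options whose names contain "proxy"
proxy_opts <- curl_options("proxy")
print(names(proxy_opts))

# Check whether an option name is known before setting it on a handle
has_userpwd <- "proxyuserpwd" %in% tolower(names(proxy_opts))
print(has_userpwd)
```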
I can confirm that I have the same problem - which is becoming much more acute now that R 4.2.0 changed the default download method from wininet (which picks up proxy settings on Windows) to libcurl (which does not), meaning that R users on Windows behind corporate firewalls will not be able to perform actions as basic as downloading a package...
The problem is that the set of supported environment variables is significantly smaller than the set of supported curl options. Thus, there is a large number of curl options that cannot be set globally, and must be set for each handle individually. In many use cases, that is not possible or feasible (e.g. if calls to curl are buried deep in other packages). Since PROXYUSERPWD = ":" cannot be set globally (for all new handles), the only workaround seems to be to overwrite curl::new_handle(), as per this suggestion: https://stackoverflow.com/a/67838519. @OscarLane. This works for me, but ideally there ought to be a solution in curl for this.
@jeroen would it be possible at all to allow users to set global options (most urgently, for proxyuserpwd) that are picked up by all new handles?
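For readers who want to see what overwriting curl::new_handle() could look like, here is a minimal sketch in the spirit of that workaround. It is not necessarily the code from the linked answer: patched_new_handle is a hypothetical name, and the use of utils::assignInNamespace() is an assumption about how one might do the swap.

```r
library(curl)

# Hypothetical replacement constructor. The original function is captured
# in a local environment so the wrapper can still call it after the swap.
patched_new_handle <- local({
  original_new_handle <- curl::new_handle
  function(...) {
    h <- original_new_handle()
    # Session-wide default first ...
    curl::handle_setopt(h, proxyuserpwd = ":")
    # ... then the caller's own options, so explicit arguments win.
    curl::handle_setopt(h, ...)
    h
  }
})

# Swap the function inside the curl namespace for the current session only.
utils::assignInNamespace("new_handle", patched_new_handle, ns = "curl")
```

Because curl_download() and similar functions create their default handle through new_handle(), calls made by other packages should pick up the default as well. Patching a package namespace is fragile, though, and has to be repeated in every session (for example from .Rprofile).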
The base-R download.file() method is unrelated to the curl R package, and out of my control. To complain about the base-R behavior of download.file or install.packages, you need to post to https://bugs.r-project.org/ (make sure you closely read the manual page ?download.file on proxies before you post).
would it be possible at all to allow users to set global options (most urgently, for proxyuserpwd) that are picked up by all new handles?
Not generally. The problem with global options is that different packages that build on curl start conflicting with each other, because they override each other's preferences with "better defaults". We see this in other places where R uses global options, and it leads to very hard-to-debug bugs, where one package introduces side effects by changing global behavior that affects other packages.
In the case of curl, the only options that are appropriate to set globally are those that the user may want to override. Proxy options are a good example, because they are not something that a package should be setting, but something specific to a given user installation. However, for most of these global options, libcurl exposes environment variables already....
Thank you for the explanation, much appreciated!
Would proxyuserpwd be such an option potentially? Being able to set this particular option globally would solve this issue. Like a proxy, it seems like it is:
... not something that a package should be setting, but something specific to a given user installation
Yes, I think that may make sense. Can you try to explain in more detail in which ways setting the http_proxy and https_proxy environment variables as described in https://curl.se/libcurl/c/libcurl-env.html does not work? Then we can try to come up with a solution for those cases (for the general cases I really prefer users to use the environment variables, if possible).
Thank you.
Setting the proxy variables does work, and is indeed an additional requirement for using curl successfully with my setup. However, the corporate (Kerberos) proxy additionally requires user authentication to allow any traffic to pass through. Thus, merely setting http_proxy and https_proxy is not sufficient: I get an HTTP error 407 (proxy authentication required) when trying to use curl_download().
Instead, I also need to supply a username and password that the proxy can use to authenticate. On my corporate machine, the Windows login details are used for proxy authentication. Setting proxyuserpwd = ":" seems to allow a user to authenticate using the Windows credentials (I don't know how exactly, but that's what I learned). I need to set this once, globally, so that all new_handle() calls pick it up, since I cannot call handle_setopt() on handles that other packages and functions create.
EDIT: Note that how exactly I specify proxyuserpwd is not relevant (excuse me if I am wrong about this). Even if I were to hardcode the user name and password like so "[user name]:[password]", I currently would not be able to do it globally, as it is not an environment variable that libcurl understands - unlike http_proxy/https_proxy. Hence the same problem continues to apply.
By way of summary, this is the output I get when running the following commands in order (R 4.1.3):
library(curl)
# Using libcurl 7.64.1 with Schannel
file.exists("test.html")
# [1] FALSE
curl_download("http://www.google.de", "test.html")
# Error in curl_download("http://www.google.de", "test.html") :
# Timeout was reached: [] Connection timed out after 10000 milliseconds
Sys.setenv("http_proxy" = curl::ie_get_proxy_for_url())
Sys.setenv("https_proxy" = curl::ie_get_proxy_for_url())
curl_download("http://www.google.de", "test.html")
# Error in curl_download("http://www.google.de", "test.html") :
# HTTP error 407.
my_handle = new_handle()
handle_setopt(my_handle, .list = list(PROXYUSERPWD = ":"))
curl_download("http://www.google.de", "test.html", handle = my_handle)
file.exists("test.html") # download works successfully
# [1] TRUE
This Stack Overflow question also illustrates the problem well. The first reply does not address the problem (setting the right proxy is not sufficient), but the second reply shows the only workaround that I am aware of - which is to redefine new_handle() such that certain options are always immediately set whenever a handle is created. But the author comments, rightfully imo:
I would still love to see a clean solution as changing an inner function of a library is not something one should do...
I hope this makes it clearer.
@jay @bagder I am trying to understand this part: the user claims that setting the environment variables http_proxy and https_proxy by itself gives HTTP 407, but setting CURLOPT_PROXYUSERPWD simply to ":" seems to fix the problem. Is this expected behavior? I feel this should somehow be the default behavior then?
ps: we're still using libcurl 7.64.1 in this case, but I don't think this particular behavior has changed in newer versions.
Sorry, I did not see this until just now. I've proposed curl/curl#9087 to document more clearly that an empty username can use the Windows credentials for auth in SSPI builds. As for ALL_PROXY, you can't set the URL to something like http://:@corporate_proxy:8080/ to get the same behavior, but I'm not sure if that's intentional or a bug, so I've filed curl/curl#9088.
Thank you all for the discussion! This is very useful for me as I am in a very similar situation where proxy user authentication is required.
I just want to add that instead of setting proxyuserpwd = ":", you can set proxyusername = "" and this will also work. So I believe creating the handle as curl::new_handle(proxyusername = "") should also work, assuming the proxy URL is defined with an environment variable.
Anyway, it does not help much, as I cannot set proxyusername = "" globally.
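A minimal sketch of that variant, assuming the proxy URL itself is already supplied through http_proxy / https_proxy as described earlier in the thread:

```r
library(curl)

# Empty proxy user name instead of proxyuserpwd = ":".
# Assumption: the proxy URL comes from the http_proxy / https_proxy variables.
h <- new_handle(proxyusername = "")

# The handle still has to be passed explicitly to each request, e.g.:
# curl_download("http://www.google.de", "test.html", handle = h)
```

As with proxyuserpwd, this only affects requests that use this particular handle.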
I am attempting to use a package that uses curl under the hood to retrieve data from the Internet. This does not appear to automatically detect the proxy on my corporate network.
I have tried solutions suggested in #224
as well as
and
but these give me a 407 error after connecting to the proxy.
The following works, which tells me I have the right corporate proxy settings:
In RCurl, I can pass these with the following code which allows all RCurl requests to run:
Just wondering how to get curl to automatically use the proxy settings for all subsequent requests? Any help would be greatly appreciated.
(Pinging @MattCowgill & @HughParsonage for visibility)