jeroen / curl

A Modern and Flexible Web Client for R
https://jeroen.r-universe.dev/curl

Unable to set default proxy for curl requests on Windows 10 #237

Open OscarLane opened 3 years ago

OscarLane commented 3 years ago

I am attempting to use a package that uses curl under the hood to retrieve data from the Internet. This does not appear to automatically detect the proxy on my corporate network.

I have tried solutions suggested in #224

Sys.setenv(ALL_PROXY = "http://corporate_proxy:8080/")

as well as

Sys.setenv(ALL_PROXY = ":@http://corporate_proxy:8080/")

and

Sys.setenv(
  proxy = "http://corporate_proxy:8080",
  proxyuserpwd = ":"
)

but these give me a 407 error after connecting to the proxy.

The following works, which tells me I have the right corporate proxy settings:

h <- curl::new_handle(proxy = "http://corporate_proxy:8080", 
                      proxyuserpwd = ":")

curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt", handle = h)

In RCurl, I can pass these with the following code which allows all RCurl requests to run:

options(RCurlOptions = list(
  proxy = "http://vip_webproxy:8080",
  proxyuserpwd = ":",
  proxyauth = 8
))

Just wondering how to get curl to automatically use the proxy settings for all subsequent requests? Any help would be greatly appreciated.

(Pinging @MattCowgill & @HughParsonage for visibility)

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readabs_0.4.6.900

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        pillar_1.4.6      compiler_4.0.2    cellranger_1.1.0  plyr_1.8.6       
 [6] bitops_1.0-6      tools_4.0.2       digest_0.6.25     packrat_0.5.0     evaluate_0.14    
[11] lifecycle_0.2.0   tibble_3.0.3      pkgconfig_2.0.3   rlang_0.4.7       fastmatch_1.1-0  
[16] rstudioapi_0.11   curl_4.3          yaml_2.2.1        parallel_4.0.2    xfun_0.16        
[21] dplyr_1.0.0       httr_1.4.2        knitr_1.29        xml2_1.3.2        generics_0.0.2   
[26] vctrs_0.3.2       tidyselect_1.1.0  data.table_1.13.0 glue_1.4.1        R6_2.4.1         
[31] rsdmx_0.5-14      XML_3.99-0.5      readxl_1.3.1      rmarkdown_2.3     purrr_0.3.4      
[36] tidyr_1.1.0       magrittr_1.5      hutils_1.5.1      ellipsis_0.3.1    htmltools_0.5.0  
[41] fst_0.9.2         rvest_0.3.6       stringi_1.4.6     RCurl_1.98-1.2    crayon_1.3.4     

jeroen commented 3 years ago

Have you read this vignette? https://cran.r-project.org/web/packages/curl/vignettes/windows.html

Also, I think your proxyuserpwd value of : is likely an error. If you don't want to authenticate, simply don't set proxyuserpwd at all.

OscarLane commented 3 years ago

Hi @jeroen, thanks for following up.

On :, what you say is what I would have expected too. However, I get different results when using different handles (this is all assuming I have not set ALL_PROXY to be anything).

For example, when I run

> h <- curl::new_handle(proxy = "http://corp_webproxy:8080")
> curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt", handle = h)
Error in curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt",  : 
  Received HTTP code 407 from proxy after CONNECT
> 
> h <- curl::new_handle(proxy = "http://corp_webproxy:8080", proxyuserpwd = ":")
> curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt", handle = h)

the second request downloads successfully.

Though interestingly, after I have run these two curl requests, if I again run

> h <- curl::new_handle(proxy = "http://corp_webproxy:8080")
> curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt", handle = h)

it completes successfully. It appears curl is remembering something from the request/handle with : specified for subsequent requests.

The main aim for me, though, is to have these proxy settings automatically recognised when I start a new session. I have read the vignette you suggest a few times, and it appears to suggest the solution should be to set ALL_PROXY to my proxy as I mentioned in my initial post. However, when I do this (in a new session) I get the following:

> Sys.setenv(
+   ALL_PROXY = "http://corp_webproxy:8080"
+ )
> curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt")
Error in curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt") : 
  Received HTTP code 407 from proxy after CONNECT

Even when I try to include : to assist in authentication, I get the following result:

> Sys.setenv(
+   ALL_PROXY = "http://corp_webproxy:8080",
+   proxyuserpwd = ":"
+ )
> curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt")
Error in curl::curl_download("https://www.abs.gov.au/robots.txt", destfile = "robots.txt") : 
  Received HTTP code 407 from proxy after CONNECT

This suggests to me that ALL_PROXY and proxyuserpwd are not being passed successfully to subsequent curl requests. Or at least not in the same way as when I include an explicit handle.

OscarLane commented 3 years ago

I don't know if this helps at all, but in httr I can run

library(httr)
set_config(
  use_proxy(url = "http://corp_webproxy", port = 8080, auth = "ntlm", username = "", password = "")
)

and run GET requests just fine. I wonder if there is a way to pass the auth = "ntlm" argument to curl through the environment settings? I have tried setting proxyauth = 8 with Sys.setenv, but it doesn't appear to help.

jeroen commented 3 years ago

You can see the full list of supported environment variables here: https://curl.se/libcurl/c/libcurl-env.html

To see the full list of supported curl options (which are the same as in httr) use: curl::curl_options('proxy')

Sometimes you can add the user/pass in the URL like https://username:passwd@corp_webproxy but I don't think there is a way to set curlopt_proxyauth in the environment variables. However note that by default we set it to CURLAUTH_ANY:

https://github.com/jeroen/curl/blob/4634c1c004f15897b93faa428e136a350c9422cf/src/handle.c#L180

So I am a bit surprised it doesn't work by default.
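
As a minimal sketch of the two suggestions above: curl::curl_options() accepts a filter string, so the proxy-related options can be listed directly, and credentials can be embedded in the proxy URL held in an environment variable. The user name, password, and port below are placeholders, not values from this thread. A later comment in this thread reports that an empty user and password in the URL (http://:@host) does not trigger the same Windows-credential behaviour as proxyuserpwd = ":".

```r
# List the libcurl options whose names match "proxy" (case-insensitive).
proxy_opts <- curl::curl_options("proxy")
print(names(proxy_opts))

# proxyuserpwd and proxyauth are handle options; libcurl has no environment
# variable for them, so they cannot be set with Sys.setenv().
has_userpwd <- "proxyuserpwd" %in% names(proxy_opts)
has_auth <- "proxyauth" %in% names(proxy_opts)

# Placeholder credentials embedded in the proxy URL; replace before use.
Sys.setenv(ALL_PROXY = "http://username:passwd@corp_webproxy:8080")
```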

ysaidani commented 2 years ago

I can confirm that I have the same problem - which is becoming much more acute now that R 4.2.0 changed the default download method from wininet (which picks up proxy settings on Windows) to libcurl (which does not), meaning that R users on Windows behind corporate firewalls will not be able to perform actions as basic as downloading a package...

The problem is that the set of supported environment variables is significantly smaller than the set of supported curl options. Thus, a large number of curl options cannot be set globally and must be set on each handle individually. In many use cases that is not feasible (e.g. if calls to curl are buried deep in other packages). Since PROXYUSERPWD = ":" cannot be set globally (for all new handles), the only workaround seems to be to overwrite curl::new_handle(), as per this suggestion: https://stackoverflow.com/a/67838519 (cc @OscarLane). This works for me, but ideally there ought to be a solution in curl for this.
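
A rough sketch of that kind of workaround, written with base R only; it is not necessarily the code from the linked answer. The wrapper name new_handle_with_proxy_auth is hypothetical. The sketch replaces the new_handle binding inside the curl namespace, so it should run before other packages that import new_handle are loaded, and it carries all the usual risks of patching another package's internals.

```r
# Keep a reference to the original constructor before replacing it.
original_new_handle <- curl::new_handle

# Hypothetical wrapper: create a plain handle, apply the proxy credential
# default, then apply any options the caller passed so that they win.
new_handle_with_proxy_auth <- function(...) {
  h <- original_new_handle()
  curl::handle_setopt(h, proxyuserpwd = ":")
  curl::handle_setopt(h, ...)
  h
}

# Swap the binding inside the curl namespace, then lock it again.
ns <- asNamespace("curl")
unlockBinding("new_handle", ns)
assign("new_handle", new_handle_with_proxy_auth, envir = ns)
lockBinding("new_handle", ns)
```

After this runs, functions inside curl that default to handle = new_handle(), such as curl_download(), pick up the patched constructor. Packages that copied new_handle into their imports before the patch keep the original.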

@jeroen would it be possible at all to allow users to set global options (most urgently, for proxyuserpwd) that are picked up by all new handles?

jeroen commented 2 years ago

> I can confirm that I have the same problem - which is becoming much more acute now that R 4.2.0 changed the default download method from wininet (which picks up proxy settings on Windows) to libcurl (which does not), meaning that R users on Windows behind corporate firewalls will not be able to perform actions as basic as downloading a package...

The base-R download.file() method is unrelated to the curl R package, and out of my control. To complain about base-R behavior of download.file or install.packages you need to post to https://bugs.r-project.org/. (make sure you closely read the manual page for ?download.file on proxies before you post).

> would it be possible at all to allow users to set global options (most urgently, for proxyuserpwd) that are picked up by all new handles?

Not generally. The problem with global options is that different packages that build on curl start conflicting with each other, because they override each other's preferences with "better defaults". We see this in other places where R uses global options, and it leads to very hard-to-debug bugs, where one package introduces side effects by changing global behavior that affects other packages.

In the case of curl, the only options that are appropriate to set globally are those that the user may want to override. Proxy options are a good example, because they are not something that a package should be setting, but something specific to a given user installation. However, for most of these global options, libcurl exposes environment variables already....

ysaidani commented 2 years ago

> Not generally. The problem with global options is that different packages that build on curl start conflicting with each other, because they override each other's preferences with "better defaults". We see this in other places where R uses global options, and it leads to very hard-to-debug bugs, where one package introduces side effects by changing global behavior that affects other packages.

Thank you for the explanation, much appreciated!

> In the case of curl, the only options that are appropriate to set globally are those that the user may want to override. Proxy options are a good example, because they are not something that a package should be setting, but something specific to a given user installation. However, for most of these global options, libcurl exposes environment variables already....

Would proxyuserpwd be such an option potentially? Being able to set this particular option globally would solve this issue. Like a proxy, it seems like it is:

> ... not something that a package should be setting, but something specific to a given user installation

jeroen commented 2 years ago

> Would proxyuserpwd be such an option potentially? Being able to set this particular option globally would solve this issue. Like a proxy, it seems like it is:

Yes I think that may make sense. Can you try to explain in more detail in which ways setting the http_proxy and https_proxy environment variables as described in https://curl.se/libcurl/c/libcurl-env.html does not work? Then we can try to come up with a solution for those cases (for the general cases I really prefer users to use the environment variables, if possible)

ysaidani commented 2 years ago

Thank you.

Setting the proxy variables does work, and is indeed an additional requirement for using curl successfully with my setup. However, the corporate (Kerberos) proxy additionally requires user authentication to allow any traffic to pass through. Thus, merely setting http_proxy and https_proxy is not sufficient: I get an HTTP error 407 (proxy authentication error) when trying to use curl_download().

Instead, I also need to supply a username and password that the proxy can use to authenticate. On my corporate machine, the Windows login details are used for proxy authentication. Setting proxyuserpwd = ":" seems to allow a user to authenticate using the Windows credentials (I don't know how exactly, but that's what I learned). I need to set this once, globally, so that all new_handle() calls pick it up, since I cannot call handle_setopt() on handles that other packages and functions create.

EDIT: Note that how exactly I specify proxyuserpwd is not relevant (excuse me if I am wrong about this). Even if I were to hardcode the user name and password like so "[user name]:[password]", I currently would not be able to do it globally, as it is not an environment variable that libcurl understands - unlike http_proxy/https_proxy. Hence the same problem continues to apply.

By way of summary, this is the output I get when running the following commands in order (R 4.1.3):

library(curl)
# Using libcurl 7.64.1 with Schannel
file.exists("test.html")
# [1] FALSE
curl_download("http://www.google.de", "test.html")
# Error in curl_download("http://www.google.de", "test.html") : 
#   Timeout was reached: [] Connection timed out after 10000 milliseconds
Sys.setenv("http_proxy" = curl::ie_get_proxy_for_url())
Sys.setenv("https_proxy" = curl::ie_get_proxy_for_url())
curl_download("http://www.google.de", "test.html")
# Error in curl_download("http://www.google.de", "test.html") : 
#   HTTP error 407.
my_handle = new_handle()
handle_setopt(my_handle, .list = list(PROXYUSERPWD = ":"))
curl_download("http://www.google.de", "test.html", handle = my_handle)
file.exists("test.html") # download works successfully
# [1] TRUE

This Stackoverflow question also illustrates the problem well. The first reply does not address the problem (setting the right proxy is not sufficient), but the second reply shows the only workaround that I am aware of - which is to redefine new_handle() such that certain options are always immediately set whenever a handle is created. But the author comments, rightfully imo:

I would still love to see a clean solution as changing an inner function of a library is not something one should do...

I hope this makes it clearer.

jeroen commented 2 years ago

@jay @bagder I am trying to understand this part:

The user claims that setting the environment variables http_proxy and https_proxy by itself gives HTTP 407, but setting CURLOPT_PROXYUSERPWD simply to : seems to fix the problem. Is this expected behavior? If so, I feel this should somehow be the default behavior. PS: we're still using libcurl 7.64.1 in this case, but I don't think this particular behavior has changed in newer versions.

jay commented 2 years ago

Sorry I did not see this until just now. I've proposed curl/curl#9087 to add to more docs that an empty username can use the Windows credentials for auth in SSPI builds. As to ALL_PROXY you can't set the URL to something like http://:@corporate_proxy:8080/ to get the same behavior, but I'm not sure if that's intentional or a bug so I've filed at curl/curl#9088

djhurio commented 2 years ago

Thank you all for the discussion! This is very useful for me as I am in a very similar situation where proxy user authentication is required.

I just want to add that instead of setting proxyuserpwd = ":", you can set proxyusername = "" and this will also work. So I believe creating the handle as curl::new_handle(proxyusername = "") should also work, assuming the proxy URL is defined with an environment variable.

Anyway it does not help much as I cannot set proxyusername = "" globally.
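
For code you control, the per-handle variant described above could look like the following minimal sketch. It assumes the proxy URL is already supplied through http_proxy/https_proxy (or ALL_PROXY), and the download call is left commented out because it needs the corporate network to succeed.

```r
# An empty proxy user name, as suggested above, in place of proxyuserpwd = ":".
# Assumes the proxy URL itself comes from an environment variable.
h <- curl::new_handle(proxyusername = "")

# Pass the handle explicitly to each request made from your own code:
# curl::curl_download("https://www.abs.gov.au/robots.txt",
#                     destfile = "robots.txt", handle = h)
```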