Closed RafTim closed 9 years ago
An HTTP proxy that does not support the CONNECT verb cannot transport an HTTPS request.
According to that proxy's documentation: "... configure your HTTP client to use a HTTP proxy even for HTTPS URLs. However, not many clients support this for privacy reasons that don't apply to web crawling. cURL doesn't support it, but lwp-request does." If this is not possible, how does lwp-request support it?
Let's be clear about what you're asking requests to do.
An HTTPS request establishes a secure connection between your machine and the origin server. This means you encrypt the connection using TLS, which in turn means you authenticate the origin server. This authentication is done using TLS certificates.
Combining this with proxies is tricky. The proxy can't be the other end of the TLS connection because it doesn't have the correct certificate (nor should it!). This means all you can possibly do is establish a TCP tunnel through the proxy: you connect to the proxy, the proxy connects to the remote end, and then all packets just get forwarded through. This allows you to still perform the TLS handshake. That's exactly what the CONNECT verb is for: it establishes that TCP tunnel.
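To make the tunnelling step concrete, here is a minimal sketch of what a client does at the socket level when it uses CONNECT. The helper names (build_connect_request, open_tls_tunnel) and the proxy/target hosts are hypothetical placeholders, not part of requests' API:

```python
import socket
import ssl

def build_connect_request(host, port):
    # CONNECT asks the proxy to open a raw TCP tunnel to host:port;
    # once the proxy answers 200, bytes are forwarded verbatim and the
    # TLS handshake happens end-to-end with the origin server.
    return (
        f"CONNECT {host}:{port} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "\r\n"
    ).encode("ascii")

def open_tls_tunnel(proxy_host, proxy_port, target_host, target_port=443):
    # Hypothetical helper: connect to the proxy, request the tunnel,
    # then wrap the *same* socket in TLS aimed at the origin server.
    sock = socket.create_connection((proxy_host, proxy_port))
    sock.sendall(build_connect_request(target_host, target_port))
    reply = sock.recv(4096)
    status_line = reply.split(b"\r\n", 1)[0]
    if b" 200" not in status_line:
        raise OSError(f"proxy refused CONNECT: {status_line!r}")
    ctx = ssl.create_default_context()
    # Note: the certificate is verified against the origin host,
    # not the proxy -- the proxy never sees the plaintext.
    return ctx.wrap_socket(sock, server_hostname=target_host)
```

The key point is the last line: because the proxy only forwards packets, the TLS certificate check still targets the origin server, which is exactly what a proxy without CONNECT cannot offer.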
To do this without the CONNECT verb requires that you make the TLS connection with the proxy instead, then use it like a standard HTTP proxy. This is a bad idea: it represents a man-in-the-middle attack. I can't stress this enough: you need to ensure that you trust Crawlera before doing this.
If you really want to do it, requests should be able to: set the scheme of the proxy URL you pass us to https rather than http. That will cause us to establish the TLS connection with the proxy itself.
That means your code changes to:
import requests

proxies = {"http": "http://USER:PASS@paygo.crawlera.com:8010/",
           "https": "https://USER:PASS@paygo.crawlera.com:8010/"}
requests.get("https://wikipedia.org", proxies=proxies)
I recently tried to make a connection to an https URL via an HTTP proxy which doesn't support the CONNECT method (Crawlera, http://scrapinghub.com/faq#https). I'm using the following code:
import requests

proxies = {"http": "http://USER:PASS@paygo.crawlera.com:8010/",
           "https": "http://USER:PASS@paygo.crawlera.com:8010/"}
requests.get("https://wikipedia.org", proxies=proxies)
will result in:
Is there anything to do about that?