psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0
51.78k stars 9.27k forks

Sometimes Requests doesn't properly handle domains encoded in Cyrillic #2081

Closed ntoshev closed 10 years ago

ntoshev commented 10 years ago

The domain gets URL-encoded:

>>> r=requests.get(u'http://www.тв-програма.bg')
>>> r.status_code
200
>>> requests.get(u'http://www.тв-програма.bg/predavane/%D0%95%D0%BA%D1%81%D0%BF%D0%B5%D0%B4%D0%B8%D1%86%D0%B8%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D0%94%D0%B6%D0%B5%D1%84-%D0%9A%D0%BE%D1%80%D1%83%D0%B8%D0%BD,1771305311//')
Traceback (most recent call last):
  File "<pyshell#20>", line 1, in <module>
    requests.get(u'http://www.тв-програма.bg/predavane/%D0%95%D0%BA%D1%81%D0%BF%D0%B5%D0%B4%D0%B8%D1%86%D0%B8%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D0%94%D0%B6%D0%B5%D1%84-%D0%9A%D0%BE%D1%80%D1%83%D0%B8%D0%BD,1771305311//')
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 585, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 179, in resolve_redirects
    allow_redirects=False,
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 559, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 375, in send
    raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host='www.%D1%82%D0%B2-%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%B0.bg', port=80): Max retries exceeded with url: /predavane/%D0%95%D0%BA%D1%81%D0%BF%D0%B5%D0%B4%D0%B8%D1%86%D0%B8%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D0%94%D0%B6%D0%B5%D1%84-%D0%9A%D0%BE%D1%80%D1%83%D0%B8%D0%BD,1771305311/ (Caused by <class 'socket.gaierror'>: [Errno -2] Name or service not known)

That URL appears in a webpage when crawling, and it works when pasted in a browser.

sigmavirus24 commented 10 years ago

I think you've misidentified the problem. Judging by your first request, we're treating the domain name just fine (and by debugging into the request logic, the domain is treated exactly the same in both requests). I visited the URL that causes the connection error and my browser tells me that it cannot find the server. Something about that URL is broken, but it isn't requests.


For the sake of others reading this issue, I placed a trace call in requests/adapters.py at line 316 (right before if not chunked) and examined the prepared requests for both calls. Here's what I found:

(Pdb) p request.url
'http://www.xn----8sbafg9clhjcp.bg/'
(Pdb) p request.url
'http://www.xn----8sbafg9clhjcp.bg/predavane/%D0%95%D0%BA%D1%81%D0%BF%D0%B5%D0%B4%D0%B8%D1%86%D0%B8%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D0%94%D0%B6%D0%B5%D1%84-%D0%9A%D0%BE%D1%80%D1%83%D0%B8%D0%BD,1771305311//'
(Pdb) p conn.host
'www.xn----8sbafg9clhjcp.bg'

Continuing raises this exception:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.%D1%82%D0%B2-%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%B0.bg', port=80): Max retries exceeded with url: /predavane/%D0%95%D0%BA%D1%81%D0%BF%D0%B5%D0%B4%D0%B8%D1%86%D0%B8%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D0%94%D0%B6%D0%B5%D1%84-%D0%9A%D0%BE%D1%80%D1%83%D0%B8%D0%BD,1771305311/ (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)
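For reference, the IDNA form of the host seen in the pdb trace above can be reproduced with the standard-library idna codec (a Python 3 sketch; the sessions in this thread are Python 2):

```python
# IDNA-encode the Cyrillic hostname, matching the host that appears in
# the pdb trace for the first (working) request.
ascii_host = "www.тв-програма.bg".encode("idna").decode("ascii")
print(ascii_host)  # → www.xn----8sbafg9clhjcp.bg
```

This is the encoding the first request ends up using on the wire, which is why it succeeds.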

Lukasa commented 10 years ago

Thanks for raising this issue!

The problem comes from redirects. The website in question in the failing case sends the following header:

>>> r.headers['Location']
'http://www.\xd1\x82\xd0\xb2-\xd0\xbf\xd1\x80\xd0\xbe\xd0\xb3\xd1\x80\xd0\xb0\xd0\xbc\xd0\xb0.bg/predavane/%D0%95%D0%BA%D1%81%D0%BF%D0%B5%D0%B4%D0%B8%D1%86%D0%B8%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D0%94%D0%B6%D0%B5%D1%84-%D0%9A%D0%BE%D1%80%D1%83%D0%B8%D0%BD,1771305311/'

Note that this is a UTF-8 encoded string, which is in violation of RFC 2616:

Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].

Of course, if it were encoded in that manner we would still have fallen over, so that's not all that helpful to us (note: should we support decoding RFC 2047 headers in future? Might be nice).

More generally, what are we supposed to do here? We could put the header through our full header processing, but then we'd have to decode the header as UTF-8, which violates the spec. If we decoded it as Latin-1 instead, we still wouldn't be able to resolve the host.
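To make the failure mechanical rather than mysterious: percent-encoding the raw UTF-8 bytes of the host label reproduces exactly the garbled hostname in the ConnectionError above (Python 3 sketch):

```python
from urllib.parse import quote

# The Location header arrives as raw UTF-8 bytes. Percent-encoding those
# bytes, as generic URL processing does, yields the bogus host that the
# connection pool then tries (and fails) to resolve.
garbled = quote("тв-програма".encode("utf-8"))
print(garbled)  # → %D1%82%D0%B2-%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%B0
```

Compare this with the host in the traceback: percent-escapes are valid in a path, but meaningless in a DNS name.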

I'd argue the server should have sent the IDNA-encoded hostname, not the UTF-8 encoded one, but I'm not enough of an expert to be sure. I'll ask on Stack Overflow.

Lukasa commented 10 years ago

The Stack Overflow question is here.

Lukasa commented 10 years ago

And our answer appears: the upstream server is at fault. From SO:

It must be a valid HTTP URI (as per RFCs 3986 and 7230), thus non-ASCII characters in the host name will need to be IDNA-encoded.

ntoshev commented 10 years ago

Wow, thanks for tracking this down!

This case is not critical to me, but in general I would expect that whatever works in browsers works in requests as well, even if the server is not behaving correctly. This page works in Chrome and fails in Firefox, so it's in a grey area really (actually in Firefox it doesn't load the first time, but the target location is shown in the address bar and if you press Enter again, it loads).

Lukasa commented 10 years ago

Yeah, the browsers have trouble here, but they have an advantage we don't: they can easily speculatively perform DNS resolution on the hostnames. This means that they can receive the Location header and immediately perform asynchronous DNS lookups on all those hostnames, attempting to work out which one it might be based on which hostname exists.

This is really difficult for requests to do because requests is fundamentally synchronous. Spawning three or four DNS lookups to try to find the right one basically requires us to either:

  1. Spawn threads to do the DNS lookups. Libraries should never spawn their own threads, so this is unappealing.
  2. Do the DNS lookups synchronously. DNS can be very slow, and doing two or three extra DNS lookups will add many hundreds of ms to our resolution time. This is also unappealing.
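An application (as opposed to the library) is free to take option 1. A sketch of racing candidate hostnames with threads, assuming Python 3 and a made-up helper name:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def first_resolvable(hostnames):
    """Return the first of `hostnames` that resolves in DNS, or None.

    Application-level sketch only: requests itself (reasonably) refuses
    to spawn threads like this on the user's behalf.
    """
    def resolves(host):
        try:
            socket.getaddrinfo(host, 80)
            return True
        except socket.gaierror:
            return False

    # Look up all candidates concurrently, then pick the first that worked.
    with ThreadPoolExecutor(max_workers=len(hostnames)) as pool:
        for host, ok in zip(hostnames, pool.map(resolves, hostnames)):
            if ok:
                return host
    return None
```

This keeps the caller in control of the thread pool and the timeout policy, which is exactly the control a library shouldn't take away.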

Ultimately, we're between a rock and a hard place. Browsers will always be able to do things we can't, because they're bigger, faster, and have a more specific use-case. We just have to do our best. In this case, you can behave like Firefox and take control of the redirection yourself:

s = requests.Session()
r = s.get(url, allow_redirects=False)

# Follow redirects by hand; stop if a 3xx arrives without a Location header.
while 300 <= r.status_code < 400 and 'Location' in r.headers:
    r = s.get(r.headers['Location'], allow_redirects=False)
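Once you hold the Location yourself, you can also repair a non-ASCII host before following it. A sketch, assuming Python 3, a Location already decoded to a Unicode string, and a helper name of our own invention:

```python
from urllib.parse import urlsplit, urlunsplit

def fix_location(location):
    """Re-encode a non-ASCII hostname in a redirect target using IDNA.

    Hypothetical helper, not part of requests. Drops any userinfo in the
    URL; good enough for a sketch.
    """
    parts = urlsplit(location)
    host = parts.hostname
    if host and any(ord(ch) > 127 for ch in host):
        netloc = host.encode("idna").decode("ascii")
        if parts.port is not None:
            netloc = "%s:%d" % (netloc, parts.port)
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)

print(fix_location("http://www.тв-програма.bg/predavane/x/"))
# → http://www.xn----8sbafg9clhjcp.bg/predavane/x/
```

Feeding each redirect target through something like this in the manual loop above sidesteps the broken header the server sends.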

sigmavirus24 commented 10 years ago

in general I would expect that whatever works in browsers works in requests as well, even if the server is not behaving correctly.

Requests is not a browser. It stores cookies for you and handles redirects, but there is a lot more that a browser does that requests does not. For example, browsers render pages, execute scripts, and (as noted above) speculatively resolve DNS.

Expecting requests to behave like a browser is unreasonable, not only because there's so much we can't do, but also because there's no way we can make decisions about what we should do in undefined or poorly defined cases.

Unless I'm misunderstanding the comments here, this issue can be closed, right?