Closed — ntoshev closed this issue 10 years ago
I think you've misidentified the problem. Judging by your first request, we're treating the domain name just fine (and by debugging into the request logic, the domain is treated exactly the same in both requests). I visited the URL that causes the connection error and my browser tells me it cannot find the server. Something about that URL is broken, but it isn't requests.
For the sake of others reading this issue, I placed a trace call in requests/adapters.py at line 316 (right before if not chunked) and examined the prepared requests for both calls. Here's what I found:
(Pdb) p request.url
'http://www.xn----8sbafg9clhjcp.bg/'
(Pdb) p request.url
'http://www.xn----8sbafg9clhjcp.bg/predavane/%D0%95%D0%BA%D1%81%D0%BF%D0%B5%D0%B4%D0%B8%D1%86%D0%B8%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D0%94%D0%B6%D0%B5%D1%84-%D0%9A%D0%BE%D1%80%D1%83%D0%B8%D0%BD,1771305311//'
(Pdb) p conn.host
'www.xn----8sbafg9clhjcp.bg'
Continuing raises this exception:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.%D1%82%D0%B2-%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%B0.bg', port=80): Max retries exceeded with url: /predavane/%D0%95%D0%BA%D1%81%D0%BF%D0%B5%D0%B4%D0%B8%D1%86%D0%B8%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D0%94%D0%B6%D0%B5%D1%84-%D0%9A%D0%BE%D1%80%D1%83%D0%B8%D0%BD,1771305311/ (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)
Thanks for raising this issue!
The problem comes from redirects. The website in question in the failing case sends the following header:
>>> r.headers['Location']
'http://www.\xd1\x82\xd0\xb2-\xd0\xbf\xd1\x80\xd0\xbe\xd0\xb3\xd1\x80\xd0\xb0\xd0\xbc\xd0\xb0.bg/predavane/%D0%95%D0%BA%D1%81%D0%BF%D0%B5%D0%B4%D0%B8%D1%86%D0%B8%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D0%94%D0%B6%D0%B5%D1%84-%D0%9A%D0%BE%D1%80%D1%83%D0%B8%D0%BD,1771305311/'
Note that this is a UTF-8 encoded string, which is in violation of RFC 2616:
Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].
Of course, if it were encoded in that manner we still would have fallen over, so that's not all that helpful to us (note: should we support decoding RFC 2047 headers in future? Might be nice).
More generally, what are we supposed to do here? We could put the header through our full header processing, but then we'd have to decode the header as UTF-8, and that violates the spec. If we decoded it as Latin-1 we'd get a mojibake hostname that still wouldn't resolve.
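To make the decoding dilemma concrete, here's a quick sketch of what each choice yields for the hostname bytes from the Location header above (byte string taken verbatim from this issue):

```python
# Raw hostname bytes from the server's Location header:
# the UTF-8 encoding of the Cyrillic "тв-програма".
raw = b'\xd1\x82\xd0\xb2-\xd0\xbf\xd1\x80\xd0\xbe\xd0\xb3\xd1\x80\xd0\xb0\xd0\xbc\xd0\xb0'

# Decoding as UTF-8 recovers what the server meant (but violates RFC 2616):
print(raw.decode('utf-8'))    # тв-програма

# Decoding as Latin-1 is closer to the spec, but produces mojibake
# that can never resolve in DNS:
print(raw.decode('latin-1'))
```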
I'd argue the server should have sent the IDNA-encoded hostname, not the UTF-8 encoded one, but I'm not enough of an expert to be sure. I'll ask on Stack Overflow.
And our answer appears: the upstream server is at fault. From SO:
It must be a valid HTTP URI (as per RFCs 3986 and 7230), thus non-ASCII characters in the host name will need to be IDNA-encoded.
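For the record, the ASCII host that works in the first request above is exactly the IDNA encoding of the Unicode one. A quick illustration using Python's built-in idna codec (an aside, not something requests does here):

```python
# IDNA encoding works label by label: 'www' and 'bg' are already
# ASCII, while 'тв-програма' becomes a punycode ("xn--") label.
host = 'www.тв-програма.bg'
ascii_host = host.encode('idna').decode('ascii')
print(ascii_host)  # www.xn----8sbafg9clhjcp.bg
```

This is the hostname the server should have put in its Location header.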
Wow, thanks for tracking this down!
This case is not critical to me, but in general I would expect that whatever works in browsers works in requests as well, even if the server is not behaving correctly. This page works in Chrome and fails in Firefox, so it's in a grey area really (actually in Firefox it doesn't load the first time, but the target location is shown in the address bar and if you press Enter again, it loads).
Yeah, the browsers have trouble here, but they have an advantage we don't: they can easily perform speculative DNS resolution on the hostnames. This means they can receive the Location header and immediately perform asynchronous DNS lookups on all the candidate hostnames, working out which one it might be based on which hostname actually exists.
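For the curious, that browser trick can be approximated in plain Python. A rough sketch (first_resolvable is a hypothetical helper, not anything requests could reasonably ship):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def first_resolvable(hostnames):
    """Return the first candidate hostname that resolves in DNS, else None."""
    def resolves(host):
        try:
            socket.gethostbyname(host)
            return True
        except (socket.gaierror, UnicodeError):
            return False

    # Look the candidates up concurrently, like a browser's speculative DNS.
    with ThreadPoolExecutor() as pool:
        for host, ok in zip(hostnames, pool.map(resolves, hostnames)):
            if ok:
                return host
    return None
```

For this issue, the candidates would be the UTF-8-decoded host and its IDNA-encoded form.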
This is really difficult for requests to do because requests is fundamentally synchronous. Spawning three or four DNS lookups to try to find the right one basically requires us to either block on each lookup in turn or bring in concurrency machinery, neither of which fits our model.
Ultimately, we're between a rock and a hard place. Browsers will always be able to do things we can't, because they're bigger, faster, and have a more specific use-case. We just have to do our best. In this case, you can behave like Firefox and take control of the redirection yourself:
import requests
from urllib.parse import urljoin

s = requests.Session()
r = s.get(url, allow_redirects=False)
while 300 <= r.status_code < 400:
    # urljoin handles relative Location headers as well as absolute ones
    r = s.get(urljoin(r.url, r.headers['Location']), allow_redirects=False)
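If you want that loop to survive this particular server's misbehaviour, you can also IDNA-encode the host in the Location URL before following it. A sketch (fix_location is a hypothetical helper, not part of requests):

```python
from urllib.parse import urlsplit, urlunsplit

def fix_location(url):
    """Re-encode a non-ASCII hostname in `url` as IDNA so DNS can resolve it."""
    parts = urlsplit(url)
    host = parts.hostname
    if host and any(ord(c) > 127 for c in host):
        ascii_host = host.encode('idna').decode('ascii')
        # Rebuild netloc, preserving an explicit port if one was given.
        netloc = ascii_host if parts.port is None else '%s:%d' % (ascii_host, parts.port)
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)
```

Passing each Location header through fix_location before the next s.get call would turn the broken UTF-8 host from this issue into the punycode host that actually resolves.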
in general I would expect that whatever works in browsers works in requests as well, even if the server is not behaving correctly.
Requests is not a browser. It stores cookies for you and handles redirects, but there is a lot more that a browser does that requests does not.
Expecting requests to behave like a browser is unreasonable not only because there's so much we can't do, but also because there's no way we can make decisions about what we should do in undefined or poorly defined cases.
Unless I'm misunderstanding the comments here, this issue can be closed, right?
The domain gets URL-encoded. That URL appears in a webpage found while crawling, and it works when pasted into a browser.