Closed acdha closed 12 years ago
We have a failing test in develop
right now for this actually :)
Spent some time scratching my head over this.
The language of http://www.ietf.org/rfc/rfc2047.txt suggests that headers can only be ASCII (maybe latin1?) so it seems that the behavior is correct.
Adding:
elif isinstance(headers, unicode):
return headers.encode('ascii')
Would handle the common case of a user accidentally using reasonable unicode - and will throw a more informative error in the edge case.
I've been told recently that latin1
is the proper encoding to use as well.
Well, having trolled (shallowly) through a bunch RFCs futilely, it seems that in practice, some (i tested two: gunicorn, bigIP) web servers handle unicode headers gracefully. I guess it depends on how much you want to hand-hold the user, Kenneth.
[18] bshi@air:requests $ curl -vL -H 'X-Test: 中文' 'http://httpbin.org/get'
* About to connect() to httpbin.org port 80 (#0)
* Trying 107.21.118.103... connected
> GET /get HTTP/1.1
> User-Agent: curl/7.23.1 (x86_64-apple-darwin11.2.0) libcurl/7.23.1 OpenSSL/1.0.0g zlib/1.2.5 libidn/1.22
> Host: httpbin.org
> Accept: */*
> X-Test: 中文
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Wed, 01 Feb 2012 07:28:18 GMT
< Server: gunicorn/0.13.4
< Content-Length: 413
< Connection: keep-alive
<
{
"url": "http://httpbin.org/get",
"headers": {
"Content-Length": "",
"X-Forwarded-Port": "80",
"Connection": "keep-alive",
"Accept": "*/*",
"User-Agent": "curl/7.23.1 (x86_64-apple-darwin11.2.0) libcurl/7.23.1 OpenSSL/1.0.0g zlib/1.2.5 libidn/1.22",
"Host": "httpbin.org",
"Content-Type": "",
"X-Test": "\u4e2d\u6587"
},
"args": {},
"origin": "107.3.137.146"
* Connection #0 to host httpbin.org left intact
}* Closing connection #0
SO question HTTP header should use what character encoding?
This is biting me right now. I think the above mentioned fix:
elif isinstance(headers, unicode):
return headers.encode('ascii')
is the right way to go. For instance, I'm getting the header info straight from a json doc. json.loads automatically encodes all my strings to unicode. Py3k will be worse as unicode strings are standard (I think). Not doing this will only cause headaches. Also, the error message is not very helpful...
Let's do it — except w/ latin1
instead.
I made the easy change & updated the tests to confirm that simply calling request.get with a unicode instance in the headers can now work. I'm not sure how much we care but it's still not possible to use non-latin-1 characters without manually handling the RFC 2047 encoding yourself – does it make any sense for us to attempt that here?
The same issue exists with cookies. If you pass a dict with unicode a cryptic error appears when the translate method is called on the unicode object. I don't know if the cookies should be handled the same or if it's documented that the cookies should be strings and not unicode, but I wanted to mention this in case people weren't aware of it.
Fixed!
awesome, thanks.
header_expand (https://github.com/kennethreitz/requests/blob/develop/requests/utils.py#L149) has a fast path which only supports
str
. Unfortunately,response.url
will contain a unicode string which means that anyone doing something likerequests.get(response.url, headers={'Referer': response.url})
will get aValueError
fromheader_expand
.Does it make more sense to simply make the
header_expand
check doif isinstance(headers, basestring)
or have an explicitif isinstance(headers, unicode): return headers.encode("utf-8")
instead? I'm happy to test either way.