psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0
52.16k stars 9.33k forks source link

Unfortunate header unicode interactions #400

Closed acdha closed 12 years ago

acdha commented 12 years ago

header_expand (https://github.com/kennethreitz/requests/blob/develop/requests/utils.py#L149) has a fast path which only supports str. Unfortunately, response.url will contain a unicode string which means that anyone doing something like requests.get(response.url, headers={'Referer': response.url}) will get a ValueError from header_expand.

Does it make more sense to simply make the header_expand check do if isinstance(headers, basestring) or have an explicit if isinstance(headers, unicode): return headers.encode("utf-8") instead? I'm happy to test either way.

kennethreitz commented 12 years ago

We have a failing test in develop right now for this actually :)

bshi commented 12 years ago

Spent some time scratching my head over this.

The language of http://www.ietf.org/rfc/rfc2047.txt suggests that headers can only be ASCII (maybe latin1?) so it seems that the behavior is correct.

Adding:

elif isinstance(headers, unicode):
    return headers.encode('ascii')

Would handle the common case of a user accidentally using reasonable unicode - and will throw a more informative error in the edge case.

kennethreitz commented 12 years ago

I've been told recently that latin1 is the proper encoding to use as well.

kennethreitz commented 12 years ago

http://greenbytes.de/tech/webdav/draft-reschke-basicauth-enc-01.html

bshi commented 12 years ago

Well, having trolled (shallowly) through a bunch RFCs futilely, it seems that in practice, some (i tested two: gunicorn, bigIP) web servers handle unicode headers gracefully. I guess it depends on how much you want to hand-hold the user, Kenneth.

[18] bshi@air:requests $ curl -vL -H 'X-Test: 中文' 'http://httpbin.org/get'
* About to connect() to httpbin.org port 80 (#0)
*   Trying 107.21.118.103... connected
> GET /get HTTP/1.1
> User-Agent: curl/7.23.1 (x86_64-apple-darwin11.2.0) libcurl/7.23.1 OpenSSL/1.0.0g zlib/1.2.5 libidn/1.22
> Host: httpbin.org
> Accept: */*
> X-Test: 中文
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Wed, 01 Feb 2012 07:28:18 GMT
< Server: gunicorn/0.13.4
< Content-Length: 413
< Connection: keep-alive
< 
{
  "url": "http://httpbin.org/get", 
  "headers": {
    "Content-Length": "", 
    "X-Forwarded-Port": "80", 
    "Connection": "keep-alive", 
    "Accept": "*/*", 
    "User-Agent": "curl/7.23.1 (x86_64-apple-darwin11.2.0) libcurl/7.23.1 OpenSSL/1.0.0g zlib/1.2.5 libidn/1.22", 
    "Host": "httpbin.org", 
    "Content-Type": "", 
    "X-Test": "\u4e2d\u6587"
  }, 
  "args": {}, 
  "origin": "107.3.137.146"
* Connection #0 to host httpbin.org left intact
}* Closing connection #0
piotr-dobrogost commented 12 years ago

SO question HTTP header should use what character encoding?

berndt commented 12 years ago

This is biting me right now. I think the above mentioned fix:

elif isinstance(headers, unicode):
    return headers.encode('ascii')

is the right way to go. For instance, I'm getting the header info straight from a json doc. json.loads automatically encodes all my strings to unicode. Py3k will be worse as unicode strings are standard (I think). Not doing this will only cause headaches. Also, the error message is not very helpful...

kennethreitz commented 12 years ago

Let's do it — except w/ latin1 instead.

acdha commented 12 years ago

I made the easy change & updated the tests to confirm that simply calling request.get with a unicode instance in the headers can now work. I'm not sure how much we care but it's still not possible to use non-latin-1 characters without manually handling the RFC 2047 encoding yourself – does it make any sense for us to attempt that here?

LucianU commented 12 years ago

The same issue exists with cookies. If you pass a dict with unicode a cryptic error appears when the translate method is called on the unicode object. I don't know if the cookies should be handled the same or if it's documented that the cookies should be strings and not unicode, but I wanted to mention this in case people weren't aware of it.

kennethreitz commented 12 years ago

Fixed!

berndt commented 12 years ago

awesome, thanks.