rene-armida commented 8 years ago

Duplicate bug disclaimer: this is related to other Unicode + header issues, however, those all focus on errors raised while generating an outbound request. This applies only to inbound requests. (See also: #1082, #400, and to a lesser extent, #390, #409, #421, and #424; if this is already covered by another bug, sorry for bugging you guys. Thanks!)

When inspecting responses with Unicode-encoded non-latin content trailing the HTTP status header in the response, requests raises UnicodeDecodeError.

(requests) marmida@loquat:~$ python
Python 2.7.10 (default, Oct 14 2015, 16:09:02) 
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> resp = requests.get('http://jp.apps.gree.net/ja/1604')
>>> resp.ok                                                                                                      
Traceback (most recent call last):                                                                               
  File "<stdin>", line 1, in <module>                                                                            
  File "/home/marmida/venvs/requests/local/lib/python2.7/site-packages/requests/models.py", line 623, in ok      
    self.raise_for_status()                                                                                      
  File "/home/marmida/venvs/requests/local/lib/python2.7/site-packages/requests/models.py", line 837, in raise_for_status                                                                                                         
    http_error_msg = '%s Server Error: %s for url: %s' % (self.status_code, self.reason, self.url)               
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 23: ordinal not in range(128)

The content causing the problem, fetched via curl:

$ curl -v 'http://jp.apps.gree.net/ja/1604' 
* Connected to jp.apps.gree.net (116.93.155.51) port 80 (#0)
> GET /ja/1604 HTTP/1.1
> Host: jp.apps.gree.net
> User-Agent: curl/7.43.0
> Accept: */*
> 
< HTTP/1.1 503 AKB48ステージファイター
< Server: nginx

In the stack trace above, formatting http_error_msg as a string won't work, because self.reason contains non-ascii unicode.

Requests version: 2.9.1 Run from a fresh virtualenv with only requests installed.

Lukasa commented 8 years ago

I think the best way for this to be solved is for us to switch the format specifier for the reason phrase from %s to %r. Given that RFC 7230 tells us that a client should basically ignore the reason phrase anyway, we should avoid trying to interpret it.

kennethreitz commented 8 years ago

:+1:

rwe commented 8 years ago

Extra datapoint, I've been bitten by this too while interacting with the API of a Spanish-language server. The correct fixes here are, I believe:

Properly decode the HTTP Reason Phrase field as described by @denis-ryzhkov in #1181 for header values.
- According to RFC2616, Reason Phrase is TEXT which defaults to latin1 and supports other characters under RFC2047.
Store the .reason attribute as unicode if it isn't already.
In .raise_for_status use %r as @Lukasa suggests (since exception messages are supposed to be bytestrings).

Lukasa commented 8 years ago

@erydo Changing the decoding of the reason phrase is beyond the scope of requests: that is handled entirely by httplib, and we can't change it. Generally speaking that works in a way that is acceptable: it misbehaves only in a few situations on Python 2.

Lukasa commented 8 years ago

(Please note also that RFC 2616 is superseded by RFC 7230 in this case, which defines the reason phrase as HTAB/SP/VCHAR/obs-text, which basically allows the printable US-ASCII characters and octets from 0x80 to 0xFF. The easiest way to handle such a field is just to let it stay as bytes.)

rwe commented 8 years ago

@Lukasa I would expect that RFC obsolescence to apply more to developers of new servers rather than consumers of them.

Apache Tomcat, for example, sends down ~~unicode~~ latin1 reason phrases for non-en locales, and that's not a rarely-encountered server.

Lukasa commented 8 years ago

The RFC obsolescence applies to all developers always. RFC 7230-7233 are the specification of HTTP/1.1. RFC 7230 codifies the current best practice for deploying HTTP/1.1 and deviating from it is unwise.

In this case, as long as Tomcat is sending UTF-8 encoded data it is not deviating from RFC 7230 (though using any ABNF labelled "obsolete" is rarely a good sign). However, if Tomcat issues data using something like UTF-16 that is clearly unwise. So I think we can agree that Tomcat's behaviour could be bad depending on encoding choices.

Using bytes outside of printable US-ASCII in the HTTP header block is exceedingly unwise. Tomcat can do whatever they feel like, but I don't think that represents a sensible choice, and I'd be happy to express that opinion to any Tomcat developer who cares to ask.

rwe commented 8 years ago

Sorry, I updated my comment which I misstated: Tomcat sends the phrase down in latin1 encoding, not utf8 or utf16. It conforms correctly to RFC2616. latin1 includes common characters like é which are not US-ASCII and which do trigger this bug.

RFC7230 has this to say:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

Which agrees with your preference to keep it as bytes.

However, the "header values" it refers to are not for typically for human consumption. The Reason Phrase is explicitly for the user; so you have to be able to decode it somehow. Given these factors:

The explicit purpose of the Reason Phrase is to be shown to the user.
RFC2047 offers no useful guidance on how to interpret the Reason Phrase for non-ASCII characters.
RFC2616 does offer useful guidance for how to do that for non-ASCII characters, and in a way that is not incompatible with RFC2047.
Common, in-production servers send responses conforming with RFC2616.

I still believe the correct action here is decode it.

rwe commented 8 years ago

3034 proposes a solution to this with more detail on my reasoning behind it. Specifically, I believe RFC2047's recommendation (which only discusses headers) about the treatment of `obs-text` without excluding Reason Phrase is a bug in the specification. To treat Reason Phrase as "opaque data" while simultaneously recommending that servers localize it and clients display it to the user is contradictory, especially given the existence of a previous specification that addresses those needs.

I'll go hands off on this for now though, if anyone on the team would like to take a look. If rejected I can run an internally patched version, it isn't the end of the world for us of course.

sigmavirus24 commented 8 years ago

It conforms correctly to RFC2616

Please stop trying to justify the wrong behaviour by saying it conforms to an obsoleted specification. You're not convincing us of anything as we've been working towards conforming with the new set of HTTP/1.1 specifications as best we can while dealing with httplib.

I also don't understand why you're referring to sections in the RFC when the ABNF clearly states (as @Lukasa has already shown) that the reason phrase should be HTAB, SP, VCHAR, and obs-text.

I believe RFC2047's recommendation ... is a bug in the specification.

Then please file errata against the RFC to address that. Please also note that 2047 is not the latest RFC on the topic.

Regardless, I have to agree with @Lukasa that the reason string should be treated as opaque bytes. I don't understand why you think your decision to run an internally patched would sway us, since you can decode the raw reason bytes before using the reason string.

olivierlefloch commented 8 years ago

This seems to have been resolved by https://github.com/kennethreitz/requests/pull/3385

sigmavirus24 commented 8 years ago

You're right. Thanks @olivierlefloch

psf / requests

Inspecting 'response.ok' with non-ascii HTTP status "reason" content causes UnicodeDecodeError #3009