Closed rene-armida closed 8 years ago
I think the best way for this to be solved is for us to switch the format specifier for the reason phrase from %s
to %r
. Given that RFC 7230 tells us that a client should basically ignore the reason phrase anyway, we should avoid trying to interpret it.
:+1:
Extra datapoint, I've been bitten by this too while interacting with the API of a Spanish-language server. The correct fixes here are, I believe:
TEXT
which defaults to latin1
and supports other characters under RFC2047..reason
attribute as unicode if it isn't already..raise_for_status
use %r
as @Lukasa suggests (since exception messages are supposed to be bytestrings).@erydo Changing the decoding of the reason phrase is beyond the scope of requests: that is handled entirely by httplib, and we can't change it. Generally speaking that works in a way that is acceptable: it misbehaves only in a few situations on Python 2.
(Please note also that RFC 2616 is superseded by RFC 7230 in this case, which defines the reason phrase as HTAB/SP/VCHAR/obs-text, which basically allows the printable US-ASCII characters and octets from 0x80 to 0xFF. The easiest way to handle such a field is just to let it stay as bytes.)
@Lukasa I would expect that RFC obsolescence to apply more to developers of new servers rather than consumers of them.
Apache Tomcat, for example, sends down unicode latin1
reason phrases for non-en locales, and that's not a rarely-encountered server.
The RFC obsolescence applies to all developers always. RFC 7230-7233 are the specification of HTTP/1.1. RFC 7230 codifies the current best practice for deploying HTTP/1.1 and deviating from it is unwise.
In this case, as long as Tomcat is sending UTF-8 encoded data it is not deviating from RFC 7230 (though using any ABNF labelled "obsolete" is rarely a good sign). However, if Tomcat issues data using something like UTF-16 that is clearly unwise. So I think we can agree that Tomcat's behaviour could be bad depending on encoding choices.
Using bytes outside of printable US-ASCII in the HTTP header block is exceedingly unwise. Tomcat can do whatever they feel like, but I don't think that represents a sensible choice, and I'd be happy to express that opinion to any Tomcat developer who cares to ask.
Sorry, I updated my comment which I misstated: Tomcat sends the phrase down in latin1
encoding, not utf8
or utf16
. It conforms correctly to RFC2616. latin1
includes common characters like é
which are not US-ASCII and which do trigger this bug.
RFC7230 has this to say:
Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.
Which agrees with your preference to keep it as bytes.
However, the "header values" it refers to are not for typically for human consumption. The Reason Phrase is explicitly for the user; so you have to be able to decode it somehow. Given these factors:
I still believe the correct action here is decode it.
obs-text
without excluding Reason Phrase is a bug in the specification. To treat Reason Phrase as "opaque data" while simultaneously recommending that servers localize it and clients display it to the user is contradictory, especially given the existence of a previous specification that addresses those needs.I'll go hands off on this for now though, if anyone on the team would like to take a look. If rejected I can run an internally patched version, it isn't the end of the world for us of course.
It conforms correctly to RFC2616
Please stop trying to justify the wrong behaviour by saying it conforms to an obsoleted specification. You're not convincing us of anything as we've been working towards conforming with the new set of HTTP/1.1 specifications as best we can while dealing with httplib
.
I also don't understand why you're referring to sections in the RFC when the ABNF clearly states (as @Lukasa has already shown) that the reason phrase should be HTAB
, SP
, VCHAR
, and obs-text
.
I believe RFC2047's recommendation ... is a bug in the specification.
Then please file errata against the RFC to address that. Please also note that 2047 is not the latest RFC on the topic.
Regardless, I have to agree with @Lukasa that the reason string should be treated as opaque bytes. I don't understand why you think your decision to run an internally patched would sway us, since you can decode the raw reason bytes before using the reason string.
This seems to have been resolved by https://github.com/kennethreitz/requests/pull/3385
You're right. Thanks @olivierlefloch
Duplicate bug disclaimer: this is related to other Unicode + header issues, however, those all focus on errors raised while generating an outbound request. This applies only to inbound requests. (See also: #1082, #400, and to a lesser extent, #390, #409, #421, and #424; if this is already covered by another bug, sorry for bugging you guys. Thanks!)
When inspecting responses with Unicode-encoded non-latin content trailing the HTTP status header in the response, requests raises UnicodeDecodeError.
The content causing the problem, fetched via curl:
In the stack trace above, formatting
http_error_msg
as astring
won't work, becauseself.reason
contains non-asciiunicode
.Requests version: 2.9.1 Run from a fresh virtualenv with only
requests
installed.