http.client truncates UTF-8 encoded headers

6d5d82f3-c90b-41da-bb7f-abdec4dbac80 commented 8 years ago

BPO	27716
Nosy	@bitdancer, @vadmium, @Lukasa
Superseder	bpo-22233: http.client splits headers on non-\r\n characters
Files	header-decoding.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['library']
title = 'http.client truncates UTF-8 encoded headers'
updated_at =
user = 'https://github.com/Lukasa'
```

bugs.python.org fields:

```python
activity =
actor = 'martin.panter'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'Lukasa'
dependencies = []
files = ['44733']
hgrepos = []
issue_num = 27716
keywords = ['patch']
message_count = 7.0
messages = ['272236', '272237', '272246', '272250', '272254', '272288', '276867']
nosy_count = 3.0
nosy_names = ['r.david.murray', 'martin.panter', 'Lukasa']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = '22233'
type = None
url = 'https://bugs.python.org/issue27716'
versions = ['Python 3.5']
```

6d5d82f3-c90b-41da-bb7f-abdec4dbac80 commented 8 years ago

Originally reported as Requests issue #3485: https://github.com/kennethreitz/requests/issues/3485

On Python 3, http.client uses the email module to parse its HTTP headers. The email module, for better or worse, requires that it parse headers as *text*: that is, that they be decoded from bytes first and then parsed.
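As a rough illustration of that parsing order, the sketch below feeds a hand-written header block to `http.client.parse_headers`, an undocumented helper that exists in the module. The header values are made up for the example and are ASCII-only, so nothing goes wrong here.

```python
import io
import http.client

# A made-up, ASCII-only header block; the empty line terminates it.
raw = (
    b"Content-Type: text/html;charset=UTF-8\r\n"
    b"Content-Length: 42\r\n"
    b"\r\n"
)

# parse_headers() reads the header lines as bytes, decodes them as
# ISO-8859-1, and only then hands the resulting text to the email parser.
headers = http.client.parse_headers(io.BytesIO(raw))

# Values come back as str, not bytes.
print(type(headers["Content-Length"]), headers["Content-Length"])
```

The point is only that decoding happens before the header block is split into fields, which is what the rest of this report is about.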

This doesn't work for UTF-8 encoded headers. For example, the URL 'http://pl.bab.la/slownik/angielski-polski/' returns the following Link header, encoded as UTF-8: Link: <http://www.babla.cn/英语-波兰语/>; rel="alternate"; hreflang="zh-Hans", <http://cs.bab.la/slovnik/anglicky-polsky/>; rel="alternate"; hreflang="cs", <http://da.bab.la/ordbog/engelsk-polsk/>; rel="alternate"; hreflang="da", <http://de.bab.la/woerterbuch/englisch-polnisch/>; rel="alternate"; hreflang="de", <http://www.babla.gr/αγγλικα-πολωνικα/>; rel="alternate"; hreflang="el", <http://en.bab.la/dictionary/english-polish/>; rel="alternate"; hreflang="en", <http://eo.bab.la/vortaro/angla-pola/>; rel="alternate"; hreflang="eo", <http://es.bab.la/diccionario/ingles-polaco/>; rel="alternate"; hreflang="es", <http://fi.bab.la/sanakirja/englanti-puola/>; rel="alternate"; hreflang="fi", <http://fr.bab.la/dictionnaire/anglais-polonais/>; rel="alternate"; hreflang="fr", <http://www.babla.in/अंग्रेज़ी-पोलिश/>; rel="alternate"; hreflang="hi", <http://hu.bab.la/szótár/angol-lengyel/>; rel="alternate"; hreflang="hu", <http://www.babla.co.id/bahasa-inggris-bahasa-polandia/>; rel="alternate"; hreflang="id", <http://it.bab.la/dizionario/inglese-polacco/>; rel="alternate"; hreflang="it", <http://ja.bab.la/辞書/英語-ポーランド語/>; rel="alternate"; hreflang="ja", <http://www.babla.kr/영어-폴란드어/>; rel="alternate"; hreflang="ko", <http://nl.bab.la/woordenboek/engels-pools/>; rel="alternate"; hreflang="nl", <http://www.babla.no/engelsk-polsk/>; rel="alternate"; hreflang="no", <http://pl.bab.la/slownik/angielski-polski/>; rel="alternate"; hreflang="pl", <http://pt.bab.la/dicionario/ingles-polones/>; rel="alternate"; hreflang="pt", <http://ro.bab.la/dictionar/engleza-poloneza/>; rel="alternate"; hreflang="ro", <http://www.babla.ru/английский-польский/>; rel="alternate"; hreflang="ru", <http://sv.bab.la/lexikon/engelsk-polsk/>; rel="alternate"; hreflang="sv", <http://sw.bab.la/kamusi/kiingereza-kipolishi/>; rel="alternate"; hreflang="sw", 
<http://www.babla.co.th/english-polish/>; rel="alternate"; hreflang="th", <http://tr.bab.la/sozluk/ingilizce-lehce/>; rel="alternate"; hreflang="tr", <http://www.babla.vn/tieng-anh-tieng-ba-lan/>; rel="alternate"; hreflang="vi".

When decoded using ISO-8859-1, this header gets truncated and this also causes the header block parsing to stop. This means that we don't see the Content-Length header, causing the HTTP client to wait for connection closure to consider the body terminated.

Really the only correct fix for this is for http.client to stop insisting that the headers be decoded before they are parsed, and instead to decode *after*. That way, at least, user code can recover the headers and handle them more sensibly.

6d5d82f3-c90b-41da-bb7f-abdec4dbac80 commented 8 years ago

Simple repro case:

    import http.client
    conn = http.client.HTTPConnection('pl.bab.la')
    conn.request("GET", '/slownik/angielski-polski/')
    resp = conn.getresponse()
    resp.read()  # Hangs here

bitdancer commented 8 years ago

utf-8 headers are contrary to the http spec, aren't they? Or has that changed? (It's been a long time since I've looked at any http RFCs.)

This could be fixed by using SMTPUTF8 mode when parsing the headers, which in theory ought to be backward compatible. We could make SMTPUTF8 the default for email.policy.http, if this is correct per the RFCs.

6d5d82f3-c90b-41da-bb7f-abdec4dbac80 commented 8 years ago

Honestly, David, everything's a mess on this front. The authoritative document here is RFC 7230 Section 3.2.4 (https://tools.ietf.org/html/rfc7230#section-3.2.4). The last paragraph of that section reads:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

In the case of http.client, this actually maps pretty closely to Python 3's bytes object: header field values are basically ASCII + arbitrary opaque bytes. While UTF-8 is not strictly called out as allowed, neither is it called out as forbidden.

In this case, I'd say that there's no need to be too pedantic about Latin 1 at this stage in the pipeline. Python 3 is welcome to decode using Latin 1 *after* the header block has been split, because at least then it can be fixed up due to the round-tripping nature of Latin 1. But doing it here seems to just confuse the email parser.
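To make the round-tripping property concrete, here is a minimal sketch. It relies only on the fact that Latin-1 maps each of the 256 byte values to exactly one code point; the sample text is a fragment of the Link header above.

```python
# Every possible byte survives a Latin-1 decode/encode cycle unchanged.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# So UTF-8 bytes decoded as Latin-1 look wrong as text, but nothing is lost:
wire = "英语-波兰语".encode("utf-8")
as_text = wire.decode("latin-1")          # never raises, for any input
assert as_text.encode("latin-1") == wire  # original bytes are recoverable

# Strict UTF-8 decoding, by contrast, can fail on arbitrary bytes.
try:
    b"\xe5\x85".decode("utf-8")           # truncated multi-byte sequence
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```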

bitdancer commented 8 years ago

Well, email will happily parse bytes and treat the non-ascii data as opaque (though it does record errors in an internal data structure), but the python3 http api expects the parsed headers to be strings when you access them, so you'd just hit the decoding problem at that point rather than earlier.
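A small sketch of what parsing bytes looks like, assuming the default (compat32) policy and a cut-down header block built from the Link header in this report; the Content-Length value is invented for the example.

```python
import email

raw = (
    b"Link: <http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-"
    b"\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad/>\r\n"
    b"Content-Length: 42\r\n"   # made-up value for illustration
    b"\r\n"
)

# message_from_bytes() decodes as ASCII with surrogateescape, so the
# non-ASCII octets are carried through as opaque data.
msg = email.message_from_bytes(raw)

# The header after the non-ASCII one is still found.
print(msg["Content-Length"])

# The non-ASCII header is present, but what comes back is not
# necessarily a plain str under this policy.
print(type(msg["Link"]))
```

This only shows that the parser gets past the non-ASCII field; it does not address what type the http API should hand back to callers.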

This is a hard problem. Since headers *can* be latin1 (I'd forgotten that) SMTPUTF8 won't work. We are stuck against the problem that python makes a careful distinction between bytes and string, but http does not.

In theory we could pass bytes to email, and then provide a new API for getting at the "raw" (bytes) header so you can decode it however you want. That runs into backward compatibility problems, though, since we currently do decode from latin-1 and many programs are probably relying on that.

Throwing out an idea here: maybe having the http policy decode the parsed bytes header from latin-1 when headers are accessed through the normal API would preserve backward compatibility. I'm not too worried about back-compat in the http policy, since it is provisional until 3.6 comes out and I doubt anyone is currently using it.

vadmium commented 8 years ago

For the test case given, the main problem is actually that a header field is being incorrectly split on a Latin-1 “next line” control code U+0085. The problem is already described under bpo-22233. It looks like I wrote a patch for that a while ago, so it would be good to revisit and see if it is worth applying.
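The splitting behaviour can be seen without any network access. The sketch below uses a shortened header line containing the same `\xe5\x85\xb0` sequence (the UTF-8 encoding of 兰) and an invented Content-Length value; it compares `bytes.splitlines()` with `str.splitlines()` after a Latin-1 decode.

```python
wire = (
    b"Link: <http://www.babla.cn/\xe6\xb3\xa2\xe5\x85\xb0/>\r\n"
    b"Content-Length: 42\r\n"   # made-up value for illustration
)

# bytes.splitlines() only treats \n, \r and \r\n as line boundaries.
print(len(wire.splitlines()))   # two header lines

# After decoding as Latin-1, the byte 0x85 becomes U+0085 (NEL), which
# str.splitlines() does treat as a line boundary.
text = wire.decode("iso-8859-1")
print("\x85" in text)
print(len(text.splitlines()))   # the Link line is cut in two
```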

Also, the problem would have been less severe if bpo-24363 was addressed; I proposed a patch at bpo-26686 which may help.

Here are the relevant header fields returned by the server:

    >>> conn.request("GET", "/slownik/angielski-polski/")
    >>> pprint(conn.sock.recv(3333).splitlines(keepends=True))
    [b'HTTP/1.1 200 OK\r\n',
     . . .
     b'Link: <http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0'
     b'\xe8\xaf\xad/>; rel="alternate"; hreflang="zh-Hans", '
     . . .
     b'Transfer-Encoding: chunked\r\n',
     b'Content-Type: text/html;charset=UTF-8\r\n',
     b'\r\n',
     b'104c\r\n',
     b'<!DOCTYPE html>\n',
     . . .]

Regarding header value character encoding, revision cb09fdef19f5 is an example of where I assumed a Latin-1 transformation to handle non-ASCII redirect targets. Perhaps just document how the bytes are transformed, and how to get the original bytes back?

FWIW UTF-8 is used in RTSP, which is based on HTTP.

vadmium commented 8 years ago

Thanks to the fix for bpo-22233, now the response is parsed more sensibly, and the body can be read. The 0x85 byte now gets decoded with Latin-1:

    >>> print(ascii(resp.getheader("Link")[:100]))
    '<http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad/>; rel="alternate"; hreflang="zh-Hans", <http://cs.bab.la/slov'

Here is a patch to document how to get the original bytes back (by “encoding” to Latin-1). Other than that, I don’t think there is much left to do for this bug.
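For user code, the recovery step might look like the sketch below. `header_as_utf8` is a hypothetical helper written for this example, not something http.client provides, and falling back to the unchanged value is just one possible design choice.

```python
def header_as_utf8(value: str) -> str:
    """Reinterpret a Latin-1-decoded header value as UTF-8.

    Hypothetical helper: returns the value unchanged if it cannot be
    re-encoded to Latin-1 or the resulting bytes are not valid UTF-8.
    """
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


# The start of the Link header as returned by getheader() above.
link = (
    '<http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0'
    '\xe8\xaf\xad/>; rel="alternate"; hreflang="zh-Hans"'
)

print(header_as_utf8(link))
```

Whether UTF-8 is the right guess depends on the server; a value that really is Latin-1 text is usually not valid UTF-8 and is then returned as is.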

python / cpython

http.client truncates UTF-8 encoded headers #71903