python / cpython

The Python programming language
https://www.python.org

`http` module does not handle MIME encoded-words in headers #105530

Open michaelfm1211 opened 1 year ago

michaelfm1211 commented 1 year ago

Bug report

When a response contains a header whose value is in MIME encoded-word format (RFC 2047), the http module returns the raw encoded-word instead of decoding it. For example:

from http import client

conn = client.HTTPConnection("localhost", 8080)
conn.request("GET", "/")

# The server is configured to return the header X-Star: =?utf-8?b?4piF?=
# which should decode to the ★ character (U+2605).
resp = conn.getresponse()
print(resp.getheader('X-Star'))  # prints "=?utf-8?b?4piF?=", not "★"
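For comparison, the standard library's email.header module can already decode such a value once it has been read from the response. A minimal sketch, assuming the raw header string is available as shown above; the helper name decode_encoded_word is hypothetical and not part of http.client:

```python
from email.header import decode_header, make_header


def decode_encoded_word(raw: str) -> str:
    """Decode any RFC 2047 encoded-words in a raw header value."""
    # decode_header() splits the value into (data, charset) parts;
    # make_header() reassembles them into a Header that str() renders as text.
    return str(make_header(decode_header(raw)))


print(decode_encoded_word("=?utf-8?b?4piF?="))  # prints "★"
```

Values without encoded-words pass through unchanged, so a caller could apply this to any header it chooses to treat as encoded-word text.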

Additionally, setting a header to a string that contains a character outside ISO-8859-1 raises a UnicodeEncodeError. This could be avoided by falling back to MIME encoded-word. For example:

from http import client

conn = client.HTTPConnection("localhost", 8080)
# request() raises UnicodeEncodeError: 'latin-1' codec can't encode
# character '\u2605' in position 0: ordinal not in range(256)
conn.request("GET", "/", headers={
    "X-Star": "★"
})
conn.close()
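The proposed fallback could look roughly like the following. This is only a sketch under assumptions: the function name encode_header_value is hypothetical, and it is not the code from any linked PR. It tries latin-1 first, as http.client does today, and only then switches to an RFC 2047 encoded-word built with email.header.Header:

```python
from email.header import Header


def encode_header_value(value: str) -> bytes:
    """Encode a header value as latin-1, falling back to RFC 2047."""
    try:
        return value.encode("latin-1")
    except UnicodeEncodeError:
        # maxlinelen=0 asks Header.encode() not to fold the value across
        # lines, which matters because HTTP header values are single-line.
        return Header(value, "utf-8").encode(maxlinelen=0).encode("ascii")


print(encode_header_value("plain"))  # latin-1 path, value is unchanged
print(encode_header_value("★"))      # fallback path, an encoded-word
```

Values that already fit in latin-1 are sent exactly as before, which is why the sending portion is not expected to break existing callers.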

michaelfm1211 commented 1 year ago

After the full test suite failed on my first PR for this issue (#105531), I looked into this a bit more. I think the change would be better as two PRs:

  1. #105621 is the sending portion. This part should not introduce any breaking changes and should be relatively straightforward: it handles the potential UnicodeEncodeError by falling back to RFC 2047 encoded-word.

  2. The other part will be the receiving portion. So far I've thought of two ways to do this: either upgrade http.client to parse headers using the default email policy rather than email.policy.compat32 (which is described in more depth in issue #105622), or do it as a standalone change.
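The difference between the two email policies for the receiving side can be seen without a server. A small sketch, assuming a header block like the one in the bug report; the helper read_x_star is hypothetical and only wraps email.parser.Parser:

```python
from email import policy
from email.parser import Parser

# A header block like the one the example server returns.
RAW_HEADERS = "X-Star: =?utf-8?b?4piF?=\r\n\r\n"


def read_x_star(raw: str, pol: policy.Policy) -> str:
    """Parse a header block with the given policy and return X-Star."""
    message = Parser(policy=pol).parsestr(raw)
    return str(message["X-Star"])


# compat32 leaves the encoded-word untouched.
print(read_x_star(RAW_HEADERS, policy.compat32))  # prints "=?utf-8?b?4piF?="
# The default policy decodes encoded-words in the header value.
print(read_x_star(RAW_HEADERS, policy.default))   # prints "★"
```

Switching policies changes how every header is parsed, not just this one, which is presumably why it is tracked as a separate issue.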

davidism commented 1 year ago

This does not seem correct. Can you point to the modern standard from https://httpwg.org/specs/ (or even an old standard) that says that HTTP clients should encode headers like this, or that servers should decode them automatically?

HTTP headers have a few different "common" formats, but each HTTP/1.1 header really needs to be treated on a case-by-case basis as many have their own quirks. The only common encoding format I've seen and implemented for Werkzeug's header parsing is for dict-like headers: Header: key1*=UTF-8''%ab, key2*=.... I would be very surprised if http.server suddenly started returning pre-decoded UTF-8 data, especially for an old email format instead of what's commonly used in HTTP.
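For reference, the key*=UTF-8''... form mentioned above is the extended parameter syntax from RFC 8187: a charset, an optional language tag, and a percent-encoded value, separated by single quotes. A minimal sketch of decoding one such value; the function name decode_ext_value is hypothetical, and this is not Werkzeug's implementation:

```python
from urllib.parse import unquote


def decode_ext_value(raw: str) -> str:
    """Decode an extended value of the form charset'language'pct-encoded."""
    # The language tag between the two quotes is optional and ignored here.
    charset, _language, encoded = raw.split("'", 2)
    return unquote(encoded, encoding=charset, errors="strict")


print(decode_ext_value("UTF-8''%E2%98%85"))  # prints "★"
```

Unlike encoded-word, this encoding applies to individual parameters within a header value rather than to the header value as a whole.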