michaelfm1211 commented 1 year ago

Bug report

When a response carries an HTTP header value in MIME encoded-word format (per RFC 2047), the http module returns the value as-is instead of decoding it. For example:

from http import client

conn = client.HTTPConnection("localhost", 8080)
conn.request("GET", "/")

# the server is configured to return the header X-Star: =?utf-8?b?4piF?=
# which should be the ★ character (U+2605) 
resp = conn.getresponse()
print(resp.getheader('X-Star'))  # prints "=?utf-8?b?4piF?=", not "★"
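
For comparison, a caller can decode such a value by hand today. The following is a minimal sketch, assuming the value really is an RFC 2047 encoded-word; `decode_encoded_word` is a hypothetical helper name, not part of `http.client`:

```python
from email.header import decode_header, make_header


def decode_encoded_word(value: str) -> str:
    """Decode RFC 2047 encoded-words in a header value (hypothetical helper).

    Values without encoded-words are returned unchanged.
    """
    # decode_header() splits the value into (bytes_or_str, charset) pairs;
    # make_header() reassembles them into a Header that str() renders as text.
    return str(make_header(decode_header(value)))


print(decode_encoded_word("=?utf-8?b?4piF?="))  # prints "★"
```

This uses only the `email.header` module from the standard library, so it works regardless of whether `http.client` ever decodes headers itself.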

Additionally, setting a header to a string that contains a character outside ISO-8859-1 raises a UnicodeEncodeError. This could be solved by falling back to MIME encoded-word. For example:

from http import client

conn = client.HTTPConnection("localhost", 8080)
conn.request("GET", "/", headers={
    "X-Star": "★"
})  # UnicodeEncodeError: 'latin-1' codec can't encode character '\u2605' in position 0: ordinal not in range(256)
conn.close()
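
The fallback proposed here could look roughly like the sketch below when done on the caller's side. `encode_header_value` is a hypothetical name, and choosing UTF-8 as the charset is an assumption; this is not how `http.client` currently behaves:

```python
from email.header import Header, decode_header, make_header


def encode_header_value(value: str) -> str:
    """Return value unchanged if it fits in latin-1, else as an encoded-word.

    Hypothetical helper illustrating the proposed fallback.
    """
    try:
        value.encode("latin-1")
        return value
    except UnicodeEncodeError:
        # maxlinelen=0 disables line wrapping, since a folded value
        # would not be a valid single-line HTTP header value.
        return Header(value, charset="utf-8").encode(maxlinelen=0)


encoded = encode_header_value("★")
print(encoded)
# Round trip back to the original text.
print(str(make_header(decode_header(encoded))))
```

The result is pure ASCII, so it can be passed in `headers=` without triggering the `UnicodeEncodeError`. Whether the server understands encoded-words is a separate question.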

Your environment

CPython versions tested on: 3.11.3
Operating system and architecture: macOS 11.7.4, x86_64

Linked PRs

gh-105531
gh-105621

michaelfm1211 commented 1 year ago

After the full test suite failed on my first PR for this issue (#105531), I looked into this a bit more. I think the change would be better as two PRs:

#105621 is the sending portion. This part should not introduce any breaking changes and should be relatively straightforward. It just handles the potential UnicodeEncodeError by falling back to RFC 2047 encoded-word.
The other part will be the receiving portion. So far I've thought of two ways to do this: either upgrade http.client to parse headers using the default email policy rather than email.policy.compat32 (which is described in more depth in issue #105622), or do it as a standalone change.
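
The difference between the two email policies can be seen outside of `http.client` with a small sketch. `parse_header_value` is a hypothetical helper, and the raw header block is made up to match the example above:

```python
from email import policy
from email.parser import Parser


def parse_header_value(raw_headers: str, name: str, pol) -> str:
    """Parse a raw header block under the given policy and return one value."""
    msg = Parser(policy=pol).parsestr(raw_headers)
    return str(msg[name])


raw = "X-Star: =?utf-8?b?4piF?=\r\n\r\n"

# compat32 keeps the encoded-word exactly as received.
print(parse_header_value(raw, "X-Star", policy.compat32))

# The default policy decodes encoded-words in unstructured headers.
print(parse_header_value(raw, "X-Star", policy.default))
```

Under these assumptions the first call returns the raw encoded-word and the second returns the decoded character, which is the behavior change the policy switch would bring to every header.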

davidism commented 1 year ago

This does not seem correct. Can you point to the modern standard from https://httpwg.org/specs/ (or even an old standard) that says that HTTP clients should encode headers like this, or that servers should decode them automatically?

HTTP headers have a few different "common" formats, but each HTTP/1.1 header really needs to be treated on a case-by-case basis as many have their own quirks. The only common encoding format I've seen and implemented for Werkzeug's header parsing is for dict-like headers: Header: key1*=UTF-8''%ab, key2*=.... I would be very surprised if http.server suddenly started returning pre-decoded UTF-8 data, especially for an old email format instead of what's commonly used in HTTP.
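
The `key*=UTF-8''%ab` form above appears to be the extended parameter syntax defined in RFC 8187 (charset, optional language tag, then a percent-encoded value). A minimal sketch of decoding one such value follows; `decode_ext_value` is a hypothetical helper, not Werkzeug's implementation, and it does not validate the charset or language tag:

```python
from urllib.parse import unquote


def decode_ext_value(ext_value: str) -> str:
    """Decode a charset'language'percent-encoded value (hypothetical helper)."""
    # The value has three parts separated by single quotes; the language
    # tag in the middle is often empty.
    charset, _language, encoded = ext_value.split("'", 2)
    return unquote(encoded, encoding=charset, errors="strict")


print(decode_ext_value("UTF-8''%E2%98%85"))  # prints "★"
```

Unlike encoded-words, this form applies only to individual parameters of headers that explicitly allow it, which fits the point that each header has to be handled on its own terms.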

python / cpython

`http` module does not handle MIME encoded-words in headers #105530
