Incorrect handling of HTTP response with "Content-Type: message/rfc822" header

77c7e526-d98e-4d91-a6d8-ef8b00c6f364 commented 7 years ago

BPO	29353
Nosy	@warsaw, @orsenthil, @bitdancer, @vadmium
PRs	python/cpython#12214
Files	rfc822_httpmessage_payload.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.7', 'expert-email'] title = 'Incorrect handling of HTTP response with "Content-Type: message/rfc822" header' updated_at = user = 'https://bugs.python.org/brokenenglish' ``` bugs.python.org fields: ```python activity = actor = 'cschmidbauer' assignee = 'none' closed = False closed_date = None closer = None components = ['email'] creation = creator = 'brokenenglish' dependencies = [] files = ['46395'] hgrepos = [] issue_num = 29353 keywords = ['patch'] message_count = 4.0 messages = ['286106', '286110', '286126', '291176'] nosy_count = 5.0 nosy_names = ['barry', 'orsenthil', 'r.david.murray', 'martin.panter', 'brokenenglish'] pr_nums = ['12214'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = None url = 'https://bugs.python.org/issue29353' versions = ['Python 3.5', 'Python 3.6', 'Python 3.7'] ```

77c7e526-d98e-4d91-a6d8-ef8b00c6f364 commented 7 years ago

Hello.

I found a bug that causes incorrect handling of some values of "Content-Type" header. When you retrieve any URL with "Content-Type: message/rfc822" header, additional payload is added to HTTPMessage. In some cases it causes annoing warnings. Here is a code from packages requests and urllib3 that can cause warning output:

Checking and raising excpection in urllib3: https://github.com/shazow/urllib3/blob/0fb5e083b2adf7618db8c26e8e50206de09dd845/urllib3/util/response.py#L61-L66
The same code packaged in requests: https://github.com/kennethreitz/requests/blob/362da46e9a46da6e86e1907f03014384ab210151/requests/packages/urllib3/util/response.py#L61-L66
Logging the error to output: https://github.com/kennethreitz/requests/blob/362da46e9a46da6e86e1907f03014384ab210151/requests/packages/urllib3/connectionpool.py#L402-L407

The issue arises from the class email.feedparser.FeedParser that handles HTTP and email headers in the same way. So when it handles headers with the message with content-type "message/*" and some other, method _parsegen() tries to parse it like a message with another message (see comments to this condition: https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l295). If headers-only argument is set to True, we can avoid some redundant checks on HTTP headers (see this condition: https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l244).

I read the comment above (https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l241) but I am a newbie and I don't see any other solution to my problem for now. I can rewrite the patch if you show me a better way to fix this issue.

A patch with the unit test is atached.

bitdancer commented 7 years ago

IMO http *should* be using headersonly=True, so while I haven't looked into this issue the solution seems plausible to me. I'm not entirely sure why it is considered a backward-compatibility hack, but I don't see any likelyhood that the headersonly API will go way.

vadmium commented 7 years ago

There is an inconsistency when parsing with headersonly=True. According to the documentation, get_payload() with message/rfc822 should should return a list of Message objects, not a string. But using headersonly=True produces a non-multipart Message object:

>>> m = Parser().parsestr("Content-Type: message/rfc822\r\n\r\n", headersonly=True)
>>> m.get_content_type()
'message/rfc822'
>>> m.is_multipart()  # Doc says True
False
>>> m.get_payload()  # Doc says list of Message objects
''

Related to this, setting headersonly=True can also cause a internal inconsistency. Maybe this is why it was called a “hack”:

>>> Parser().parsestr("Content-Type: message/delivery-status\r\nInvalid line\r\n\r\n", headersonly=True).as_string()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/email/message.py", line 159, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
    self._write(msg)
  File "/usr/lib/python3.5/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
    meth(msg)
  File "/usr/lib/python3.5/email/generator.py", line 331, in _handle_message_delivery_status
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "/usr/lib/python3.5/email/generator.py", line 106, in flatten
    old_msg_policy = msg.policy
AttributeError: 'str' object has no attribute 'policy'

I think it may be best only change get_payload() to return a string in the next Python version (3.7), with appropriate documentation updates. For existing Python versions, perhaps urllib3 could check if the list returned by get_payload() only has trivial empty Message objects (no header fields and only empty payloads themselves).

If we agree that only a feature change for 3.7 is appropriate, there are other problems with the current parsing of HTTP headers that could also be looked at:

Only a blank line should end a header section (bpo-24363, bpo-26686)
“From” line should be a defect
Use “email” package’s HTTP parsing policy
Don’t assume Latin-1 encoding (bpo-27716)
Avoid double-handling (header lines are parsed in http.client, then joined together and parsed again in email.feedparser)

bitdancer commented 7 years ago

I'm not surprised that trying to render a message parsed with 'headersonly' fails. headersonly treats the entire message body as a single string payload. I'm not sure what the correct behavior should be for the email package, but the fact that this doesn't work currently shouldn't matter to the http package. Given the error involves policy, it is quite possible as_string used to work in this context and I broke it when introducing policy because there is no test that covers this case. I'm guessing there is no test because turning a message parsed with headersonly back into a string wasn't considered to be useful.

As for the multipart thing, that is actually consistent with the docs. get_content_type gives you the value of the header, is_multipart effectively reports whether or not the payload is a list, and when parsed with headersonly, the body is a string not a list, which is documented. Granted, you have to connect those dots yourself, so a sentence about that *could* be added to the headersonly docs.

All of which should be in a separate issue from this one.

python / cpython

Incorrect handling of HTTP response with "Content-Type: message/rfc822" header #73539