python / cpython

The Python programming language
https://www.python.org
Other
62.68k stars 30.06k forks source link

Incorrect handling of HTTP response with "Content-Type: message/rfc822" header #73539

Open 77c7e526-d98e-4d91-a6d8-ef8b00c6f364 opened 7 years ago

77c7e526-d98e-4d91-a6d8-ef8b00c6f364 commented 7 years ago
BPO 29353
Nosy @warsaw, @orsenthil, @bitdancer, @vadmium
PRs
  • python/cpython#12214
  • Files
  • rfc822_httpmessage_payload.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.7', 'expert-email'] title = 'Incorrect handling of HTTP response with "Content-Type: message/rfc822" header' updated_at = user = 'https://bugs.python.org/brokenenglish' ``` bugs.python.org fields: ```python activity = actor = 'cschmidbauer' assignee = 'none' closed = False closed_date = None closer = None components = ['email'] creation = creator = 'brokenenglish' dependencies = [] files = ['46395'] hgrepos = [] issue_num = 29353 keywords = ['patch'] message_count = 4.0 messages = ['286106', '286110', '286126', '291176'] nosy_count = 5.0 nosy_names = ['barry', 'orsenthil', 'r.david.murray', 'martin.panter', 'brokenenglish'] pr_nums = ['12214'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = None url = 'https://bugs.python.org/issue29353' versions = ['Python 3.5', 'Python 3.6', 'Python 3.7'] ```

    77c7e526-d98e-4d91-a6d8-ef8b00c6f364 commented 7 years ago

    Hello.

    I found a bug that causes incorrect handling of some values of "Content-Type" header. When you retrieve any URL with "Content-Type: message/rfc822" header, additional payload is added to HTTPMessage. In some cases it causes annoing warnings. Here is a code from packages requests and urllib3 that can cause warning output:

    1. Checking and raising excpection in urllib3: https://github.com/shazow/urllib3/blob/0fb5e083b2adf7618db8c26e8e50206de09dd845/urllib3/util/response.py#L61-L66
    2. The same code packaged in requests: https://github.com/kennethreitz/requests/blob/362da46e9a46da6e86e1907f03014384ab210151/requests/packages/urllib3/util/response.py#L61-L66
    3. Logging the error to output: https://github.com/kennethreitz/requests/blob/362da46e9a46da6e86e1907f03014384ab210151/requests/packages/urllib3/connectionpool.py#L402-L407

    The issue arises from the class email.feedparser.FeedParser that handles HTTP and email headers in the same way. So when it handles headers with the message with content-type "message/*" and some other, method _parsegen() tries to parse it like a message with another message (see comments to this condition: https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l295). If headers-only argument is set to True, we can avoid some redundant checks on HTTP headers (see this condition: https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l244).

    I read the comment above (https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l241) but I am a newbie and I don't see any other solution to my problem for now. I can rewrite the patch if you show me a better way to fix this issue.

    A patch with the unit test is atached.

    bitdancer commented 7 years ago

    IMO http *should* be using headersonly=True, so while I haven't looked into this issue the solution seems plausible to me. I'm not entirely sure why it is considered a backward-compatibility hack, but I don't see any likelyhood that the headersonly API will go way.

    vadmium commented 7 years ago

    There is an inconsistency when parsing with headersonly=True. According to the documentation, get_payload() with message/rfc822 should should return a list of Message objects, not a string. But using headersonly=True produces a non-multipart Message object:

    >>> m = Parser().parsestr("Content-Type: message/rfc822\r\n\r\n", headersonly=True)
    >>> m.get_content_type()
    'message/rfc822'
    >>> m.is_multipart()  # Doc says True
    False
    >>> m.get_payload()  # Doc says list of Message objects
    ''

    Related to this, setting headersonly=True can also cause a internal inconsistency. Maybe this is why it was called a “hack”:

    >>> Parser().parsestr("Content-Type: message/delivery-status\r\nInvalid line\r\n\r\n", headersonly=True).as_string()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.5/email/message.py", line 159, in as_string
        g.flatten(self, unixfrom=unixfrom)
      File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
        self._write(msg)
      File "/usr/lib/python3.5/email/generator.py", line 181, in _write
        self._dispatch(msg)
      File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
        meth(msg)
      File "/usr/lib/python3.5/email/generator.py", line 331, in _handle_message_delivery_status
        g.flatten(part, unixfrom=False, linesep=self._NL)
      File "/usr/lib/python3.5/email/generator.py", line 106, in flatten
        old_msg_policy = msg.policy
    AttributeError: 'str' object has no attribute 'policy'

    I think it may be best only change get_payload() to return a string in the next Python version (3.7), with appropriate documentation updates. For existing Python versions, perhaps urllib3 could check if the list returned by get_payload() only has trivial empty Message objects (no header fields and only empty payloads themselves).

    If we agree that only a feature change for 3.7 is appropriate, there are other problems with the current parsing of HTTP headers that could also be looked at:

    bitdancer commented 7 years ago

    I'm not surprised that trying to render a message parsed with 'headersonly' fails. headersonly treats the entire message body as a single string payload. I'm not sure what the correct behavior should be for the email package, but the fact that this doesn't work currently shouldn't matter to the http package. Given the error involves policy, it is quite possible as_string used to work in this context and I broke it when introducing policy because there is no test that covers this case. I'm guessing there is no test because turning a message parsed with headersonly back into a string wasn't considered to be useful.

    As for the multipart thing, that is actually consistent with the docs. get_content_type gives you the value of the header, is_multipart effectively reports whether or not the payload is a list, and when parsed with headersonly, the body is a string not a list, which is documented. Granted, you have to connect those dots yourself, so a sentence about that *could* be added to the headersonly docs.

    All of which should be in a separate issue from this one.