Open 77c7e526-d98e-4d91-a6d8-ef8b00c6f364 opened 7 years ago
Hello.
I found a bug that causes incorrect handling of some values of "Content-Type" header. When you retrieve any URL with "Content-Type: message/rfc822" header, additional payload is added to HTTPMessage. In some cases it causes annoing warnings. Here is a code from packages requests and urllib3 that can cause warning output:
The issue arises from the class email.feedparser.FeedParser that handles HTTP and email headers in the same way. So when it handles headers with the message with content-type "message/*" and some other, method _parsegen() tries to parse it like a message with another message (see comments to this condition: https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l295). If headers-only argument is set to True, we can avoid some redundant checks on HTTP headers (see this condition: https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l244).
I read the comment above (https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l241) but I am a newbie and I don't see any other solution to my problem for now. I can rewrite the patch if you show me a better way to fix this issue.
A patch with the unit test is atached.
IMO http *should* be using headersonly=True, so while I haven't looked into this issue the solution seems plausible to me. I'm not entirely sure why it is considered a backward-compatibility hack, but I don't see any likelyhood that the headersonly API will go way.
There is an inconsistency when parsing with headersonly=True. According to the documentation, get_payload() with message/rfc822 should should return a list of Message objects, not a string. But using headersonly=True produces a non-multipart Message object:
>>> m = Parser().parsestr("Content-Type: message/rfc822\r\n\r\n", headersonly=True)
>>> m.get_content_type()
'message/rfc822'
>>> m.is_multipart() # Doc says True
False
>>> m.get_payload() # Doc says list of Message objects
''
Related to this, setting headersonly=True can also cause a internal inconsistency. Maybe this is why it was called a “hack”:
>>> Parser().parsestr("Content-Type: message/delivery-status\r\nInvalid line\r\n\r\n", headersonly=True).as_string()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.5/email/message.py", line 159, in as_string
g.flatten(self, unixfrom=unixfrom)
File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
self._write(msg)
File "/usr/lib/python3.5/email/generator.py", line 181, in _write
self._dispatch(msg)
File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
meth(msg)
File "/usr/lib/python3.5/email/generator.py", line 331, in _handle_message_delivery_status
g.flatten(part, unixfrom=False, linesep=self._NL)
File "/usr/lib/python3.5/email/generator.py", line 106, in flatten
old_msg_policy = msg.policy
AttributeError: 'str' object has no attribute 'policy'
I think it may be best only change get_payload() to return a string in the next Python version (3.7), with appropriate documentation updates. For existing Python versions, perhaps urllib3 could check if the list returned by get_payload() only has trivial empty Message objects (no header fields and only empty payloads themselves).
If we agree that only a feature change for 3.7 is appropriate, there are other problems with the current parsing of HTTP headers that could also be looked at:
I'm not surprised that trying to render a message parsed with 'headersonly' fails. headersonly treats the entire message body as a single string payload. I'm not sure what the correct behavior should be for the email package, but the fact that this doesn't work currently shouldn't matter to the http package. Given the error involves policy, it is quite possible as_string used to work in this context and I broke it when introducing policy because there is no test that covers this case. I'm guessing there is no test because turning a message parsed with headersonly back into a string wasn't considered to be useful.
As for the multipart thing, that is actually consistent with the docs. get_content_type gives you the value of the header, is_multipart effectively reports whether or not the payload is a list, and when parsed with headersonly, the body is a string not a list, which is documented. Granted, you have to connect those dots yourself, so a sentence about that *could* be added to the headersonly docs.
All of which should be in a separate issue from this one.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['3.7', 'expert-email']
title = 'Incorrect handling of HTTP response with "Content-Type: message/rfc822" header'
updated_at =
user = 'https://bugs.python.org/brokenenglish'
```
bugs.python.org fields:
```python
activity =
actor = 'cschmidbauer'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['email']
creation =
creator = 'brokenenglish'
dependencies = []
files = ['46395']
hgrepos = []
issue_num = 29353
keywords = ['patch']
message_count = 4.0
messages = ['286106', '286110', '286126', '291176']
nosy_count = 5.0
nosy_names = ['barry', 'orsenthil', 'r.david.murray', 'martin.panter', 'brokenenglish']
pr_nums = ['12214']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue29353'
versions = ['Python 3.5', 'Python 3.6', 'Python 3.7']
```