python / cpython

The Python programming language
https://www.python.org
Other
63.77k stars 30.54k forks source link

Newline embedded in email RFC2047 encoding raises exception when parsed #114906

Open fsc-eriker opened 10 months ago

fsc-eriker commented 10 months ago

Bug report

Bug description:

I came across some messages where the sender had embedded a newline in the From: header's display string. This was (ostensibly) a legitimate sender, possibly hoping to get more real estate for their message in the recipient's inbox listing or something, or just making a configuration mistake.

Unfortunately, this crashes Python's email parser with policy=default. (The legacy parser works fine, simply because it doesn't attempt to unpeel RFC2047 encoding by default.)

>>> from email import message_from_bytes
>>> from email.policy import default
>>> message = message_from_bytes(b'From: =?UTF-8?Q?=0AMy_self?= <me@example.org>\r\n\r\nHello\r\n', policy=default)
>>> message.items()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/fsc-eriker/git/cpython/email/message.py", line 491, in items
    return [(k, self.policy.header_fetch_parse(k, v))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/fsc-eriker/git/cpython/email/message.py", line 491, in <listcomp>
    return [(k, self.policy.header_fetch_parse(k, v))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/fsc-eriker/git/cpython/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/fsc-eriker/git/cpython/email/headerregistry.py", line 604, in __call__
    return self[name](name, value)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/fsc-eriker/git/cpython/email/headerregistry.py", line 192, in __new__
    cls.parse(value, kwds)
  File "/Users/fsc-eriker/git/cpython/email/headerregistry.py", line 346, in parse
    [Address(mb.display_name or '',
  File "/Users/fsc-eriker/git/cpython/email/headerregistry.py", line 346, in <listcomp>
    [Address(mb.display_name or '',
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/fsc-eriker/git/cpython/email/headerregistry.py", line 33, in __init__
    raise ValueError("invalid arguments; address parts cannot contain CR or LF")
ValueError: invalid arguments; address parts cannot contain CR or LF
>>> 

While the error message is correct, it should probably not raise an exception; perhaps instead register a defect?

I have tested this on 3.11 out of the box and with the current sources from the cpython Github repo; but I would expect it to manifest on all versions of the modern email module and all platforms.

CPython versions tested on:

3.11

Operating systems tested on:

macOS

terryjreedy commented 10 months ago

A crash is when the process stops without finishing or raising an exception. Stopping on uncaught exceptions like this is normal behavior. Whether this exception is a bug or a feature change request depends on what, if anything, the doc specifies. If you want to claim 'bug' please find and report a violated specification. Without reading the email doc, I can imagine that users are expected to catch such input errors and handle them however they want -- ignore the email, send a rejection to the sender if possible, or display with a modified header -- and that there are users depending on catching this exception.

fsc-eriker commented 9 months ago

Sorry if my wording is imprecise. I'll be happy to reclassify this as a feature request. The problem, as I see it, is that the current architecture allows for no user-level control over this behavior; there is no way to proceed with a message with this type of defect. My perception was that the mechanism to register defects was implemented precisely so that users of the library can decide for themselves how exactly to handle any specific email message defects.

Subjectively, I see a lot of scenarios where users (often with limited knowledge of email) want to use Python to process some messages, and get a traceback which they have no way to avoid. A simple and common use case is to extract attachments in bulk; you don't really care if the address of the sender was valid, you just want to loop over all messages, and for each find whether it contains an attachment, and if so extract it.

fsc-eriker commented 9 months ago

The raise on line 33 of headerregistry.py is quite deliberate and unconditional. However, for another recent bug report about what seems to me like a rather similar problem, I got the feedback that the problem should be handled by registering a defect, not by raising an exception, which is how I had originally tackled it in #108133

I'd appreciate additional guidance on how to proceed with this. I can see three possibilities;

For all of these, an obvious complication is that they introduce a new convention, and then the rest of the code should be adapted to obey the same convention. To quickly estimate the size of the problem,

$ grep -Fworc raise Lib/email | grep -Fve .py:0 -e __pycache__
Lib/email/contentmanager.py:9
Lib/email/_policybase.py:10
Lib/email/header.py:4
Lib/email/_encoded_words.py:1
Lib/email/_header_value_parser.py:53
Lib/email/policy.py:1
Lib/email/message.py:12
Lib/email/generator.py:1
Lib/email/utils.py:3
Lib/email/charset.py:2
Lib/email/quoprimime.py:1
Lib/email/mime/message.py:1
Lib/email/mime/application.py:1
Lib/email/mime/nonmultipart.py:1
Lib/email/mime/audio.py:1
Lib/email/mime/image.py:1
Lib/email/feedparser.py:3
Lib/email/architecture.rst:0
Lib/email/headerregistry.py:7