python / cpython

The Python programming language
https://www.python.org/

email.parser header-only parsing records MultipartInvariantViolationDefect for valid multipart emails #106186

Open me-and opened 1 year ago

me-and commented 1 year ago

Bug report

A valid multipart email message, when parsed with email.parser.HeaderParser(policy=email.policy.default), will record an email.errors.MultipartInvariantViolationDefect.

If the parser isn't going to attempt to parse the message body, it shouldn't report the unparsed body as a defect.

Simple test script:

import email.parser
import email.policy

email_str = '''\
Date: 01 Jan 2001 00:01+0000
From: arthur@example.example
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=autocracy

--autocracy
Content-Type: text/plain

By hanging on to outdated imperialist dogma which perpetuates the economic and
social differences in our society.

--autocracy
Content-Type: text/html

<html><body><p>By hanging on to outdated imperialist dogma which perpetuates
the economic and social differences in our society.</p></body></html>

--autocracy--
'''

full_parser = email.parser.Parser(policy=email.policy.default)
parsed_email_full = full_parser.parsestr(email_str)
print(parsed_email_full.defects)  # Prints [] as expected

header_parser = email.parser.HeaderParser(policy=email.policy.default)
parsed_email_headers_only = header_parser.parsestr(email_str)
print(parsed_email_headers_only.defects)  # Prints [MultipartInvariantViolationDefect()]

htsedebenham commented 11 months ago

I believe this is the expected behaviour. Per the documentation, HeaderParser acts like Parser with headersonly=True. With the test script modified as follows, Parser with headersonly=True also prints [MultipartInvariantViolationDefect()].

email_str = '''\
Date: 01 Jan 2001 00:01+0000
From: arthur@example.example
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=autocracy

--autocracy
Content-Type: text/plain

By hanging on to outdated imperialist dogma which perpetuates the economic and
social differences in our society.

--autocracy
Content-Type: text/html

<html><body><p>By hanging on to outdated imperialist dogma which perpetuates
the economic and social differences in our society.</p></body></html>

--autocracy--
'''

full_parser = email.parser.Parser(policy=email.policy.default)
parsed_email_full = full_parser.parsestr(email_str)
print(parsed_email_full.defects)  # Prints [] as reported

full_parser = email.parser.Parser(policy=email.policy.default)
parsed_email_full = full_parser.parsestr(email_str, headersonly=True)
print(parsed_email_full.defects)  # Prints [MultipartInvariantViolationDefect()]

header_parser = email.parser.HeaderParser(policy=email.policy.default)
parsed_email_headers_only = header_parser.parsestr(email_str)
print(parsed_email_headers_only.defects)  # Prints [MultipartInvariantViolationDefect()]
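
The equivalence between HeaderParser and Parser with headersonly=True can also be checked without relying on which defects a given Python version records. The following is a minimal sketch, not part of the original report: the shortened SAMPLE message and the describe helper are illustrative assumptions. Both header-only paths leave the body as a single unparsed string, so is_multipart() is false even though the Content-Type claims multipart, which is the situation the MultipartInvariantViolationDefect documentation describes.

```python
import email.parser
import email.policy

# Shortened, hypothetical stand-in for the message in the report.
SAMPLE = (
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/mixed; boundary=autocracy\n"
    "\n"
    "--autocracy\n"
    "Content-Type: text/plain\n"
    "\n"
    "first part\n"
    "--autocracy\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>second part</p>\n"
    "--autocracy--\n"
)


def describe(msg):
    """Summarise how far the parser got (illustrative helper)."""
    return {
        "maintype": msg.get_content_maintype(),
        "is_multipart": msg.is_multipart(),
        "payload_type": type(msg.get_payload()).__name__,
    }


policy = email.policy.default
# Full parse: the body is split into sub-messages.
full = email.parser.Parser(policy=policy).parsestr(SAMPLE)
# Two ways of asking for a header-only parse.
flagged = email.parser.Parser(policy=policy).parsestr(SAMPLE, headersonly=True)
header_only = email.parser.HeaderParser(policy=policy).parsestr(SAMPLE)

print(describe(full))
print(describe(flagged))
print(describe(header_only))
```

The two header-only results should be identical to each other and differ from the full parse only in how the payload is held.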

htsedebenham commented 11 months ago

I see the issue, looking into it now.

ambv commented 11 months ago

Per the documentation of Parser.parse():

Optional headersonly is a flag specifying whether to stop parsing after reading the headers or not. The default is False, meaning it parses the entire contents of the file.

From this reading, the issue is valid and the fix in the attached PR is the correct bugfix.
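
On a Python version that still records the defect for header-only parsing, a caller who only cares about the headers could filter it out. This is a rough sketch of one possible workaround, not the fix from the PR: header_defects is a hypothetical helper name, and it assumes that discarding this one defect type is acceptable for the caller.

```python
import email.errors
import email.parser
import email.policy


def header_defects(text):
    """Parse headers only and drop the multipart-invariant defect.

    Hypothetical helper: the body is never parsed here, so a missing
    list of subparts says nothing about whether the message is valid.
    """
    parser = email.parser.HeaderParser(policy=email.policy.default)
    msg = parser.parsestr(text)
    return [
        defect
        for defect in msg.defects
        if not isinstance(defect, email.errors.MultipartInvariantViolationDefect)
    ]
```

The helper returns the same list whether or not the running interpreter records the defect, so it should not need to change once a fix is in place.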