email.parser stops parsing headers too soon when given a defective message.

7be85f3c-f464-4458-a888-de7b31a18e30 commented 8 years ago

BPO	26686
Nosy	@warsaw, @msapiro, @bitdancer, @vadmium
Files	continue-header.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-bug', 'library', 'expert-email'] title = 'email.parser stops parsing headers too soon when given a defective message.' updated_at = user = 'https://github.com/msapiro' ``` bugs.python.org fields: ```python activity = actor = 'martin.panter' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)', 'email'] creation = creator = 'msapiro' dependencies = [] files = ['43374'] hgrepos = [] issue_num = 26686 keywords = ['patch'] message_count = 7.0 messages = ['262750', '262751', '262776', '268438', '272295', '272332', '272381'] nosy_count = 4.0 nosy_names = ['barry', 'msapiro', 'r.david.murray', 'martin.panter'] pr_nums = [] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue26686' versions = ['Python 2.7', 'Python 3.5', 'Python 3.6'] ```

7be85f3c-f464-4458-a888-de7b31a18e30 commented 8 years ago

Given an admittedly defective (the folded Content-Type: isn't indented) message part with the following headers/body

------------------------------- Content-Disposition: inline; filename="04EBD_xxxx.xxxx_A546BB.zip" Content-Type: application/x-rar-compressed; x-unix-mode=0600; name="04EBD_xxxx.xxxx_A546BB.zip" Content-Transfer-Encoding: base64

UmFyIRoHAM+QcwAADQAAAAAAAABKRXQgkC4ApAMAAEAHAAACJLrQXYFUfkgdMwkAIAAAAGEw ZjEwZi5qcwDwrrI/DB2NDI0TzcGb3Gpb8HzsS0UlpwELvdyWnVaBQt7Sl2zbJpx1qqFCGGk6 ... -------------------------------

email.parser parses the headers as

------------------------------- Content-Disposition: inline; filename="04EBD_xxxx.xxxx_A546BB.zip" Content-Type: application/x-rar-compressed; x-unix-mode=0600; -------------------------------

and the body as

-------------------------------

name="04EBD_xxxx.xxxx_A546BB.zip"
Content-Transfer-Encoding: base64

UmFyIRoHAM+QcwAADQAAAAAAAABKRXQgkC4ApAMAAEAHAAACJLrQXYFUfkgdMwkAIAAAAGEw ZjEwZi5qcwDwrrI/DB2NDI0TzcGb3Gpb8HzsS0UlpwELvdyWnVaBQt7Sl2zbJpx1qqFCGGk6 ... -------------------------------

and shows no defects.

This is wrong. RFC5322 section 2.1 is clear that everything up to the first empty line is headers. Even the docstring in the email/parser.py module says "The header block is terminated either by the end of the string or by a blank line."

Since the message is defective, it isn't clear what the correct result should be, but I think

Headers: Content-Disposition: inline; filename="04EBD_xxxx.xxxx_A546BB.zip" Content-Type: application/x-rar-compressed; x-unix-mode=0600; Content-Transfer-Encoding: base64

Body: UmFyIRoHAM+QcwAADQAAAAAAAABKRXQgkC4ApAMAAEAHAAACJLrQXYFUfkgdMwkAIAAAAGEw ZjEwZi5qcwDwrrI/DB2NDI0TzcGb3Gpb8HzsS0UlpwELvdyWnVaBQt7Sl2zbJpx1qqFCGGk6 ...

Defects: name="04EBD_xxxx.xxxx_A546BB.zip"

would be more appropriate. The problem is that the Content-Transfer-Encoding: base64 header is not in the headers so that get_payload(decode=True) doesn't decode the base64 encoded body making malware recognition difficult.

7be85f3c-f464-4458-a888-de7b31a18e30 commented 8 years ago

Added Python 2.7 to versions:

vadmium commented 8 years ago

Also see bpo-24363, basically the same bug in the HTTP parser, which (ab?)uses the email package to do most of the work. In that case, according to my note the faulty header field ends with:

X-Frame-Options: SAMEORIGIN\r\n Set-Cookie: mb-CookieP=; HttpOnly; \r\n Secure\r\n Set-Cookie: mb-CookieP=; HttpOnly; Secure\r\n \r\n

But in this case, perhaps because of the implications of dropping the “Secure” flag, people are asking that the faulty line be appended to the previous header field. IMO I don’t think that is super important though. An alternative would be to add it to the defect list, and then raise an exception or warning if any defects are detected.

vadmium commented 8 years ago

FWIW in the HTTP bug \https://bugs.python.org/issue24363#msg244676\, David said “when seeing a line that doesn't look like a header the error recovery is to treat that line as the beginning of the body (ie: assume the blank line is missing).” I have no experience with email and RFC 5322 header handling, but it does make more sense to me to handle this as a defect in the header section, _not_ a genuine transition to the body (same as desired for the HTTP case).

Here is a patch that revives MalformedHeaderDefect (see bpo-14925), and continues parsing the rest of the header section instead of starting the body. But I am not sure how safe this change is. I did have to fix one unrelated set of tests (see headertest_msg in the Test8BitBytesHandling class) that did not include a blank line and was relying on the old behaviour.

vadmium commented 8 years ago

. It would be nice to get feedback if my patch is sensible, especially from people familiar with normal usage of the “email” module, as opposed with the usage by the HTTP module.

bitdancer commented 8 years ago

I would prefer if we did lookahead to see if the subsequent line looks like a header. It's more complicated to do that, of course, and could still lead to false negatives. However, I think that would probably retain enough backward compatibility to be acceptable. It would also be sensible to make this a policy switch, and as I said elsewhere I'm fine with changing the defaults of the http policy even in 3.5. (The downside of *that* is that I'm sure there are bugs hiding in the new header parsing code, so actually using the http policy to parse http headers will doubtless "allow" us to find some of them.)

Even more complicated, but a better heuristic: look ahead to the next blank line, up to some limit (5 lines?), and if you do find something that looks like a header, also make sure that none of the intermediate lines look like a MIME boundary. That still leaves the question of what to do with a source text that has non-header lines up to the next blank line (this applies to one line lookahead as well). Maybe see if there is more text after the blank line and if so assume the non-header is part of the header, otherwise not?

Regardless, lookahead may be difficult to code. So an alternative that uses your approach, but triggered by a policy setting on http, would be acceptable backward compatibility wise. If we want to we could even make an internal http policy that is compat32 plus this new flag.

vadmium commented 8 years ago

Thanks David. Since I am more intersted in fixing this robustly for HTTP and similar protocols, I might focus on just bpo-24363. Either confine my changes to the existing HTTP (or new) policy and start using that, or just address this from the HTTP package.

python / cpython

email.parser stops parsing headers too soon when given a defective message. #70873