python / cpython

The Python programming language
https://www.python.org
Other
62.72k stars 30.07k forks source link

email parser ignores inner multipart boundary when outer message duplicates it #69914

Open 1c40e693-dd98-4eaa-abc8-88ed58988184 opened 8 years ago

1c40e693-dd98-4eaa-abc8-88ed58988184 commented 8 years ago
BPO 25728
Nosy @warsaw, @bitdancer

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-bug', 'expert-email'] title = 'email parser ignores inner multipart boundary when outer message duplicates it' updated_at = user = 'https://bugs.python.org/forest' ``` bugs.python.org fields: ```python activity = actor = 'r.david.murray' assignee = 'none' closed = False closed_date = None closer = None components = ['email'] creation = creator = 'forest' dependencies = [] files = [] hgrepos = [] issue_num = 25728 keywords = [] message_count = 6.0 messages = ['255309', '255310', '255312', '255313', '255314', '255357'] nosy_count = 3.0 nosy_names = ['barry', 'forest', 'r.david.murray'] pr_nums = [] priority = 'normal' resolution = None stage = 'needs patch' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue25728' versions = ['Python 2.7', 'Python 3.5', 'Python 3.6'] ```

1c40e693-dd98-4eaa-abc8-88ed58988184 commented 8 years ago

When a multipart message erroneously defines a boundary string that conflicts with an inner message's boundary string, the parser ignores the (correct) inner message's boundary, and treats all matching boundary lines as if they belong to the (defective) outer message.

This file from the test_email suite demonstrates the problem: Python-3.5.0/Lib/test/test_email/data/msg_15.txt

Consequentially, the inner multipart/alternative message is parsed with is_multipart() returning False, and a truncated payload.

Moreover, unit tests like test_same_boundary_inner_outer() expect to find the StartBoundaryNotFoundDefect defect on the inner message in that file, which seems wrong to me, since the inner message is not defective. According to the RFCs, the outer message should have been generated with a boundary string that does not appear anywhere in its encoded body (including the inner message). The outer message is therefore the defective one.

1c40e693-dd98-4eaa-abc8-88ed58988184 commented 8 years ago

I thought at first that this might be deliberate behavior in order to comply with RFC 2046 section 5.1.2. https://tools.ietf.org/html/rfc2046#section-5.1.2

After carefully re-reading that section, I see that it is just making sure an outer message's boundary will still be recognized if an inner multipart message is missing its boundary markers (for example if the inner message was truncated). It does not describe any circumstances under which the inner message's boundary markers should be ignored when they are present.

bitdancer commented 8 years ago

Who is to say that the outer message is defective and not the inner one? How can a parser decide which part belongs to which message? It isn't an AI.

The whole message is defective, so all bets are off :) The library can't successfully parse such a message, though it goes to significant pains to make sure it never generates one.

1c40e693-dd98-4eaa-abc8-88ed58988184 commented 8 years ago

RFC 2046 says that the outer message is defective, since it uses a boundary delimiter that is quite obviously present inside one of the encapsulated parts:

https://tools.ietf.org/html/rfc2046#section-5.1

"The boundary delimiter MUST NOT appear inside any of the encapsulated parts, on a line by itself or as the prefix of any line."

1c40e693-dd98-4eaa-abc8-88ed58988184 commented 8 years ago

The library can't successfully parse such a message

It could successfully parse such a message, if it matched against inner message boundaries before outer message boundaries. (One implementation would be to keep a list of all ancestor boundaries and traverse the list from most recent to least recent, but there might be more efficient ways to do it.)

bitdancer commented 8 years ago

I am open to (and will review) a patch that applies simple heuristics to trying to guess correctly about such messages, but only if it doesn't add too much complexity to the parser. I'm not certain I would consider it for a bug fix release, but I'll postpone that decision until I review the patch (the issue is: would it have the potential to break applications that are currently working? I'm guessing not, but I tend to be cautious about such issues.)