EmailMessage mis-folding headers of a certain length

python / cpython

The Python programming language

https://www.python.org

Other

62.91k stars 30.13k forks source link

EmailMessage mis-folding headers of a certain length #87659

Open be898fac-47e8-415b-972c-2889a3ded25b opened 3 years ago

be898fac-47e8-415b-972c-2889a3ded25b commented 3 years ago

BPO	43493
Nosy	@warsaw, @bitdancer, @akulakov, @mglover
Files	header_misfolding.py: test code showing incorrect behavior foldfix.py

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-bug', '3.8', 'expert-email'] title = 'EmailMessage mis-folding headers of a certain length' updated_at = user = 'https://github.com/mglover' ``` bugs.python.org fields: ```python activity = actor = 'r.david.murray' assignee = 'none' closed = False closed_date = None closer = None components = ['email'] creation = creator = 'mglover' dependencies = [] files = ['49875', '49889'] hgrepos = [] issue_num = 43493 keywords = [] message_count = 5.0 messages = ['388687', '388989', '389014', '396884', '397049'] nosy_count = 4.0 nosy_names = ['barry', 'r.david.murray', 'andrei.avk', 'mglover'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue43493' versions = ['Python 3.8'] ```

be898fac-47e8-415b-972c-2889a3ded25b commented 3 years ago

The attached file demonstrates the incorrect folding behavior I'm seeing. Header lines of a certain total length get folded after the colon following the header name, which is not valid RFC. Slightly longer or shorter lines are folded correctly.

Interestingly, the test file produces correct output on 3.5.2

$ python --version
Python 3.8.5

$ sudo apt install python3
...
python3 is already the newest version (3.8.2-0ubuntu2).

(yes, that difference has me scratching my head)

And yes, I realize this is not the latest release of the 3.8 branch, but it *is the latest available through apt on Ubuntu 20.04 LTS, and a search of the issue tracker and the release notes for all of 3.8. turned up nothing applicable.

be898fac-47e8-415b-972c-2889a3ded25b commented 3 years ago

Further research shows that email.parser.Parser is not handling the affected lines correctly -- the leading '\n ' is not being stripped from the header value.

Attached is the (ugly, worksforme) function I'm using to workaround this problem

bitdancer commented 3 years ago

Parsing and newlines have nothing to do with this bug, actually. I don't think your foldfix post-processing is going to do what you want in the general case.

The source of the bug here is in the folding algorithm in _header_value_parser. It has checks to see if the "text so far" will fit within the header width, and it starts a new line under vafious conditions. For example, if there is a single word after Subject: whose length is, say, 70, it would produce the effect you show, because the single word would fit without folding or encoding on a new line. I don't think this violates the RFC. What your example shows makes it look like the folder is treating all of the text as if it were a single word, which is obviously wrong. It is supposed to break at spaces. You will note that if you increase the repeat count in your example to 16 it folds the line correctly. So the bug has something to do with the total text so far accumulated for the line being right in that window where it won't fit on the first line but does fit on a line by itself. This is obviously a bug in the folder, since it should be splitting that text if it isn't a single word, not moving it to a new line as a whole.

Note that this bug is still present on master.

akulakov commented 3 years ago

I've looked into this and it seems to be somewhat intentional, as can be seen in this test case for example:

test_headerregistry.py", line 1725, in test_fold_address_list + To: "Theodore H. Perfect" \yes@man.com\, + "My address is very long because my name is long" \foo@bar.com\, + "Only A. Friend" \no@yes.com\

Relevant code is here: https://github.com/python/cpython/blob/main/Lib/email/_header_value_parser.py#L2829-L2849

The logic goes like this: tstr = header value

try to add it to current line, if fits; continue to start of loop
try to add it to next line, if fits; continue to start of loop
split tstr; continue to start of loop

So as you can see from test case, if split happened before step 2, the name would be split over 2 lines which is not ideal.

I tested splitting before step 2, which fixed this bug but failed 11 test cases, all of which deal with email address folding. (To and From headers).

So, is this actually an issue worth fixing?

If the answer is yes, one option would be to special case Subject header and split it before going to step 2.

IOW, the block that starts here: https://github.com/python/cpython/blob/4bcef2bb48b3fd82011a89c1c716421b789f1442/Lib/email/_header_value_parser.py#L2835

would need to be moved after the next block that starts 6 lines below.

bitdancer commented 3 years ago

Ah, yes, the problem is more subtle than I thought.

The design here is that we should be starting with the largest lexical unit, seeing if that fits on the current line, or a line by itself, and if so, using that, and if not, move down to the next smaller lexical unit and try again, until we are finally left with an unbreakable unit. For unstructured headers such as Subject the lexical units should be encoded words followed by blank delimited words. I'm guessing the code is treating the collection of words it has accumulated as a unit in the above algorithm, and since it fits on a line by itself, it goes with that. So yeah, it's sort of intentional.

So the bug here is that in your step 2 we ideally want to be considering whether the last token on the current line is at the same lexical level as the token that precedes it...and if so, and if moving that token to the next line lets the remainder fit on the first line, we should do that. Exactly how to implement that correctly is a good question...it's been too long since I wrote that code, and I may not have time to investigate it more deeply.

If you come up with something based on my description of the intent above, I should be able to review it (though you might need to ping me directly to get my attention).