mailgun / flanker

Python email address and Mime parsing library
http://www.mailgun.com
Apache License 2.0
1.63k stars 204 forks source link

MimePart.to_string() can produce different output depending on MimePart.was_changed() #194

Open mhahnenberg opened 6 years ago

mhahnenberg commented 6 years ago

RFC1341 states: "The Quoted-Printable encoding REQUIRES that encoded lines be no more than 76 characters long."

If we pass a long 7bit text part with simple '\n' in it, it will be encoded as quoted-printable which leads to invalid output. This is due to two things:

Relevant code is here: https://github.com/mailgun/flanker/blob/master/flanker/mime/message/part.py#L657

mhahnenberg commented 6 years ago

I can make these changes if all of this makes sense, I just wanted to discuss it first and make sure I'm understanding everything correctly.

horkhe commented 6 years ago

All is reasonable. I would prefer the soft line breaks option. @b0d0nne11 what do you think?

mhahnenberg commented 6 years ago

After some more investigation it appears quopri.encodestring does the correct thing with line lengths > 76, so that removes the need to handle the 2nd bullet. I assumed that it didn't bother with line length-related things because the documentation for the quopri module didn't explicitly mention it.

That leaves just the first bullet, which is a much easier change 😄

mhahnenberg commented 6 years ago

Okay, so I did even more investigation and I think the original problem I described doesn't exist. I think the actual issue is that when we first scan a MIME string the resulting MimePart(s) return False for was_changed(), so when we call message.to_string() we just return the string as it was scanned in without re-encoding it. This avoids all the logic of detecting which CTE should be used. The following test demonstrates this:

def text_long_line_quoted_printable():
    # TEXT_LONG_LINES is a string containing a MIME message with lines 
    # longer than allowed for quoted-printable.
    message = scan(TEXT_LONG_LINES)

    # Without the following lines this test will fail.
    # message._container._body_changed = True
    # assert message.was_changed(ignore_prepends=True)
    print(message.to_string())
    eq_('quoted-printable', message.content_encoding.value)
    eq_('ascii', message.content_type.get_charset())

Is this the desired behavior?

horkhe commented 6 years ago

@mhahnenberg the idea is that if we do not need to make changes to the mime the we return it as is. This is intentional. Originally we did encode message in to_string even if it was not changed, but some customers complained (do not remember why though) and we introduced this is_changed behavior.

@anton-efimenko may be you remember what why customers complained?

mhahnenberg commented 6 years ago

You're going to laugh at me, and I'm really sorry for continually changing what this issue is about, but this has been a very tricky bug for me to figure out 😞 However, I think I might have it now (hopefully).

flanker hardcodes line endings as CRLF by default. To encode messages as quoted-printable flanker uses quopri which tries to automatically detect which line ending to use for soft line breaks based on its input (see https://bugs.python.org/issue20121). quopri must generate soft line breaks for long lines to obey the 76 character limit imposed by the quoted-printable standard.

By default quopri assumes LF unless it runs into a CRLF first. flanker encodes each part's body in isolation, so if a part's body contains only LF line endings (or no line endings at all) quopri will assume that it should use =\n for soft line breaks when in reality we always want =\r\n since CRLF is hardcoded in flanker and it is unwise to mix line endings.

I think a quick fix for this would be for flanker to trick quopri by prepending a CRLF to the part's body and then removing it after the encoding has been performed.