python / cpython

The Python programming language
https://www.python.org/
Other
61.35k stars 29.55k forks source link

email: set_content() always assumes trailing EOL #121515

Open rapidcow opened 2 weeks ago

rapidcow commented 2 weeks ago

Bug report

Bug description:

The set_content() handler for Unicode strings does not differentiate between strings with and without a trailing EOL. As of Python v3.12.3 (and v3.12.4), these two calls are equivalent:

# Note: the following code blocks are meant to be run in REPL
# (I left out the >>> prompts to make copying easier)
from email.message import *; msg = EmailMessage()
msg.set_content('resumé', charset='latin-1', cte='quoted-printable')
msg.get_content() # 'resumé\n'

msg.set_content('resumé\n', charset='latin-1', cte='quoted-printable')
msg.get_content() # 'resumé\n'

...whereas set_payload() differentiates the two:

# Note: The legacy 'set_charset()' method encodes
# latin-1 charset with QP encoding by default
msg.clear_content()
msg.set_payload('resumé', charset='latin-1')
msg.get_content() # 'resumé'

msg.clear_content()
msg.set_payload('resumé\n', charset='latin-1')
msg.get_content() # 'resumé\n'

I thought that this might not be intended, considering how, in this particular case where CTE is quoted-printable, quoprimime.body_encode() (what set_payload() ultimately calls to perform QP encoding) conditionally appends that last '\n' depending on whether the original string ended with CR/LF. Meanwhile, the str set handler called by set_content() appends the '\n' (or linesep in the case of base64) unconditionally.

I was a bit hesitant to submit this as a bug because a handful of unit tests depend on this implicit EOL; some of these tests I found were:

However I can't think of a good reason why implicit EOL should be assumed... and IMHO unless there is a reason why users should be discouraged from setting content with no EOL, I think the current implementation deserves a reconsideration. :)

CPython versions tested on:

3.12

Operating systems tested on:

macOS

ZeroIntensity commented 2 weeks ago

I think this is more "odd behavior" than a bug. Anyway, it would be a breaking change to fix this. It's probably better to just update the documentation. FWIW, get_payload is a legacy method, we shouldn't try and fix things on it.

Eclips4 commented 2 weeks ago

I agreed with @ZeroIntensity. The best thing we can do here is to document the behaviour of trailing newline in docs.

rapidcow commented 2 weeks ago

Thank you both for the review! :D

FWIW, get_payload is a legacy method, we shouldn't try and fix things on it.

Tiny correction: I was specifically comparing the behavior of set_content() to set_payload(), not get_payload(). But otherwise yes -- *_payload() methods are legacy methods and I understand we should not change those.

Still there seems to be a tiny misunderstanding here... I think the misbehaving method is set_content(), not set_payload(), because it fails to account for the absence of trailing newline. Whether that would be considered breaking change is beyond my expertise, but just to clarify I am not proposing to change the legacy interface.

ZeroIntensity commented 2 weeks ago

This is fixed by #121543