smtplib: support for UTF-8 encoded headers (SMTPUTF8)

smtplib: support for UTF-8 encoded headers (SMTPUTF8) #64283

Closed 3cc135ce-cc5d-4cf3-ab8e-9f81d7c17460 closed 10 years ago

3cc135ce-cc5d-4cf3-ab8e-9f81d7c17460 commented 10 years ago

BPO	20084
Nosy	@warsaw, @macfreek, @bitdancer
Superseder	bpo-8489: Support RFC 6531 in smptlib

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['type-feature', 'library', 'expert-email']
title = 'smtplib: support for UTF-8 encoded headers (SMTPUTF8)'
updated_at =
user = 'https://github.com/macfreek'
```

bugs.python.org fields:
```python
activity =
actor = 'r.david.murray'
assignee = 'none'
closed = True
closed_date =
closer = 'r.david.murray'
components = ['Library (Lib)', 'email']
creation =
creator = 'macfreek'
dependencies = []
files = []
hgrepos = []
issue_num = 20084
keywords = []
message_count = 7.0
messages = ['207018', '207040', '207052', '207062', '207063', '207071', '207073']
nosy_count = 3.0
nosy_names = ['barry', 'macfreek', 'r.david.murray']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = 'resolved'
status = 'closed'
superseder = '8489'
type = 'enhancement'
url = 'https://bugs.python.org/issue20084'
versions = ['Python 3.5']
```

3cc135ce-cc5d-4cf3-ab8e-9f81d7c17460 commented 10 years ago

smtplib has no support for non-ASCII user names in the From or To mail address.

The following two calls fail:

server.rcpt(u"όνομα@example.com"):

    File smtplib.py, line 332, in send
      s = s.encode("ascii")
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

http://hg.python.org/cpython/file/3.3/Lib/smtplib.py#l332

server.rcpt(b'\xcf\x8c\xce\xbd\xce\xbf\xce\xbc\xce\xb1@example.com'):

    File email/_parseaddr.py, line 236, in gotonext
      if self.field[self.pos] in self.LWS + '\n\r':
    TypeError: 'in <string>' requires string as left operand, not int

http://hg.python.org/cpython/file/3.3/Lib/email/_parseaddr.py#l236
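Both failures can be reproduced without a server. The snippet below is only a minimal illustration, not smtplib code: it performs the same ASCII encode that `send()` attempts on a string, and the same kind of membership test that `_parseaddr.py` applies when it is handed bytes.

```python
# Build the address from the UTF-8 bytes quoted in the second call.
raw = b'\xcf\x8c\xce\xbd\xce\xbf\xce\xbc\xce\xb1@example.com'
address = raw.decode("utf-8")

# Case 1: a str address cannot be encoded as ASCII.
try:
    address.encode("ascii")
except UnicodeEncodeError as exc:
    str_error = exc

# Case 2: indexing bytes yields an int, which cannot be tested
# for membership in a str of whitespace characters.
try:
    raw[0] in " \t\n\r"
except TypeError as exc:
    bytes_error = exc

print(type(str_error).__name__, str_error.start, str_error.end)
print(type(bytes_error).__name__)
```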

There are two ways to solve this:

1. Allow users of smtplib to support internationalised email by passing already encoded headers and email addresses. The user is responsible for the encoding and for setting the SMTPUTF8 ESMTP option.
2. Accept Unicode (str) email addresses and convert them to UTF-8 in the library. smtplib is responsible for the encoding and for setting the SMTPUTF8 ESMTP option.

References: https://tools.ietf.org/html/rfc6531: SMTP Extension for Internationalized Email

See also bpo-20083, which deals with international domain names in email addresses (the part behind the "@"). This issue deals with the part before the "@".

Note that this is different from RFC 2047, which merely allows non-ASCII encoding in text values in the headers (such as the name of a recipient or the mail subject).

bitdancer commented 10 years ago

Duplicate of bpo-8489.

3cc135ce-cc5d-4cf3-ab8e-9f81d7c17460 commented 10 years ago

Are you sure that bpo-8489 is a duplicate? While both concern RFC 6531, the patch for 8489 only seems to add a test to check how smtplib.SMTP.login() handles a username with non-ASCII characters. This issue concerns smtplib.SMTP.rcpt() (and indirectly smtplib.SMTP.send()).

From your comment in bpo-20083 you seem to prefer that all input is in strings, not bytes. I think that is sensible, but it means that smtplib is responsible for doing the encoding, including the UTF-8 encoding instead of ASCII encoding for mails that support the SMTPUTF8 extension.

Would the following be reasonable?

- The smtplib.SMTP class gets a new attribute, header_encoding.
- The header_encoding attribute is 'ascii' by default.
- header_encoding is used by the send() method, and perhaps also by the login() method, but not by the data() method (for that, a body_encoding sounds more reasonable).
- A user may set header_encoding explicitly.

Open questions are:

- Should the library automatically set header_encoding to UTF-8? If so, when? If the connected server announces the SMTPUTF8 extension?
- What should happen if the user submits non-ASCII data in any of the headers, but the server has not announced the SMTPUTF8 extension? Currently, this raises a UnicodeEncodeError exception, but I think the error should state explicitly that the cause is Unicode input combined with lack of support from the MTA.
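To make the proposal concrete, here is a rough sketch of the encoding step. Everything in it is hypothetical: `encode_command`, `SMTPUTF8RequiredError` and the `server_extensions` set are illustration names, not existing smtplib API. It assumes the "automatic" answer to the first open question (use UTF-8 only when the server announced SMTPUTF8) and shows one way the error could be made more explicit.

```python
class SMTPUTF8RequiredError(ValueError):
    """Hypothetical error: non-ASCII input, but UTF-8 is not in use."""


def encode_command(command, server_extensions, header_encoding=None):
    """Encode an SMTP command line (hypothetical helper).

    server_extensions: lowercase extension names announced by the server.
    header_encoding: explicit override; None means "decide automatically".
    """
    if header_encoding is None:
        supports_utf8 = "smtputf8" in server_extensions
        header_encoding = "utf-8" if supports_utf8 else "ascii"
    try:
        return command.encode(header_encoding)
    except UnicodeEncodeError as exc:
        # Say why the encode failed instead of surfacing a bare
        # UnicodeEncodeError.
        raise SMTPUTF8RequiredError(
            f"{command!r} contains non-ASCII characters, but the "
            f"encoding in use is {header_encoding!r}; the server may "
            "not have announced SMTPUTF8"
        ) from exc
```

With this shape, an all-ASCII command behaves exactly as before, and the new behaviour only appears when non-ASCII input meets a server without the extension.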

bitdancer commented 10 years ago

Hmm. Perhaps it would be better to close that one as a duplicate of this one, since this one doesn't start out as an error report that then got converted into an enhancement request...

The patch on that issue doesn't have anything to do with what the issue turned into, which is indeed a bit confusing.

I haven't given much thought to *how* to implement this support. Depending on utf8smtp capability being present seems the best course. If support isn't available, then the library should continue to raise an error, as it does now, but indeed the message could be improved.

In real life, of course, we want our message to get delivered regardless of whether or not smtputf8 is available. To make that work, it will be advisable for the application to use the send_message interface rather than the sendmail interface: pass in a Message object, which can be automatically serialized as utf8 if smtputf8 is available, and the normal CTE encoding dances if not. This of course will require support from the email package, specifically a 'utf8 only' serialization mode.
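For readers on later Python versions: to my knowledge the email package did gain such a mode (the `utf8` policy setting and the ready-made `email.policy.SMTPUTF8`, from Python 3.5). The sketch below assumes that API and shows the branching described above for a single header; `serialize` and its `server_supports_smtputf8` flag are illustration names only.

```python
from email import policy
from email.message import EmailMessage


def serialize(msg, server_supports_smtputf8):
    """Serialize msg as raw UTF-8 headers or as RFC 2047 encoded words."""
    chosen = policy.SMTPUTF8 if server_supports_smtputf8 else policy.SMTP
    return msg.as_bytes(policy=chosen)


msg = EmailMessage()
msg["Subject"] = "Grüße"
msg.set_content("Hello")

utf8_form = serialize(msg, True)    # header bytes are plain UTF-8
ascii_form = serialize(msg, False)  # header is an RFC 2047 encoded word
```

The same Message object yields both wire formats, which is what lets the sending code decide late, after it has seen the server's EHLO response.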

If one is using smtplib only, then the application is responsible both for checking for the smtputf8 capability and branching accordingly, and for getting all the data correct...when I said "string only" I was referring to the methods in question (RCPT, etc). DATA is a different story, and that has to handle both ascii-only strings or properly encoded (per the email RFCs) binary. Automatic encoding of non-ascii string DATA is dangerous, and would only work if the input is correctly formatted for the utf8 charset throughout. Personally I'd rather use the email package to ensure that...so if an application wants to bypass the email package, I think requiring it to manually encode the DATA string into utf8 is an acceptable interface requirement, to make it *clear* that there is no way to automatically encode an arbitrary email message (other than by using the email package).
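As a sketch of the smtplib-only route, the application-side branching could look like the following. It assumes a Python version in which `sendmail()` accepts `SMTPUTF8` in `mail_options` (3.5 and later, as far as I know); `send_raw` and its two pre-built payload arguments are hypothetical names, and the caller is assumed to have prepared both payloads correctly.

```python
import smtplib


def send_raw(server, sender, recipient, utf8_payload, ascii_payload):
    """Send pre-encoded bytes, branching on the SMTPUTF8 capability.

    server: a connected smtplib.SMTP instance.
    utf8_payload: message bytes with UTF-8 headers.
    ascii_payload: message bytes with RFC 2047 encoded headers.
    """
    server.ehlo_or_helo_if_needed()
    if server.has_extn("smtputf8"):
        return server.sendmail(
            sender, [recipient], utf8_payload,
            mail_options=["SMTPUTF8", "BODY=8BITMIME"],
        )
    # Without SMTPUTF8 the envelope addresses must be ASCII as well;
    # smtplib raises if they are not.
    return server.sendmail(sender, [recipient], ascii_payload)
```

Note that the function never encodes the payload itself, which matches the point above: the caller stays responsible for producing correctly formatted bytes.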

These are just preliminary thoughts...there is probably more design work to be done before this can be implemented.

bitdancer commented 10 years ago

There is another possible approach, but I haven't decided yet whether or not I like it. The email package string parser could (and may for other reasons) become smart enough to convert unicode into the charset declared in the MIME part when it is parsing the string version of a message. In that case, smtplib could use the string parser to parse the DATA payload, and if it parses successfully it can then use the same code path as I'm proposing for send_message to generate the right output depending on whether or not the smtputf8 capability is present. That would place a new constraint on what was acceptable as a DATA payload, though, so I'm not at all sure it is a good idea.

3cc135ce-cc5d-4cf3-ab8e-9f81d7c17460 commented 10 years ago

> we want our message to get delivered regardless of whether or not smtputf8 is available.

This is not possible if the user specifies a (sender or recipient) email address with non-ASCII characters and the first-hop mail system does not support SMTPUTF8. Section 8 of RFC 6530 seems to suggest that in that case an all-ASCII email address should be used, and if none is available, the mail should bounce. In my interpretation smtplib should fail by raising an exception.

> [...] a Message object, which can be automatically serialised as utf8 if smtputf8 is available [...]

I hadn't given the mail body much thought. I think that this is covered by the existing 8BITMIME extension, in which case the client can add the header 'Content-Type: text/plain; charset="utf-8"'. From what I understand SMTPUTF8 only concerns the encoding of the header. I prefer that this particular issue (enhancement request) only concerns the mail headers, not the mail body. (I see that you also have some ideas on this, perhaps this is for a different issue?)

PS: I planned to use smtplib to see if I could understand the standard for international email addresses. Turns out I'm now reading the standard to see how smtplib should work. Also nice, but not what I had intended to do. :) It seems that SMTPUTF8 is not yet widely implemented. I've learned that my production MTA does not support it.

bitdancer commented 10 years ago

Yeah, I've been doing a lot of reading of standards while trying to hide all the messy details from users of the new API I've added to the email package. I haven't gotten to smtplib yet :)

But, this stuff is messy. If you want to understand a standard, you really have to read it, and lots of other standards besides, and then look at what various packages have chosen to implement, and figure out all the ways you think they did it wrong :) As you have observed, implementations of SMTPUTF8 are scarce on the ground so far.

SMTPUTF8 may be about headers, but because the natural way of representing non-ascii headers in Python is as a (unicode) string, and SEND takes a single string (or bytes) argument, you can't separate dealing with the encoding of the headers from dealing with the encoding of the body unless you *parse* the payload as an email message so you can do the right thing with the body. Thus you can't address adding SMTPUTF8 to smtplib without figuring out the API for the whole message, not just the headers.

So yes, the client can 'add Content-Type: text/plain; charset="utf-8"', but the process of doing that is exactly what I was talking about :)

Now, one option, as I said, is to put the burden on the application: it can check to see if SMTPUTF8 is available, and if so provide a DATA formatted with utf8 headers and charset="utf-8" bodies, and if it is not available, provide a DATA formatted with RFC2047 headers and charset="utf-8" bodies. But I'd rather make smtplib (with the help of the email package) do the hard work, rather than have every application have to do it.

Still, we could start with a patch that just makes it possible for an application to do it itself. That would just need to accept non-ascii in the RCPT etc commands, pass it through as utf8 if SMTPUTF8 is available, and raise an error otherwise.

You are correct that the more convenient API I'm talking about also needs to be enhanced to provide a way to specify the alternate ASCII-only address. I'd forgotten about that detail. That's going to be very annoying from a clean-API point of view :(

And yes, it should raise an exception if SMTPUTF8 is not available and no ascii address was provided.
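A sketch of that rule, purely for illustration: `pick_envelope_address` and its parameters are hypothetical names, not an existing or proposed smtplib signature.

```python
def pick_envelope_address(address, ascii_fallback=None, smtputf8=False):
    """Choose the address to put in MAIL FROM / RCPT TO (hypothetical).

    address: the preferred, possibly non-ASCII, address.
    ascii_fallback: optional all-ASCII alternative for the same mailbox.
    smtputf8: whether the server announced the SMTPUTF8 extension.
    """
    if address.isascii() or smtputf8:
        return address
    if ascii_fallback is not None and ascii_fallback.isascii():
        return ascii_fallback
    raise ValueError(
        f"{address!r} is not ASCII, the server does not support "
        "SMTPUTF8, and no ASCII-only alternative was provided"
    )
```

The awkward part for the API is visible in the signature: every place that takes an address would also need a way to carry its ASCII alternative.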