email.Header.Header incorrect/non-smart on international charset address fields

1409fbaa-f956-4d99-a567-589cf071a381 commented 12 years ago

BPO	13693
Nosy	@bitdancer

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at =
created_at =
labels = ['invalid', 'type-bug', 'library']
title = 'email.Header.Header incorrect/non-smart on international charset address fields'
updated_at =
user = 'https://bugs.python.org/kxroberto'
```

bugs.python.org fields:

```python
activity =
actor = 'r.david.murray'
assignee = 'none'
closed = True
closed_date =
closer = 'r.david.murray'
components = ['Library (Lib)']
creation =
creator = 'kxroberto'
dependencies = []
files = []
hgrepos = []
issue_num = 13693
keywords = []
message_count = 3.0
messages = ['150434', '150440', '150468']
nosy_count = 2.0
nosy_names = ['kxroberto', 'r.david.murray']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue13693'
versions = ['Python 2.6', 'Python 3.1', 'Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']
```

1409fbaa-f956-4d99-a567-589cf071a381 commented 12 years ago

The email.* package seems to over-encode address fields that contain international characters, which can even cause display errors in the recipient's mail reader, when the message header is composed as recommended in http://docs.python.org/library/email.header.html

Python 2.7.2
>>> e=email.Parser.Parser().parsestr(getcliptext())
>>> e['From']
'=?utf-8?q?Martin_v=2E_L=C3=B6wis?= <report@bugs.python.org>'
# note the par
>>> email.Header.decode_header(_)
[('Martin v. L\xc3\xb6wis', 'utf-8'), ('<report@bugs.python.org>', None)]
# unfortunately there is no comfortable function for this:
>>> u='Martin v. L\xc3\xb6wis'.decode('utf8') + ' <report@bugs.python.org>'
>>> u
u'Martin v. L\xf6wis <report@bugs.python.org>'
>>> msg=email.Message.Message()
>>> msg['From']=u
>>> msg.as_string()
'From: =?utf-8?b?TWFydGluIHYuIEzDtndpcyA8cmVwb3J0QGJ1Z3MucHl0aG9uLm9yZz4=?=\n\n'
>>> msg['From']=str(u)
>>> msg.as_string()
'From: =?utf-8?b?TWFydGluIHYuIEzDtndpcyA8cmVwb3J0QGJ1Z3MucHl0aG9uLm9yZz4=?=\nFrom: Martin v. L\xf6wis <report@bugs.python.org>\n\n'
>>> msg['From']=email.Header.Header(u)
>>> msg.as_string()
'From: =?utf-8?b?TWFydGluIHYuIEzDtndpcyA8cmVwb3J0QGJ1Z3MucHl0aG9uLm9yZz4=?=\nFrom: Martin v. L\xf6wis <report@bugs.python.org>\nFrom: =?utf-8?b?TWFydGluIHYuIEzDtndpcyA8cmVwb3J0QGJ1Z3MucHl0aG9uLm9yZz4=?=\n\n'
>>>

(BTW: it is strange that multiple msg['From']=... _assignments_ end up as multiple additions. Also, msg renders 8-bit header lines without any warning, error, or auto-encoding, while it does auto-encode unicode.)
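For readers hitting the same surprise: the following is a minimal sketch of the append behaviour, written against the Python 3 module name `email.message` (an assumption; the session above uses the Python 2 spelling `email.Message`). The addresses are made-up placeholders. Assignment adds a header, so an existing one has to be deleted or replaced explicitly.

```python
from email.message import Message

msg = Message()
msg['From'] = 'first@example.org'
msg['From'] = 'second@example.org'   # appends a second From header
from_headers = msg.get_all('From')   # both values are present

del msg['From']                      # removes every From header
msg['From'] = 'third@example.org'
# replace_header swaps the value of the first matching header in place
msg.replace_header('From', 'fourth@example.org')
final_headers = msg.get_all('From')
```

With `del` before the assignment, or `replace_header` afterwards, only one From line ends up in `msg.as_string()`.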

What finally arrives at the receiver typically looks like:

From: "=?utf-8?b?TWFydGluIHYuIEzDtndpcyA8cmVwb3J0QGJ1Z3MucHl0aG9uLm9yZz4=?=" <report@bugs.python.org>

because the servers seem to want the address open, they extract the address and _add_ it (duplicating) as ASCII. => error

I have not found any emails in my archives where address header fields are over-encoded the way Python does it. Even in non-address fields, mostly only those words/groups that need it are encoded.

I assume the sophisticated, high-level-looking email.* package doesn't expect the user to piece things together at a low level, with parseaddr, re.search, make_header, Header.encode, '.join ...? Or is that indeed the (undocumented) expectation? IMHO it should be smart enough to do this automatically.
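As an illustration of that low-level route, here is a hedged sketch that encodes only the display name and leaves the address itself in plain ASCII. It uses the Python 3 module names (`email.header`, `email.utils`), which is an assumption relative to the 2.7 session above, and `encode_address` is a hypothetical helper, not part of the email package.

```python
from email.header import Header, decode_header
from email.utils import formataddr, parseaddr


def encode_address(name, addr):
    """Return 'name <addr>', encoding the display name only if needed."""
    try:
        name.encode('ascii')
    except UnicodeEncodeError:
        # Non-ASCII name: emit it as an RFC 2047 encoded word, address in the clear.
        return '%s <%s>' % (Header(name, 'utf-8').encode(), addr)
    # Pure ASCII name: formataddr takes care of any quoting.
    return formataddr((name, addr))


encoded = encode_address('Martin v. L\u00f6wis', 'report@bugs.python.org')
plain = encode_address('Plain Name', 'plain@example.org')

# Round trip: the address stays readable, the name decodes back to unicode.
name_part, addr_part = parseaddr(encoded)
raw, charset = decode_header(name_part)[0]
decoded_name = raw.decode(charset)
```

The point is that the angle-bracketed address never goes through the encoder, so a server does not have to extract and re-add it.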

Note: there is an old deprecated function, mimify.mime_encode_header, which seemed to try to cautiously auto-encode correctly/sparsely (but it actually fails too on all examples tried).

1409fbaa-f956-4d99-a567-589cf071a381 commented 12 years ago

now I tried to render this address field header

u'Name <abc\u03a3@xy>, abc@ewf, "Nameß" <weofij@fjeio>'

with h = email.Header.Header(continuation_ws='') h.append ... / email.Header.make_header via these chunks:

[('Name <', us-ascii), ('abc\xce\xa3', utf-8), ('@xy>, abc@ewf, "', us-ascii), ('Name\xc3\x9f', utf-8), ('" <weofij@fjeio>', us-ascii)]

the outcome is:

'Name < =?utf-8?b?YWJjzqM=?= @xy>, abc@ewf, " =?utf-8?b?TmFtZcOf?=\n " <weofij@fjeio>'

(note: local part of email address can be utf too)

It seems to be impossible to avoid the erroneous extra spaces from outside within that email.Header framework. Thus I guess it was not possible up to now to decently format a beyond-ASCII MIME message using the official email.Header mechanism, even when pre-digesting things.

bitdancer commented 12 years ago

Actually, no, the local part cannot be in anything other than ascii (see RFC 5335, which desires to address this problem among others). Also, an encoded word cannot occur inside quotation marks. If you correct those two bugs, you can generate an RFC-valid address using Header.append.
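A rough sketch of what the corrected chunks could look like, assuming Python 3's `email.header.Header` and a variant of the example above with an ASCII-only local part and no quotation marks around the encoded name. The chunk boundaries are my own choice for illustration; this is one way to do it, not the only one.

```python
from email.header import Header
from email.utils import getaddresses

h = Header()
# ASCII parts go in as us-ascii chunks and are left unencoded.
h.append('Name <abc@xy>, abc@ewf,', 'us-ascii')
# Only the non-ASCII display name becomes an encoded word, outside any quotes.
h.append('Name\u00df', 'utf-8')
h.append('<weofij@fjeio>', 'us-ascii')

encoded = h.encode()

# Sanity check: all three addresses can still be parsed out in the clear.
addresses = [addr for _name, addr in getaddresses([encoded])]
```

Header inserts the separating whitespace between chunks of different charsets itself, which is harmless here because the encoded word is a standalone display name rather than a fragment of an address or a quoted string.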

There is a project underway to make all of this header parsing and formatting stuff work better: see http://pypi.python.org/pypi/email.

By the way, this is easier already in python 3.2. There you can do:

   >>> formataddr(('Nameß', 'weofij@fjeio'))
   '=?utf-8?b?TmFtZcOf?= <weofij@fjeio>'
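For a whole address list, the same function can be combined with `email.utils.getaddresses`. The sketch below assumes Python 3.2 or later and an ASCII-only local part; `format_address_list` is a hypothetical helper name, not something the email package provides.

```python
from email.utils import formataddr, getaddresses


def format_address_list(value):
    """Split a unicode address list and re-emit it with encoded display names."""
    pairs = getaddresses([value])      # [(display name, address), ...]
    # formataddr encodes a non-ASCII display name and leaves the address as-is.
    return ', '.join(formataddr(pair) for pair in pairs)


header_value = format_address_list(
    'Name <abc@xy>, abc@ewf, "Name\u00df" <weofij@fjeio>')
```

Only the one display name that needs it is encoded; the ASCII names and all addresses come out unchanged.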


email.Header.Header incorrect/non-smart on international charset address fields #57902