email.Utils.encode doesn't obey rfc2047

573f9d6f-0368-4a27-973f-b24ecda75bbb commented 22 years ago

BPO	552957
Nosy	@warsaw

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

```python
assignee = 'https://github.com/warsaw'
closed_at =
created_at =
labels = ['library']
title = "email.Utils.encode doesn't obey rfc2047"
updated_at =
user = 'https://bugs.python.org/tsarna'
```

bugs.python.org fields:

```python
activity =
actor = 'barry'
assignee = 'barry'
closed = True
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'tsarna'
dependencies = []
files = []
hgrepos = []
issue_num = 552957
keywords = []
message_count = 2.0
messages = ['10662', '10663']
nosy_count = 2.0
nosy_names = ['barry', 'tsarna']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue552957'
versions = []
```

The email.Utils.encode function has two somewhat related bugs: it fails to deal with long input strings in two different ways.

First, newlines are not allowed in the middle of rfc2047 encoded-words (per section 2: "[...] white space characters MUST NOT appear between components of an 'encoded-word'"). The _bencode and _qencode routines that the encode function uses include newlines (or "=\n"'s for quopri) in their output, and the encode function doesn't remove them. Try encoding a long string with 'q', for example. The resulting output will contain one or more "=\n"'s, and the email.Utils.decode function will not be able to parse it.
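The wrapping is easy to reproduce with the present-day standard library. This is a minimal Python 3 sketch; it assumes that `quopri.encodestring` and `base64.encodebytes` behave like the `_qencode` and `_bencode` helpers discussed here, which has not been checked against the 2002 source.

```python
import base64
import quopri

# 80 high-bit bytes: long enough that both encoders have to wrap their output.
raw = b"\xe9" * 80

q_out = quopri.encodestring(raw)   # quoted-printable, body-style line wrapping
b_out = base64.encodebytes(raw)    # base64, also wrapped into lines

# Soft line breaks ("=\n") appear inside the quoted-printable text.
print(b"=\n" in q_out)
# A newline appears inside the base64 text, not only at its end.
print(b"\n" in b_out[:-1])
```

Pasting either result between `=?charset?q?` and `?=` unchanged would put whitespace inside the encoded-word.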

Patch:

*** Utils.py.orig	Mon May  6 13:17:05 2002
--- Utils.py	Mon May  6 13:18:16 2002
***************
*** 98,106 ****
      """Encode a string according to RFC 2047."""
      encoding = encoding.lower()
      if encoding == 'q':
!         estr = _qencode(s)
      elif encoding == 'b':
!         estr = _bencode(s)
      else:
          raise ValueError, 'Illegal encoding code: ' + encoding
      return '=?%s?%s?%s?=' % (charset.lower(), encoding, estr)
--- 98,106 ----
      """Encode a string according to RFC 2047."""
      encoding = encoding.lower()
      if encoding == 'q':
!         estr = _qencode(s).replace('=\n','')
      elif encoding == 'b':
!         estr = _bencode(s).replace('\n','')
      else:
          raise ValueError, 'Illegal encoding code: ' + encoding
      return '=?%s?%s?%s?=' % (charset.lower(), encoding, estr)

NOTE: The .replace()-ing should NOT be done in _bencode and _qencode, because they're used in other places where their current behaviour is fine/expected.
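For readers on Python 3, the same idea can be sketched roughly as follows. The name `encode_word` is borrowed from the proposal in this report; using `quopri` and `base64` in place of `_qencode` and `_bencode` is an assumption. The sketch also does not escape a literal `?` in the input, which a complete 'q' encoder would have to do.

```python
import base64
import quopri


def encode_word(s, charset="iso-8859-1", encoding="q"):
    """Encode the bytes `s` as one RFC 2047 encoded-word (illustrative sketch)."""
    encoding = encoding.lower()
    if encoding == "q":
        # header=True maps spaces to "_". The soft line breaks are stripped
        # here, in the caller, so the shared encoder keeps its behaviour.
        estr = quopri.encodestring(s, header=True).replace(b"=\n", b"")
    elif encoding == "b":
        estr = base64.encodebytes(s).replace(b"\n", b"")
    else:
        raise ValueError("Illegal encoding code: " + encoding)
    return "=?%s?%s?%s?=" % (charset.lower(), encoding, estr.decode("ascii"))


raw = b"\xe9" * 40
words = {code: encode_word(raw, encoding=code) for code in ("q", "b")}
for code, word in words.items():
    print(code, len(word), "\n" in word)
```

Note that the 'q' result here is still far longer than 75 characters; that is the second problem described below.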

Second problem: rfc2047 specifies that an encoded-word may be no longer than 75 characters (see section 2). Also, in the case of, say, a From: header with high-bit characters in the sender's name, you really want to encode only the name, not the whole line, so that dumb mail programs are able to recognize the email address in the line without having to understand rfc2047.
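To illustrate the From: header point with the current Python 3 standard library (not the `email.Utils` module of 2002): `email.utils.formataddr` encodes only the display name and leaves the address untouched. The name and address below are made up for the example.

```python
from email.utils import formataddr, parseaddr

# Hypothetical sender with high-bit characters in the display name.
header_value = formataddr(("J\u00f6rg M\u00fcller", "joerg@example.com"))
print(header_value)

# The address can be recovered without decoding the encoded-word at all.
name, addr = parseaddr(header_value)
print(addr)
```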

Proposed solution: rename the existing encode function (with the above patch applied) to encode_word. Add a new encode function that splits the input string into a list of words and whitespace runs. Words are encoded individually using encode_word() iff they are not pure ascii. The results are then concatenated back with the original whitespace.

This still leaves the possibility that a single word, when encoded, is longer than 75 characters. The recommended practice in rfc2047 is to use multiple encoded-words separated by CRLF SPACE (or, in our case, "\n ").

Here is code that implements the above:

import re

wsplit = re.compile('([ \n\t]+)').split

def encode(s, charset='iso-8859-1', encoding='q'):
    i = wsplit(s)
    o = []

    # max encoded-word length per rfc2047 section 2 is 75
    # 75 - len("=?" + "?" + "?" + "?=") == 69
    max_enc_text = 69 - len(charset) - len(encoding)
    if encoding == 'q':
        # 3 bytes per character worst case
        safe_wlen = max_enc_text / 3
    elif encoding == 'b':
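        # base64 emits 4 characters per 3 input bytes, padded to a multiple
        # of 4, so round down to whole 3-byte groups
        safe_wlen = (max_enc_text / 4) * 3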
    else:
        safe_wlen = max_enc_text # ?

    for w in i:
        # the split yields '' at the edges when s starts or ends with whitespace
        if not w or w[0] in " \n\t":
            o.append(w)
        else:
            try:
                o.append(w.encode('ascii'))
            except:
                ew = encode_word(w, charset, encoding)
                while len(ew) > 75:
                    o.append(encode_word(w[:safe_wlen],charset,encoding)+"\n ")
                    w = w[safe_wlen:]
                    ew = encode_word(w, charset, encoding)
                o.append(ew)

    return ''.join(o)
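For anyone who wants to try the idea on Python 3, here is a rough, self-contained rendering of the proposal. It is a sketch under assumptions, not the code from the report: `quopri` and `base64` stand in for `_qencode` and `_bencode`, the input is a `str` that gets encoded to the target charset, and the charset is assumed to be single-byte (cutting a multi-byte charset such as UTF-8 at an arbitrary byte offset could split a character). A literal `?` inside a non-ASCII word is not escaped.

```python
import base64
import quopri
import re

MAX_WORD = 75  # RFC 2047 section 2 limit for one encoded-word
wsplit = re.compile(r"([ \n\t]+)").split


def encode_word(raw, charset, encoding):
    """Encode bytes as a single encoded-word with no embedded newlines."""
    if encoding == "q":
        text = quopri.encodestring(raw, header=True).replace(b"=\n", b"")
    elif encoding == "b":
        text = base64.b64encode(raw)
    else:
        raise ValueError("Illegal encoding code: " + encoding)
    return "=?%s?%s?%s?=" % (charset, encoding, text.decode("ascii"))


def encode(s, charset="iso-8859-1", encoding="q"):
    charset, encoding = charset.lower(), encoding.lower()
    # 75 minus the "=?", "?", "?", "?=" delimiters, the charset and the encoding
    budget = MAX_WORD - 6 - len(charset) - len(encoding)
    if encoding == "q":
        chunk = budget // 3        # worst case: 3 characters per byte
    else:
        chunk = (budget // 4) * 3  # base64: 4 characters per 3 bytes
    out = []
    for w in wsplit(s):
        if w.isascii():            # whitespace runs, '' and plain ASCII words
            out.append(w)
            continue
        raw = w.encode(charset)
        ew = encode_word(raw, charset, encoding)
        while len(ew) > MAX_WORD:
            # emit one worst-case-sized piece, then continue on the next line
            out.append(encode_word(raw[:chunk], charset, encoding) + "\n ")
            raw = raw[chunk:]
            ew = encode_word(raw, charset, encoding)
        out.append(ew)
    return "".join(out)


print(encode("From caf\u00e9 " + "\u00e9" * 60 + " with love"))
```

As in the report's version, a word is only cut when its encoded form exceeds 75 characters, and each piece is sized for the worst case, so pieces of mostly-ASCII words may come out shorter than strictly necessary.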

warsaw commented 22 years ago

Ty, is it worth patching up email.Utils.encode() given its deprecation and the existence of the Header class? I tend to think not (there should be only one way to do it).

Is Header vulnerable to the same problems? If so, please submit a new bug report with a test case. Please also attach diffs and patches as attachments instead of in the bug report because otherwise SF will mess up the indentation.

BTW, you might want to check Python 2.3's cvs since there have been a lot of updates lately.

Thanks, I'm closing this one.

email.Utils.encode doesn't obey rfc2047 #36564