python / cpython

The Python programming language
https://www.python.org

Message from BytesParser cannot be flattened immediately #88860

Open 8bb1ccc7-7bd7-426d-82e1-439cc0687b19 opened 3 years ago

8bb1ccc7-7bd7-426d-82e1-439cc0687b19 commented 3 years ago
BPO 44694
Nosy @warsaw, @bitdancer
Files
  • 0.msg
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['type-bug', 'expert-email', '3.9']
    title = 'Message from BytesParser cannot be flattened immediately'
    updated_at =
    user = 'https://bugs.python.org/vitas1'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'vitas1'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['email']
    creation =
    creator = 'vitas1'
    dependencies = []
    files = ['50178']
    hgrepos = []
    issue_num = 44694
    keywords = []
    message_count = 2.0
    messages = ['397937', '398109']
    nosy_count = 3.0
    nosy_names = ['barry', 'r.david.murray', 'vitas1']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue44694'
    versions = ['Python 3.9']
    ```

    8bb1ccc7-7bd7-426d-82e1-439cc0687b19 commented 3 years ago

    Hello. Here is my code:

    # Parse message from file and immediately flatten it
    import email.policy
    from email.parser import BytesParser
    from email.generator import BytesGenerator

    cur_policy = email.policy.SMTPUTF8
    with open("/tmp/0.tmp", "rb") as orig_message_file:
        message_bytes = orig_message_file.read()
    message_parser = BytesParser(policy=cur_policy)
    msg = message_parser.parsebytes(message_bytes)
    with open("/tmp/1.tmp", "wb") as new_message_file:
        message_gen = BytesGenerator(new_message_file, policy=cur_policy)
        # Flatten while the output file is still open
        message_gen.flatten(msg)

    On some messages the script raises the following error:

    Traceback (most recent call last):
      File "/misc/parsemail/./1.py", line 34, in <module>
        message_gen.flatten(msg)
      File "/usr/lib/python3.9/email/generator.py", line 116, in flatten
        self._write(msg)
      File "/usr/lib/python3.9/email/generator.py", line 199, in _write
        self._write_headers(msg)
      File "/usr/lib/python3.9/email/generator.py", line 422, in _write_headers
        self._fp.write(self.policy.fold_binary(h, v))
      File "/usr/lib/python3.9/email/policy.py", line 200, in fold_binary
        folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
      File "/usr/lib/python3.9/email/policy.py", line 214, in _fold
        return self.header_factory(name, ''.join(lines)).fold(policy=self)
      File "/usr/lib/python3.9/email/headerregistry.py", line 257, in fold
        return header.fold(policy=policy)
      File "/usr/lib/python3.9/email/_header_value_parser.py", line 156, in fold
        return _refold_parse_tree(self, policy=policy)
      File "/usr/lib/python3.9/email/_header_value_parser.py", line 2825, in _refold_parse_tree
        last_ew = _fold_as_ew(tstr, lines, maxlen, last_ew,
      File "/usr/lib/python3.9/email/_header_value_parser.py", line 2913, in _fold_as_ew
        encoded_word = _ew.encode(to_encode_word, charset=encode_as)
      File "/usr/lib/python3.9/email/_encoded_words.py", line 222, in encode
        bstring = string.encode('ascii', 'surrogateescape')
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)

    Policies 'default' and 'SMTP' are also affected.

    Workaround:

    # For broken messages
    message_gen = BytesGenerator(new_message_file, policy=cur_policy, maxheaderlen=0)

    Still, parsing and flattening the same *unmodified* message should work without any additional parameters, shouldn't it? Thanks.

    bitdancer commented 3 years ago

    I suspect maxheaderlen=0 works because it causes the original lines to be re-emitted without any folding or other processing. Without that, lines longer than the default max_line_length get refolded.
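The difference between the two code paths can be tried in memory with a sketch like the one below. The input is an assumption for illustration (it is not the reporter's 0.msg): a Subject line longer than 78 characters whose two encoded words split a UTF-8 character between them. The `flatten` helper is a hypothetical convenience wrapper, and whether the default path raises depends on the Python version, so no output is claimed for it.

```python
import io
from email import message_from_bytes
from email.generator import BytesGenerator
from email.policy import SMTPUTF8

# Illustrative input (an assumption, not the reporter's 0.msg): a long Subject
# whose two base64 encoded words split one UTF-8 character between them.
raw_message = (
    b"Subject: =?utf-8?b?0LDQsdCy0LPSkdC00LXRlNC20LfQuNGW0ZfQudC60LvQvNC90L7Qv9GA0YHR?=\n"
    b"\t=?utf-8?b?gtGD0YTRhdGG0YfRiNGJ0YzRjtGP?=\n"
    b"\n"
)
msg = message_from_bytes(raw_message, policy=SMTPUTF8)


def flatten(message, **generator_kwargs):
    """Hypothetical helper: flatten `message` into bytes with BytesGenerator."""
    buffer = io.BytesIO()
    BytesGenerator(buffer, policy=SMTPUTF8, **generator_kwargs).flatten(message)
    return buffer.getvalue()


# Default settings: the over-long header is refolded, which is the code path
# shown in the traceback. Record the error instead of assuming it occurs.
try:
    flatten(msg)
    default_error = None
except UnicodeEncodeError as exc:
    default_error = exc

# maxheaderlen=0 removes the length limit, so the header's source lines are
# written back without refolding (line endings still follow the policy).
passthrough = flatten(msg, maxheaderlen=0)

print("default flatten error:", default_error)
print(passthrough.decode("ascii"))
```

With `maxheaderlen=0` both encoded words come back untouched, which is consistent with the suspicion that the failure lives in the refolding step and not in parsing.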

    Can you provide an example of an input message that triggers this problem?

    serhiy-storchaka commented 6 months ago

    Here is the complete example:

    # Parse message from file and immediately flatten it
    import email.policy
    from email.parser import BytesParser
    from email.generator import BytesGenerator
    cur_policy = email.policy.SMTPUTF8
    with open("0.msg", "rb") as orig_message_file:
        message_bytes = orig_message_file.read()
    
    message_parser = BytesParser(policy=cur_policy)
    msg = message_parser.parsebytes(message_bytes)
    with open("/tmp/1.tmp", "wb") as new_message_file:
        message_gen = BytesGenerator(new_message_file, policy=cur_policy)
        message_gen.flatten(msg)

    It produces the following traceback:

    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
        message_gen.flatten(msg)
        ~~~~~~~~~~~~~~~~~~~^^^^^
      File "/home/serhiy/py/cpython/Lib/email/generator.py", line 115, in flatten
        self._write(msg)
        ~~~~~~~~~~~^^^^^
      File "/home/serhiy/py/cpython/Lib/email/generator.py", line 198, in _write
        self._write_headers(msg)
        ~~~~~~~~~~~~~~~~~~~^^^^^
      File "/home/serhiy/py/cpython/Lib/email/generator.py", line 421, in _write_headers
        self._fp.write(self.policy.fold_binary(h, v))
                       ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
      File "/home/serhiy/py/cpython/Lib/email/policy.py", line 200, in fold_binary
        folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
                 ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/serhiy/py/cpython/Lib/email/policy.py", line 221, in _fold
        return self.header_factory(name, ''.join(lines)).fold(policy=self)
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
      File "/home/serhiy/py/cpython/Lib/email/headerregistry.py", line 253, in fold
        return header.fold(policy=policy)
               ~~~~~~~~~~~^^^^^^^^^^^^^^^
      File "/home/serhiy/py/cpython/Lib/email/_header_value_parser.py", line 156, in fold
        return _refold_parse_tree(self, policy=policy)
               ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
      File "/home/serhiy/py/cpython/Lib/email/_header_value_parser.py", line 2849, in _refold_parse_tree
        last_ew = _fold_as_ew(tstr, lines, maxlen, last_ew,
                  ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                              part.ew_combine_allowed, charset)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/serhiy/py/cpython/Lib/email/_header_value_parser.py", line 2938, in _fold_as_ew
        encoded_word = _ew.encode(to_encode_word, charset=encode_as)
                       ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/serhiy/py/cpython/Lib/email/_encoded_words.py", line 222, in encode
        bstring = string.encode('ascii', 'surrogateescape')
                  ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    UnicodeEncodeError: 'ascii' codec can't encode character '\u0421' in position 2: ordinal not in range(128)

    serhiy-storchaka commented 6 months ago

    Here is a simple self-sufficient example:

    import email
    from email.policy import SMTPUTF8
    
    msg = email.message_from_bytes(b'''\
    Subject: =?utf-8?b?0LDQsdCy0LPSkdC00LXRlNC20LfQuNGW0ZfQudC60LvQvNC90L7Qv9GA0YHR?=
    \t=?utf-8?b?gtGD0YTRhdGG0YfRiNGJ0YzRjtGP?=''', policy=SMTPUTF8)
    print(msg)

    How this example was created: a long non-ASCII string was encoded to bytes as UTF-8, the result was split into 45-byte chunks, which put a chunk boundary in the middle of a multibyte character, and each chunk was then base64-encoded.
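The construction can be reproduced with a short sketch. The text used here is the Ukrainian alphabet, which is what the two encoded words in the example decode to once their bytes are rejoined; the variable names and the `CHUNK` constant are illustrative.

```python
import base64

# 33 Cyrillic letters, each two bytes in UTF-8, so 66 bytes in total.
text = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"
raw = text.encode("utf-8")

# 45 is odd, so the first chunk ends after the first byte of a two-byte character.
CHUNK = 45
chunks = [raw[i:i + CHUNK] for i in range(0, len(raw), CHUNK)]

# Wrap every chunk as a base64 encoded word that claims to be utf-8.
encoded_words = [
    "=?utf-8?b?" + base64.b64encode(chunk).decode("ascii") + "?="
    for chunk in chunks
]

# Join the words with a folded continuation line, as in the example message.
header = "Subject: " + "\n\t".join(encoded_words)
print(header)
```

Splitting on a byte count like this ignores character boundaries, so neither encoded word is valid UTF-8 on its own.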

    When the email message is parsed, every chunk is base64-decoded and then UTF-8-decoded. Since a multibyte character was split between chunks, the decoded text contains surrogate escapes. The generator first checks whether the token, which contains both non-ASCII characters and surrogate escapes, can be encoded with the specified encoding (utf-8). That fails due to the surrogate escapes, so it sets the charset to 'unknown-8bit' and tries to encode the token with 'ascii' and 'surrogateescape'. That fails as well, because of the non-ASCII characters.
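The two failing encode attempts can be shown with plain `str` and `bytes` operations. This is a sketch that mimics the steps described above; it does not call the email package, and `try_encode` is a hypothetical helper.

```python
import base64

# The base64 payloads of the two encoded words from the example.
words = [
    "0LDQsdCy0LPSkdC00LXRlNC20LfQuNGW0ZfQudC60LvQvNC90L7Qv9GA0YHR",
    "gtGD0YTRhdGG0YfRiNGJ0YzRjtGP",
]

# Each word is decoded on its own; bytes that are not valid UTF-8 in that
# word become surrogate escapes in the range U+DC80..U+DCFF.
parts = [base64.b64decode(w).decode("utf-8", "surrogateescape") for w in words]
token = "".join(parts)


def try_encode(text, *codec_args):
    """Hypothetical helper: return the failure reason, or None on success."""
    try:
        text.encode(*codec_args)
        return None
    except UnicodeEncodeError as exc:
        return exc.reason


# Strict utf-8 rejects the surrogate escapes.
utf8_failure = try_encode(token, "utf-8")
# ascii + surrogateescape accepts the escapes but rejects the Cyrillic letters.
ascii_failure = try_encode(token, "ascii", "surrogateescape")
# utf-8 + surrogateescape handles both and restores the original bytes.
original_bytes = token.encode("utf-8", "surrogateescape")

print("utf-8:", utf8_failure)
print("ascii/surrogateescape:", ascii_failure)
print(len(original_bytes), "bytes recovered")
```

The last line of the sketch is the interesting one: no information was lost during parsing, so the token is encodable as long as the codec and the error handler are chosen to cover both kinds of characters.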

    A possible solution may be to split tokens that contain both non-ASCII characters and surrogate escapes into parts that contain only non-ASCII characters or only surrogate escapes, and to handle those parts separately. Another option is to keep the declared charset (which can be 'unknown-8bit') separate from the actual encoding.
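A minimal sketch of the first idea, splitting a token into runs, is shown below. It is not the CPython fix: `is_surrogate_escape`, `split_runs` and `encode_runs` are hypothetical names, and how the resulting pieces would be turned into encoded words is left open.

```python
from itertools import groupby


def is_surrogate_escape(ch):
    """True for characters produced by the 'surrogateescape' error handler."""
    return "\udc80" <= ch <= "\udcff"


def split_runs(token):
    """Split token into (is_escaped, text) runs of one kind of character."""
    return [
        (is_escaped, "".join(chars))
        for is_escaped, chars in groupby(token, key=is_surrogate_escape)
    ]


def encode_runs(token, charset="utf-8"):
    """Encode every run with a codec that can represent it."""
    pieces = []
    for is_escaped, run in split_runs(token):
        if is_escaped:
            # Escaped bytes: turn them back into the raw bytes they stand for.
            pieces.append(run.encode("ascii", "surrogateescape"))
        else:
            # Ordinary text: encode with the declared charset.
            pieces.append(run.encode(charset))
    return pieces


# Short illustrative token: Cyrillic text around the two escaped bytes of a
# character that was split between encoded words.
token = "абв\udcd1\udc82где"
print(encode_runs(token))
```

Because each run holds only one kind of character, neither encode call can hit the mixed case that triggers the traceback.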