Message from BytesParser cannot be flattened immediately

8bb1ccc7-7bd7-426d-82e1-439cc0687b19 commented 3 years ago

BPO	44694
Nosy	@warsaw, @bitdancer
Files	0.msg

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', 'expert-email', '3.9']
title = 'Message from BytesParser cannot be flattened immediately'
updated_at =
user = 'https://bugs.python.org/vitas1'
```

bugs.python.org fields:

```python
activity =
actor = 'vitas1'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['email']
creation =
creator = 'vitas1'
dependencies = []
files = ['50178']
hgrepos = []
issue_num = 44694
keywords = []
message_count = 2.0
messages = ['397937', '398109']
nosy_count = 3.0
nosy_names = ['barry', 'r.david.murray', 'vitas1']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue44694'
versions = ['Python 3.9']
```

8bb1ccc7-7bd7-426d-82e1-439cc0687b19 commented 3 years ago

Hello. Here is my code:

# Parse a message from a file and immediately flatten it
import email.policy
from email.parser import BytesParser
from email.generator import BytesGenerator

cur_policy = email.policy.SMTPUTF8
with open("/tmp/0.tmp", "rb") as orig_message_file:
    message_bytes = orig_message_file.read()
message_parser = BytesParser(policy=cur_policy)
msg = message_parser.parsebytes(message_bytes)
with open("/tmp/1.tmp", "wb") as new_message_file:
    message_gen = BytesGenerator(new_message_file, policy=cur_policy)
    # flatten() must run while the output file is still open
    message_gen.flatten(msg)

On some messages the script raises the following error:

Traceback (most recent call last):
  File "/misc/parsemail/./1.py", line 34, in <module>
    message_gen.flatten(msg)
  File "/usr/lib/python3.9/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib/python3.9/email/generator.py", line 199, in _write
    self._write_headers(msg)
  File "/usr/lib/python3.9/email/generator.py", line 422, in _write_headers
    self._fp.write(self.policy.fold_binary(h, v))
  File "/usr/lib/python3.9/email/policy.py", line 200, in fold_binary
    folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
  File "/usr/lib/python3.9/email/policy.py", line 214, in _fold
    return self.header_factory(name, ''.join(lines)).fold(policy=self)
  File "/usr/lib/python3.9/email/headerregistry.py", line 257, in fold
    return header.fold(policy=policy)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 156, in fold
    return _refold_parse_tree(self, policy=policy)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2825, in _refold_parse_tree
    last_ew = _fold_as_ew(tstr, lines, maxlen, last_ew,
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2913, in _fold_as_ew
    encoded_word = _ew.encode(to_encode_word, charset=encode_as)
  File "/usr/lib/python3.9/email/_encoded_words.py", line 222, in encode
    bstring = string.encode('ascii', 'surrogateescape')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)

Policies 'default' and 'SMTP' are also affected.

How to fix:

# For broken messages
message_gen = BytesGenerator(new_message_file, policy=cur_policy, maxheaderlen=0)

Still, parsing and flattening the same *unmodified* message should work without any additional parameters, shouldn't it? Thanks.

bitdancer commented 3 years ago

I suspect maxheaderlen=0 works because it causes the original lines to be re-emitted without any folding or other processing. Without that, lines longer than the default max_line_length get refolded.
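To make the difference concrete, here is a minimal sketch that flattens one message twice, once with default settings and once with maxheaderlen=0. It borrows the two-line Subject header from the self-contained example posted further down this thread; the `flatten` helper and the `default_result` variable are names made up for this illustration. Whether the default call fails depends on the Python version in use, so the sketch records the outcome instead of assuming it.

```python
import email
import io
from email.generator import BytesGenerator
from email.policy import SMTPUTF8

# Header taken from the self-contained example later in this thread.
raw = (
    b"Subject: =?utf-8?b?0LDQsdCy0LPSkdC00LXRlNC20LfQuNGW0ZfQudC60LvQvNC90L7Qv9GA0YHR?=\n"
    b"\t=?utf-8?b?gtGD0YTRhdGG0YfRiNGJ0YzRjtGP?="
)
msg = email.message_from_bytes(raw, policy=SMTPUTF8)


def flatten(message, **generator_kwargs):
    """Hypothetical helper: flatten `message` into bytes."""
    buf = io.BytesIO()
    BytesGenerator(buf, policy=SMTPUTF8, **generator_kwargs).flatten(message)
    return buf.getvalue()


# Default settings: the long first header line is a candidate for refolding.
try:
    flatten(msg)
    default_result = "ok"
except UnicodeEncodeError as exc:
    default_result = f"failed: {exc}"

# maxheaderlen=0 disables the length limit, so the source lines are re-emitted.
unfolded = flatten(msg, maxheaderlen=0)

print("default:", default_result)
print("maxheaderlen=0:", unfolded)
```

With maxheaderlen=0 the encoded words come out exactly as they went in; only the line separator changes to the policy's linesep.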

Can you provide an example of an input message that triggers this problem?

serhiy-storchaka commented 6 months ago

Here is the complete example:

#Parse message from file and immediately flatten it
import email.policy
from email.parser import BytesParser
from email.generator import BytesGenerator
cur_policy = email.policy.SMTPUTF8
with open("0.msg", "rb") as orig_message_file:
    message_bytes = orig_message_file.read()

message_parser = BytesParser(policy=cur_policy)
msg = message_parser.parsebytes(message_bytes)
with open("/tmp/1.tmp", "wb") as new_message_file:
    message_gen = BytesGenerator(new_message_file, policy=cur_policy)
    message_gen.flatten(msg)

It produces the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
    message_gen.flatten(msg)
    ~~~~~~~~~~~~~~~~~~~^^^^^
  File "/home/serhiy/py/cpython/Lib/email/generator.py", line 115, in flatten
    self._write(msg)
    ~~~~~~~~~~~^^^^^
  File "/home/serhiy/py/cpython/Lib/email/generator.py", line 198, in _write
    self._write_headers(msg)
    ~~~~~~~~~~~~~~~~~~~^^^^^
  File "/home/serhiy/py/cpython/Lib/email/generator.py", line 421, in _write_headers
    self._fp.write(self.policy.fold_binary(h, v))
                   ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/policy.py", line 200, in fold_binary
    folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
             ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/policy.py", line 221, in _fold
    return self.header_factory(name, ''.join(lines)).fold(policy=self)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/headerregistry.py", line 253, in fold
    return header.fold(policy=policy)
           ~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/_header_value_parser.py", line 156, in fold
    return _refold_parse_tree(self, policy=policy)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/_header_value_parser.py", line 2849, in _refold_parse_tree
    last_ew = _fold_as_ew(tstr, lines, maxlen, last_ew,
              ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                          part.ew_combine_allowed, charset)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/_header_value_parser.py", line 2938, in _fold_as_ew
    encoded_word = _ew.encode(to_encode_word, charset=encode_as)
                   ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/_encoded_words.py", line 222, in encode
    bstring = string.encode('ascii', 'surrogateescape')
              ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'ascii' codec can't encode character '\u0421' in position 2: ordinal not in range(128)

serhiy-storchaka commented 6 months ago

Here is a simple self-sufficient example:

import email
from email.policy import SMTPUTF8

msg = email.message_from_bytes(b'''\
Subject: =?utf-8?b?0LDQsdCy0LPSkdC00LXRlNC20LfQuNGW0ZfQudC60LvQvNC90L7Qv9GA0YHR?=
\t=?utf-8?b?gtGD0YTRhdGG0YfRiNGJ0YzRjtGP?=''', policy=SMTPUTF8)
print(msg)

How to create such an example: a long non-ASCII string was encoded to bytes, the result was split into 45-byte chunks (so a split falls in the middle of a multibyte character), then each chunk was base64-encoded.
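The construction can be checked against the header above. This sketch decodes the two base64 payloads from the example, joins them to recover the original text, and then repeats the 45-byte split and re-encoding. The variable names are chosen for this illustration only.

```python
import base64

# The two encoded-word payloads from the Subject header above.
payloads = [
    "0LDQsdCy0LPSkdC00LXRlNC20LfQuNGW0ZfQudC60LvQvNC90L7Qv9GA0YHR",
    "gtGD0YTRhdGG0YfRiNGJ0YzRjtGP",
]

# Joined together, the bytes form valid UTF-8 again.
raw = b"".join(base64.b64decode(p) for p in payloads)
text = raw.decode("utf-8")

# Re-create the header payloads: split into 45-byte chunks, then base64-encode.
chunks = [raw[i:i + 45] for i in range(0, len(raw), 45)]
rebuilt = [base64.b64encode(chunk).decode("ascii") for chunk in chunks]

# Every character here takes 2 bytes, so an odd chunk size cuts one in half:
# the first chunk on its own is not valid UTF-8.
try:
    chunks[0].decode("utf-8")
    first_chunk_is_valid = True
except UnicodeDecodeError:
    first_chunk_is_valid = False

print(text)
print(rebuilt == payloads, first_chunk_is_valid)
```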

When the email message is parsed, every chunk is base64-decoded and then UTF-8-decoded on its own. Since a multibyte character was split between chunks, we get surrogate escapes. The generator then first checks whether the token, which contains both non-ASCII characters and surrogate escapes, can be encoded with the specified encoding (utf-8). That fails due to the surrogate escapes, so it sets the charset to 'unknown-8bit' and tries to encode the token with 'ascii' and 'surrogateescape'. That fails as well, this time because of the non-ASCII characters.
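The two failing encode attempts can be reproduced with plain str and bytes methods, without the email package. This is only a sketch of the mechanism described above, using a short three-letter string as a stand-in for a real header; `try_encode` is a hypothetical helper.

```python
# Three Cyrillic letters, 2 bytes each in UTF-8 (6 bytes total).
raw = "\u0430\u0431\u0432".encode("utf-8")

# Split in the middle of the second character and decode each half on its own.
first = raw[:3].decode("utf-8", "surrogateescape")
second = raw[3:].decode("utf-8", "surrogateescape")
token = first + second  # mixes real non-ASCII characters with surrogate escapes


def try_encode(value, *codec_args):
    """Hypothetical helper: report whether value.encode(*codec_args) succeeds."""
    try:
        value.encode(*codec_args)
        return True
    except UnicodeEncodeError:
        return False


# Strict utf-8 rejects the surrogate escapes.
utf8_ok = try_encode(token, "utf-8")
# ascii + surrogateescape rejects the real non-ASCII characters.
ascii_ok = try_encode(token, "ascii", "surrogateescape")
# utf-8 + surrogateescape handles both and restores the original bytes.
round_trip = token.encode("utf-8", "surrogateescape")

print(utf8_ok, ascii_ok, round_trip == raw)
```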

A possible solution may be to split tokens that contain both non-ASCII characters and surrogate escapes into parts that contain only non-ASCII characters or only surrogate escapes, and handle them separately. Another option is to keep the declared charset (which can be 'unknown-8bit') separate from the actual encoding.
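As a rough illustration of the first idea, the sketch below splits a string into runs of surrogate escapes and runs of ordinary characters, then encodes each run with a codec that can handle it. All function names here are hypothetical, and this is not taken from the email package or from any actual fix; it only shows that the split makes both kinds of content encodable.

```python
from itertools import groupby


def is_surrogate_escape(ch):
    # The surrogateescape handler maps undecodable bytes 0x80-0xFF
    # to the code points U+DC80-U+DCFF.
    return "\udc80" <= ch <= "\udcff"


def split_runs(token):
    """Split token into (is_escape, text) runs of uniform kind."""
    return [
        (is_escape, "".join(run))
        for is_escape, run in groupby(token, key=is_surrogate_escape)
    ]


def encode_runs(token, charset="utf-8"):
    """Encode each run separately; returns a list of (charset label, bytes)."""
    parts = []
    for is_escape, run in split_runs(token):
        if is_escape:
            # Raw undecodable bytes: recover them via surrogateescape.
            parts.append(("unknown-8bit", run.encode("ascii", "surrogateescape")))
        else:
            parts.append((charset, run.encode(charset)))
    return parts


# One complete 2-byte character followed by half of another one.
token = b"\xd0\xb0\xd0".decode("utf-8", "surrogateescape")
parts = encode_runs(token)
print(parts)
```

How such parts would then be folded into encoded words is a separate question that this sketch does not address.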

python / cpython

Message from BytesParser cannot be flattened immediately #88860