Open 8bb1ccc7-7bd7-426d-82e1-439cc0687b19 opened 3 years ago
Hello. Here is my code:
# Parse message from file and immediately flatten it
cur_policy = email.policy.SMTPUTF8
with open("/tmp/0.tmp", "rb") as orig_message_file:
    message_bytes = orig_message_file.read()
message_parser = BytesParser(policy=cur_policy)
msg = message_parser.parsebytes(message_bytes)
with open("/tmp/1.tmp", "wb") as new_message_file:
    message_gen = BytesGenerator(new_message_file, policy=cur_policy)
    message_gen.flatten(msg)
On some messages the script raises the following error:
Traceback (most recent call last):
  File "/misc/parsemail/./1.py", line 34, in <module>
    message_gen.flatten(msg)
  File "/usr/lib/python3.9/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib/python3.9/email/generator.py", line 199, in _write
    self._write_headers(msg)
  File "/usr/lib/python3.9/email/generator.py", line 422, in _write_headers
    self._fp.write(self.policy.fold_binary(h, v))
  File "/usr/lib/python3.9/email/policy.py", line 200, in fold_binary
    folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
  File "/usr/lib/python3.9/email/policy.py", line 214, in _fold
    return self.header_factory(name, ''.join(lines)).fold(policy=self)
  File "/usr/lib/python3.9/email/headerregistry.py", line 257, in fold
    return header.fold(policy=policy)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 156, in fold
    return _refold_parse_tree(self, policy=policy)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2825, in _refold_parse_tree
    last_ew = _fold_as_ew(tstr, lines, maxlen, last_ew,
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2913, in _fold_as_ew
    encoded_word = _ew.encode(to_encode_word, charset=encode_as)
  File "/usr/lib/python3.9/email/_encoded_words.py", line 222, in encode
    bstring = string.encode('ascii', 'surrogateescape')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)
Policies 'default' and 'SMTP' are also affected.
How to fix:
#For broken messages
message_gen = BytesGenerator(new_message_file, policy=cur_policy, maxheaderlen=0)
Still, parsing and flattening the same *unmodified* message should work without any additional parameters, shouldn't it? Thanks.
I suspect maxheaderlen=0 works because it causes the original lines to be re-emitted without any folding or other processing. Without that, lines longer than the default max_line_length get refolded.
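The refolding condition described here can be observed with an ordinary long ASCII header. What follows is a minimal sketch, not taken from the report: the `Subject` text and the helper name `flatten_headers` are made up for illustration. It flattens the same parsed message once with the policy defaults and once with `maxheaderlen=0`.

```python
import io
from email import message_from_bytes
from email.generator import BytesGenerator
from email.policy import SMTPUTF8

# Hypothetical input: one ASCII-only header line longer than max_line_length.
raw = b"Subject: " + b" ".join([b"word"] * 30) + b"\r\n\r\nbody\r\n"
msg = message_from_bytes(raw, policy=SMTPUTF8)


def flatten_headers(message, **generator_kwargs):
    """Flatten `message` and return its header block as a list of lines."""
    buf = io.BytesIO()
    BytesGenerator(buf, policy=SMTPUTF8, **generator_kwargs).flatten(message)
    header_block = buf.getvalue().split(b"\r\n\r\n", 1)[0]
    return header_block.split(b"\r\n")


refolded = flatten_headers(msg)                   # default line limit applies
untouched = flatten_headers(msg, maxheaderlen=0)  # 0 disables the line limit

print(SMTPUTF8.max_line_length)       # the limit that triggers refolding
print(len(refolded), len(untouched))  # number of physical header lines
```

With the default limit the header comes back folded over several lines, while `maxheaderlen=0` writes the source line back as it was parsed. This only illustrates the length check; it does not reproduce the reported failure.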
Can you provide an example of an input message that triggers this problem?
Here is the complete example:
# Parse message from file and immediately flatten it
import email.policy
from email.parser import BytesParser
from email.generator import BytesGenerator
cur_policy = email.policy.SMTPUTF8
with open("0.msg", "rb") as orig_message_file:
    message_bytes = orig_message_file.read()
message_parser = BytesParser(policy=cur_policy)
msg = message_parser.parsebytes(message_bytes)
with open("/tmp/1.tmp", "wb") as new_message_file:
    message_gen = BytesGenerator(new_message_file, policy=cur_policy)
    message_gen.flatten(msg)
It produces the following traceback:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
    message_gen.flatten(msg)
    ~~~~~~~~~~~~~~~~~~~^^^^^
  File "/home/serhiy/py/cpython/Lib/email/generator.py", line 115, in flatten
    self._write(msg)
    ~~~~~~~~~~~^^^^^
  File "/home/serhiy/py/cpython/Lib/email/generator.py", line 198, in _write
    self._write_headers(msg)
    ~~~~~~~~~~~~~~~~~~~^^^^^
  File "/home/serhiy/py/cpython/Lib/email/generator.py", line 421, in _write_headers
    self._fp.write(self.policy.fold_binary(h, v))
                   ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/policy.py", line 200, in fold_binary
    folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
             ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/policy.py", line 221, in _fold
    return self.header_factory(name, ''.join(lines)).fold(policy=self)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/headerregistry.py", line 253, in fold
    return header.fold(policy=policy)
           ~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/_header_value_parser.py", line 156, in fold
    return _refold_parse_tree(self, policy=policy)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/_header_value_parser.py", line 2849, in _refold_parse_tree
    last_ew = _fold_as_ew(tstr, lines, maxlen, last_ew,
              ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                          part.ew_combine_allowed, charset)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/_header_value_parser.py", line 2938, in _fold_as_ew
    encoded_word = _ew.encode(to_encode_word, charset=encode_as)
                   ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Lib/email/_encoded_words.py", line 222, in encode
    bstring = string.encode('ascii', 'surrogateescape')
              ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'ascii' codec can't encode character '\u0421' in position 2: ordinal not in range(128)
Here is a simple self-contained example:
import email
from email.policy import SMTPUTF8
msg = email.message_from_bytes(b'''\
Subject: =?utf-8?b?0LDQsdCy0LPSkdC00LXRlNC20LfQuNGW0ZfQudC60LvQvNC90L7Qv9GA0YHR?=
\t=?utf-8?b?gtGD0YTRhdGG0YfRiNGJ0YzRjtGP?=''', policy=SMTPUTF8)
print(msg)
How this example was created: a long non-ASCII string was encoded to bytes, the result was split into 45-byte chunks with a chunk boundary falling in the middle of a multibyte character, and then each chunk was base64-encoded.
When the email message is parsed, every chunk is base64-decoded and then UTF-8-decoded. Since the multibyte character was split between chunks, we get surrogate escapes. The generator then checks whether the token, which contains both non-ASCII characters and surrogate escapes, can be encoded with the specified encoding (utf-8). That fails due to the surrogate escapes, so it sets the charset to 'unknown-8bit' and tries to encode the token with 'ascii' and 'surrogateescape'. That fails too, because of the non-ASCII characters.
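The codec behaviour behind this can be checked without the email package. The sketch below takes the two base64 payloads from the Subject header in the example above and mimics the per-chunk decoding with plain `bytes.decode`; the helper `try_encode` is a made-up name for illustration, and the real parser and generator code paths are more involved.

```python
import base64

# The two base64 payloads from the Subject header in the example above.
payloads = [
    "0LDQsdCy0LPSkdC00LXRlNC20LfQuNGW0ZfQudC60LvQvNC90L7Qv9GA0YHR",
    "gtGD0YTRhdGG0YfRiNGJ0YzRjtGP",
]
chunks = [base64.b64decode(p) for p in payloads]

# Decoded as one byte string, the text is valid UTF-8.
whole = b"".join(chunks).decode("utf-8")

# Decoded chunk by chunk, the two bytes of the character that was split
# across the boundary turn into surrogate escapes.
parts = [c.decode("utf-8", "surrogateescape") for c in chunks]
token = "".join(parts)
print(ascii(parts[0][-1]), ascii(parts[1][0]))


def try_encode(text, *codec_args):
    """Return the encoded bytes, or the UnicodeEncodeError that was raised."""
    try:
        return text.encode(*codec_args)
    except UnicodeEncodeError as exc:
        return exc


# Strict utf-8 rejects the surrogate escapes.
strict_utf8 = try_encode(token, "utf-8")
# ascii + surrogateescape rejects the Cyrillic letters.
ascii_escape = try_encode(token, "ascii", "surrogateescape")
# utf-8 + surrogateescape accepts both and restores the original bytes.
roundtrip = try_encode(token, "utf-8", "surrogateescape")
print(type(strict_utf8).__name__, type(ascii_escape).__name__)
```

Neither of the two encodings tried by the generator can represent the mixed token on its own, which matches the traceback above.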
A possible solution may be to split tokens that contain both non-ASCII characters and surrogate escapes into parts that contain only non-ASCII characters or only surrogate escapes, and handle them separately. Another option is to track the declared charset (which can be 'unknown-8bit') separately from the actual encoding.
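The first idea could look roughly like the following. This is only a sketch of the splitting step under stated assumptions, not the change made in CPython: `is_surrogate_escape` and `encode_in_runs` are hypothetical helpers, and the sample token is made up.

```python
from itertools import groupby


def is_surrogate_escape(ch):
    """True for the code points surrogateescape uses for raw bytes 0x80-0xFF."""
    return "\udc80" <= ch <= "\udcff"


def encode_in_runs(token, charset="utf-8"):
    """Hypothetical helper: split `token` into runs and encode each separately.

    Returns a list of (charset_label, encoded_bytes) pairs.
    """
    encoded = []
    for escaped, group in groupby(token, key=is_surrogate_escape):
        run = "".join(group)
        if escaped:
            # Raw bytes of unknown meaning: recover them unchanged.
            encoded.append(("unknown-8bit", run.encode("ascii", "surrogateescape")))
        else:
            # Ordinary text: the declared charset can encode it.
            encoded.append((charset, run.encode(charset)))
    return encoded


# Hypothetical token: two Cyrillic letters, two escaped bytes, one more letter.
token = "\u0430\u0431\udcd1\udc82\u0432"
result = encode_in_runs(token)
print(result)
```

Each run can then be encoded without error. How the resulting pieces would be emitted as encoded words is left open here.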
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', 'expert-email', '3.9']
title = 'Message from BytesParser cannot be flattened immediately'
updated_at =
user = 'https://bugs.python.org/vitas1'
```
bugs.python.org fields:
```python
activity =
actor = 'vitas1'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['email']
creation =
creator = 'vitas1'
dependencies = []
files = ['50178']
hgrepos = []
issue_num = 44694
keywords = []
message_count = 2.0
messages = ['397937', '398109']
nosy_count = 3.0
nosy_names = ['barry', 'r.david.murray', 'vitas1']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue44694'
versions = ['Python 3.9']
```