Closed lucc closed 5 years ago
Is the problem here the trailing semicolon in the header value Content-Transfer-Encoding: "7bit;"
?
It doesn't look like it. I just updated #1321 and manually tried to remove the semicolon but the number of expected failures stays the same.
strange. It looks as if the semi-colon is a delimeter between optional parameters in the header values, and that the content-transfer-encoding value should not include this semi-colon: https://www.w3.org/Protocols/rfc1341/5_Content-Transfer-Encoding.html
In fact, all examples i found on the web have no trailing semo-colon.
I cannot reproduce using you corpus branch. If i reomve the "expected failure" decorator in the test, i get this:
python3 -m unittest tests.db.utils_test.TestRemoveCte.test_issue_1301
E
======================================================================
ERROR: test_issue_1301 (tests.db.utils_test.TestRemoveCte)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/pazz/projects/alot/tests/db/utils_test.py", line 747, in test_issue_1301
mail = email.message_from_file(fp)
File "/usr/lib/python3.6/email/__init__.py", line 54, in message_from_file
return Parser(*args, **kws).parse(fp)
File "/usr/lib/python3.6/email/parser.py", line 54, in parse
data = fp.read(8192)
File "/usr/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 16: invalid continuation byte
----------------------------------------------------------------------
Ran 1 test in 0.001s
FAILED (errors=1)
So the test file is broken in some other way as well ... That means I have to revisit this thoroughly.
But somehow I wonder why the email module tries to parse this as UTF8 and not, as indicated by the header, as ASCII (7bit).
The testmail is problematic for an unrelated reason: quoting the wiki:
Since RFC 2822, conforming message header names and values should be ASCII characters; values that contain non-ASCII data should use the MIME encoded-word syntax (RFC 2047) instead of a literal string.
The subject is neither ascii nor encoded-word.
Also, playing around with it a bit:
In [17]: f=open('tests/static/mail/latin-1.eml')
In [18]: f.read()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-18-571e9fb02258> in <module>()
----> 1 f.read()
/usr/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 16: invalid continuation byte
You can of course add errors='ignore'
to the call to open
, but this just replaces the umlauts with spaces.
But look what works nicely:
In [47]: f=open('tests/static/mail/latin-1.eml','b+r')
In [48]: with open('tests/static/mail/latin-1.eml','b+r') as b:
...: m = email.message_from_bytes(b.read())
...:
In [49]: m.get_payload()
Out[49]: 'Föö Bär Bäz\n'
strangely, what does not work is
email.message_from_file(open('tests/static/mail/latin-1.eml'))
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
...
/usr/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 16: invalid continuation byte
Bottom line: I believe that we should use open('tests/static/mail/latin-1.eml','b+r')
when opening email files in tests (and perhaps also in the UI).
@lucc: I can now reproduce this (if you open the message in binary mode, see this branch). If you remove the expected-failure-decorator you can trigger the issue wth
python3 -m unittest tests.db.utils_test.TestRemoveCte.test_issue_1301
I am almost certain that the trailing semi-colon is the problem though, meaning that this message is actually malformed and should not occurr in the wild.
I do not remember why we actually needed utils.remove_cte
. It seems troublesome and possibly outdated now that python 3's email module has been overhauled? @dcbaker ?
I am not convinced that the original issue is caused by an unexpected trailing semi-colon. This issue appears multiple times around the web. I have amended your original test case to reflect this in my fork (#1342) of your original PR, @lucc . I am happy with the expected failure (plus test) as it is now but we can of course discuss if it'd be appropriate to be lenient and, as some other projects, interpret "7bit;" as "7bit".
I think it might be reasonable to be more lenient in this case.
I was pondering about how to find other potential problematic mails and came up with this:
cd ~/vcs/alot
mkdir logs
touch empty
for thread in $(notmuch search --output=threads '*'); do
PYTHONPATH=. \
python -m alot \
--debug-level debug \
--logfile ./logs/$thread.log \
--config empty \
search $thread ';' select ';' call 'os._exit(0)'
done
This will open each thread in your notmuch database in an alot instance and save the log to a file. Attention: This will take a long time!! For me it runs at about 110 threads per minute. (Maybe this can be improved upon but I will just keep the computer running overnight I think.) You can then search the log files with grep -vER '^(DEBUG|INFO)' logs
and can extract problematic messages from the hits with notmuch seach --output=files thread:000...
.
Ok it "only" took 2.5 hours for 16k threads.I found 64 threads that throw an exception. These are some of the interesting ones:
AssertionError: b'0'
AttributeError: 'Group' object has no attribute 'local_part'
AttributeError: 'NoneType' object has no attribute 'selectable'
binascii.Error: Invalid base64-encoded string: number of data characters (18745) cannot be 1 more than a multiple of 4
IndexError: string index out of range
LookupError: unknown encoding: unknown-8bit
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 129: ordinal not in range(128)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 195: invalid start byte
ValueError: Unknown Content-Transfer-Encoding: "7bit;"
I will try to extract exemplary messages and send PRs for our new mail corpus.
Whoever is motivated to do the same with their mail store is invited to do so.
Very very nice idea to for bulk testing! looking forward to your new test cases. (but let's release v0.8 before we adress them) :)
I don't mind. We can also address all of these in 0.8.1.
You originally added this issue to 0.8 and I agree that one should fix known issues before releasing a new version but if we postpone the other we could also postpone this issue for 0.8.1. I don't care.
I'm running it right now and am already loving what I see:
grep -ER 'ValueError' logs
logs/thread:000000000000f27b.log:ValueError: Unknown Content-Transfer-Encoding: "quoted-printablee"
logs/thread:000000000000f04c.log:ValueError: Unknown Content-Transfer-Encoding: "quoted-printablessss"
logs/thread:00000000000150c2.log:ValueError: Unknown Content-Transfer-Encoding: "quoted-printabley"
logs/thread:0000000000016149.log:ValueError: Unknown Content-Transfer-Encoding: "quoted-printabledfg"
logs/thread:0000000000015571.log:ValueError: Unknown Content-Transfer-Encoding: "quoted-printablesdf"
All spam mail btw. I sense a pattern :)
I've written a little script to get rid of duplicates as much as possible. See here. For the script and a bunch of unique tracebacks I extracted from my database. Hopefully not all of them will end up being unique issues..
Regarding this issue here, I agree that we could be more lenient, so I propose that we
if "7bit" in cte:
... in alot.db.utils.remove_cte
, and I would rather interpret the "7bit;" as a mis-implementation of the Content-Transfer-Encoding header which is guided by the Content-Type where a semicolon is used to separate different parts of the header value. Just in this case there is only one field so the "field" after the semicolon is empty.
The examples of spam mail above already demonstrate why a simple substring search might be problematic.
Instead I suggest to call .strip("; ")
before comparing.
By "update out tests" do you mean to introduce (failing?) test cases for all the stuff we find when searching our whole mail archives for exceptions?
Quoting Lucas Hoffmann (2018-12-04 20:47:17)
I would rather interpret the "7bit;" as a mis-implementation of the Content-Transfer-Encoding header which is guided by the Content-Type where a semicolon is used to separate different parts of the header value. Just in this case there is only one field so the "field" after the semicolon is empty.
I don't get it
The examples of spam mail above already demonstrate why a simple substring search might be problematic.
actually I think my examples show that a substring check for "quoted-printable" makes sense: after all the following payloads are more likely to contain quoted-printable than 7bit. I was suggesting to use a subset-check for all known types, interpret the first one that matches, and ignore all resulting encoding errors. Plus report in the logs if the CTE value is bad.
Instead I suggest to call
.strip("; ")
before comparing.
How would that leave the "quoted-printableee" mails?
By "update out tests" do you mean to introduce (failing?) test cases for all the stuff we find when searching our whole mail archives for exceptions?
s/out/our/g sorry. I meant to update the current test for this issue, which succeeds in case a ValueError is raised. Instead, if my above proposal is interpreted, this test should fail :)
the first commit also "fixes" #1291 as discussed in that issue, by adding the 'ignore' error handler
(merged)
Describe the bug I open a mail from search mode and get an exception
Unknown Content-Transfer-Encoding: "7bit;"
.Software Versions
To Reproduce
--54ddea76248a53cb2812bf32e5c2b160 Content-Type: text/plain; charset=iso-8859-1; Content-Transfer-Encoding: 7bit;
Föö Bär Bäz
--54ddea76248a53cb2812bf32e5c2b160
DEBUG:ui:Got key (['enter'], [10]) DEBUG:ui:cmdline: 'select' DEBUG:ui:search command string: "select" DEBUG:init:mode:search got commandline "select" DEBUG:init:ARGS: ['select'] DEBUG:init:cmd parms {} INFO:search:open thread view for thread:00000000000050bc: 7bit Tränsfer Encöding with lätin1 in bödy and sübject DEBUG:utils:unquoted header: |sender| DEBUG:utils:unquoted header: |sender| DEBUG:utils:unquoted header: |me| DEBUG:utils:unquoted header: |7bit Tr�nsfer Enc�ding with l�tin1 in b�dy and s�bject| ERROR:ui:Traceback (most recent call last): File "/home/luc/vcs/alot/alot/ui.py", line 711, in apply_command cmd.apply(self) File "/home/luc/vcs/alot/alot/commands/search.py", line 41, in apply sb = buffers.ThreadBuffer(ui, self.thread) File "/home/luc/vcs/alot/alot/buffers/thread.py", line 37, in init self.rebuild() File "/home/luc/vcs/alot/alot/buffers/thread.py", line 65, in rebuild self._tree = ThreadTree(self.thread) File "/home/luc/vcs/alot/alot/widgets/thread.py", line 352, in init accumulate(msg) File "/home/luc/vcs/alot/alot/widgets/thread.py", line 331, in accumulate self._message[mid] = MessageTree(msg, odd) File "/home/luc/vcs/alot/alot/widgets/thread.py", line 172, in init self._maintree = SimpleTree(self._assemble_structure()) File "/home/luc/vcs/alot/alot/widgets/thread.py", line 205, in _assemble_structure bodytree = self._get_body() File "/home/luc/vcs/alot/alot/widgets/thread.py", line 238, in _get_body bodytxt = self._message.accumulate_body() File "/home/luc/vcs/alot/alot/db/message.py", line 266, in accumulate_body bodytext = extract_body(eml) File "/home/luc/vcs/alot/alot/db/utils.py", line 468, in extract_body body_parts.append(string_sanitize(remove_cte(part, as_string=True))) File "/home/luc/vcs/alot/alot/db/utils.py", line 413, in remove_cte 'Unknown Content-Transfer-Encoding: "{}"'.format(cte)) Exception: Unknown Content-Transfer-Encoding: "7bit;"
DEBUG:ui:Got key (['q'], [81]) DEBUG:ui:cmdline: 'exit' DEBUG:ui:search command string: "exit" DEBUG:init:mode:search got commandline "exit" DEBUG:init:ARGS: ['exit'] DEBUG:init:cmd parms {} DEBUG:ui:Got key (['y'], [121]) DEBUG:globals:flush complete