Exception: Unknown Content-Transfer-Encoding: "7bit;"

lucc commented 6 years ago

Describe the bug I open a mail from search mode and get an exception Unknown Content-Transfer-Encoding: "7bit;".

Software Versions

Python version: 3.7.0 (arch linux)
Notmuch version: 0.27 (arch linux)
Alot version: master (0.7-262-g536f39b0)

To Reproduce

Take this mail:


Date: 6 Nov 2016 20:44:35 +0100
Message-ID: <7bit-exception@alot.dev>
To: me
Subject: 7bit Tränsfer Encöding with lätin1 in bödy and sübject
X-PHP-Originating-Script: 10000:lib.php
From: sender
MIME-Version: 1.0;
Content-Type: multipart/mixed; boundary="54ddea76248a53cb2812bf32e5c2b160";
Envelope-To: test@alot.dev

--54ddea76248a53cb2812bf32e5c2b160 Content-Type: text/plain; charset=iso-8859-1; Content-Transfer-Encoding: 7bit;

Föö Bär Bäz

--54ddea76248a53cb2812bf32e5c2b160

- Make sure it is encoded as latin1.
- Add it to the mail store.
- start alot by searching for this mail: `alot --debug-level debug --logfile alot.log search id:7bit-exception@alot.dev`
- press enter. See the exception below in your log.

**Error Log**
(Probably unrelated but still interesting: Notice that the subject header is readable in search mode but is mangled in the debug output when decoding the headers for the thread buffer.)

DEBUG:ui:Got key (['enter'], [10]) DEBUG:ui:cmdline: 'select' DEBUG:ui:search command string: "select" DEBUG:init:mode:search got commandline "select" DEBUG:init:ARGS: ['select'] DEBUG:init:cmd parms {} INFO:search:open thread view for thread:00000000000050bc: 7bit Tränsfer Encöding with lätin1 in bödy and sübject DEBUG:utils:unquoted header: |sender| DEBUG:utils:unquoted header: |sender| DEBUG:utils:unquoted header: |me| DEBUG:utils:unquoted header: |7bit Tr�nsfer Enc�ding with l�tin1 in b�dy and s�bject| ERROR:ui:Traceback (most recent call last): File "/home/luc/vcs/alot/alot/ui.py", line 711, in apply_command cmd.apply(self) File "/home/luc/vcs/alot/alot/commands/search.py", line 41, in apply sb = buffers.ThreadBuffer(ui, self.thread) File "/home/luc/vcs/alot/alot/buffers/thread.py", line 37, in init self.rebuild() File "/home/luc/vcs/alot/alot/buffers/thread.py", line 65, in rebuild self._tree = ThreadTree(self.thread) File "/home/luc/vcs/alot/alot/widgets/thread.py", line 352, in init accumulate(msg) File "/home/luc/vcs/alot/alot/widgets/thread.py", line 331, in accumulate self._message[mid] = MessageTree(msg, odd) File "/home/luc/vcs/alot/alot/widgets/thread.py", line 172, in init self._maintree = SimpleTree(self._assemble_structure()) File "/home/luc/vcs/alot/alot/widgets/thread.py", line 205, in _assemble_structure bodytree = self._get_body() File "/home/luc/vcs/alot/alot/widgets/thread.py", line 238, in _get_body bodytxt = self._message.accumulate_body() File "/home/luc/vcs/alot/alot/db/message.py", line 266, in accumulate_body bodytext = extract_body(eml) File "/home/luc/vcs/alot/alot/db/utils.py", line 468, in extract_body body_parts.append(string_sanitize(remove_cte(part, as_string=True))) File "/home/luc/vcs/alot/alot/db/utils.py", line 413, in remove_cte 'Unknown Content-Transfer-Encoding: "{}"'.format(cte)) Exception: Unknown Content-Transfer-Encoding: "7bit;"

DEBUG:ui:Got key (['q'], [81]) DEBUG:ui:cmdline: 'exit' DEBUG:ui:search command string: "exit" DEBUG:init:mode:search got commandline "exit" DEBUG:init:ARGS: ['exit'] DEBUG:init:cmd parms {} DEBUG:ui:Got key (['y'], [121]) DEBUG:globals:flush complete

pazz commented 5 years ago

Is the problem here the trailing semicolon in the header value Content-Transfer-Encoding: "7bit;" ?

lucc commented 5 years ago

It doesn't look like it. I just updated #1321 and manually tried to remove the semicolon but the number of expected failures stays the same.

pazz commented 5 years ago

strange. It looks as if the semi-colon is a delimeter between optional parameters in the header values, and that the content-transfer-encoding value should not include this semi-colon: https://www.w3.org/Protocols/rfc1341/5_Content-Transfer-Encoding.html

In fact, all examples i found on the web have no trailing semo-colon.

pazz commented 5 years ago

I cannot reproduce using you corpus branch. If i reomve the "expected failure" decorator in the test, i get this:

 python3 -m unittest tests.db.utils_test.TestRemoveCte.test_issue_1301             
E
======================================================================
ERROR: test_issue_1301 (tests.db.utils_test.TestRemoveCte)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/pazz/projects/alot/tests/db/utils_test.py", line 747, in test_issue_1301
    mail = email.message_from_file(fp)
  File "/usr/lib/python3.6/email/__init__.py", line 54, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "/usr/lib/python3.6/email/parser.py", line 54, in parse
    data = fp.read(8192)
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 16: invalid continuation byte

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

lucc commented 5 years ago

So the test file is broken in some other way as well ... That means I have to revisit this thoroughly.

pazz commented 5 years ago

But somehow I wonder why the email module tries to parse this as UTF8 and not, as indicated by the header, as ASCII (7bit).

pazz commented 5 years ago

The testmail is problematic for an unrelated reason: quoting the wiki:

Since RFC 2822, conforming message header names and values should be ASCII characters; values that contain non-ASCII data should use the MIME encoded-word syntax (RFC 2047) instead of a literal string.

The subject is neither ascii nor encoded-word.

Also, playing around with it a bit:

In [17]: f=open('tests/static/mail/latin-1.eml')

In [18]: f.read()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-18-571e9fb02258> in <module>()
----> 1 f.read()

/usr/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 16: invalid continuation byte

You can of course add errors='ignore' to the call to open, but this just replaces the umlauts with spaces.

But look what works nicely:

In [47]: f=open('tests/static/mail/latin-1.eml','b+r')

In [48]: with open('tests/static/mail/latin-1.eml','b+r') as b:
    ...:     m = email.message_from_bytes(b.read())
    ...:     

In [49]: m.get_payload()
Out[49]: 'Föö Bär Bäz\n'

strangely, what does not work is

email.message_from_file(open('tests/static/mail/latin-1.eml'))
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
...
/usr/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 16: invalid continuation byte

Bottom line: I believe that we should use open('tests/static/mail/latin-1.eml','b+r') when opening email files in tests (and perhaps also in the UI).

pazz commented 5 years ago

@lucc: I can now reproduce this (if you open the message in binary mode, see this branch). If you remove the expected-failure-decorator you can trigger the issue wth

python3 -m unittest tests.db.utils_test.TestRemoveCte.test_issue_1301

I am almost certain that the trailing semi-colon is the problem though, meaning that this message is actually malformed and should not occurr in the wild.

I do not remember why we actually needed utils.remove_cte. It seems troublesome and possibly outdated now that python 3's email module has been overhauled? @dcbaker ?

pazz commented 5 years ago

I am not convinced that the original issue is caused by an unexpected trailing semi-colon. This issue appears multiple times around the web. I have amended your original test case to reflect this in my fork (#1342) of your original PR, @lucc . I am happy with the expected failure (plus test) as it is now but we can of course discuss if it'd be appropriate to be lenient and, as some other projects, interpret "7bit;" as "7bit".

lucc commented 5 years ago

I think it might be reasonable to be more lenient in this case.

I was pondering about how to find other potential problematic mails and came up with this:

cd ~/vcs/alot
mkdir logs
touch empty
for thread in $(notmuch search --output=threads '*'); do
  PYTHONPATH=. \
    python -m alot \
      --debug-level debug \
      --logfile ./logs/$thread.log \
      --config empty \
      search $thread ';' select ';' call 'os._exit(0)'
done

This will open each thread in your notmuch database in an alot instance and save the log to a file. Attention: This will take a long time!! For me it runs at about 110 threads per minute. (Maybe this can be improved upon but I will just keep the computer running overnight I think.) You can then search the log files with grep -vER '^(DEBUG|INFO)' logs and can extract problematic messages from the hits with notmuch seach --output=files thread:000....

lucc commented 5 years ago

Ok it "only" took 2.5 hours for 16k threads.I found 64 threads that throw an exception. These are some of the interesting ones:

AssertionError: b'0'
AttributeError: 'Group' object has no attribute 'local_part'
AttributeError: 'NoneType' object has no attribute 'selectable'
binascii.Error: Invalid base64-encoded string: number of data characters (18745) cannot be 1 more than a multiple of 4
IndexError: string index out of range
LookupError: unknown encoding: unknown-8bit
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 129: ordinal not in range(128)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 195: invalid start byte
ValueError: Unknown Content-Transfer-Encoding: "7bit;"

I will try to extract exemplary messages and send PRs for our new mail corpus.

Whoever is motivated to do the same with their mail store is invited to do so.

pazz commented 5 years ago

Very very nice idea to for bulk testing! looking forward to your new test cases. (but let's release v0.8 before we adress them) :)

lucc commented 5 years ago

I don't mind. We can also address all of these in 0.8.1.

You originally added this issue to 0.8 and I agree that one should fix known issues before releasing a new version but if we postpone the other we could also postpone this issue for 0.8.1. I don't care.

pazz commented 5 years ago

I'm running it right now and am already loving what I see:

grep -ER 'ValueError' logs
logs/thread:000000000000f27b.log:ValueError: Unknown Content-Transfer-Encoding: "quoted-printablee"
logs/thread:000000000000f04c.log:ValueError: Unknown Content-Transfer-Encoding: "quoted-printablessss"
logs/thread:00000000000150c2.log:ValueError: Unknown Content-Transfer-Encoding: "quoted-printabley"
logs/thread:0000000000016149.log:ValueError: Unknown Content-Transfer-Encoding: "quoted-printabledfg"
logs/thread:0000000000015571.log:ValueError: Unknown Content-Transfer-Encoding: "quoted-printablesdf"

All spam mail btw. I sense a pattern :)

pazz commented 5 years ago

I've written a little script to get rid of duplicates as much as possible. See here. For the script and a bunch of unique tracebacks I extracted from my database. Hopefully not all of them will end up being unique issues..

Regarding this issue here, I agree that we could be more lenient, so I propose that we

use substring tests, just like if "7bit" in cte: ... in alot.db.utils.remove_cte, and
update our tests before milestone 0.8.

lucc commented 5 years ago

I would rather interpret the "7bit;" as a mis-implementation of the Content-Transfer-Encoding header which is guided by the Content-Type where a semicolon is used to separate different parts of the header value. Just in this case there is only one field so the "field" after the semicolon is empty.

The examples of spam mail above already demonstrate why a simple substring search might be problematic.

Instead I suggest to call .strip("; ") before comparing.

By "update out tests" do you mean to introduce (failing?) test cases for all the stuff we find when searching our whole mail archives for exceptions?

pazz commented 5 years ago

Quoting Lucas Hoffmann (2018-12-04 20:47:17)

I would rather interpret the "7bit;" as a mis-implementation of the Content-Transfer-Encoding header which is guided by the Content-Type where a semicolon is used to separate different parts of the header value. Just in this case there is only one field so the "field" after the semicolon is empty.

I don't get it

The examples of spam mail above already demonstrate why a simple substring search might be problematic.

actually I think my examples show that a substring check for "quoted-printable" makes sense: after all the following payloads are more likely to contain quoted-printable than 7bit. I was suggesting to use a subset-check for all known types, interpret the first one that matches, and ignore all resulting encoding errors. Plus report in the logs if the CTE value is bad.

Instead I suggest to call .strip("; ") before comparing.

How would that leave the "quoted-printableee" mails?

By "update out tests" do you mean to introduce (failing?) test cases for all the stuff we find when searching our whole mail archives for exceptions?

s/out/our/g sorry. I meant to update the current test for this issue, which succeeds in case a ValueError is raised. Instead, if my above proposal is interpreted, this test should fail :)

pazz commented 5 years ago

the first commit also "fixes" #1291 as discussed in that issue, by adding the 'ignore' error handler

pazz commented 5 years ago

(merged)

pazz / alot

Exception: Unknown Content-Transfer-Encoding: "7bit;" #1301