email library could "recover" from bad mime boundary like (some?) email clients do

3b9747ee-ec17-4fc8-9ae2-99f2a270fd20 commented 8 years ago

BPO	27010
Nosy	@warsaw, @bitdancer, @soltysh, @adepasquale
Files	mail mail.json issue27010-notuniqueboundary.patch: NotUniqueBoundaryInMultipartDefect

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-feature', '3.7', 'expert-email'] title = 'email library could "recover" from bad mime boundary like (some?) email clients do' updated_at = user = 'https://bugs.python.org/FedeleMantuano' ``` bugs.python.org fields: ```python activity = actor = 'Fedele Mantuano' assignee = 'none' closed = False closed_date = None closer = None components = ['email'] creation = creator = 'Fedele Mantuano' dependencies = [] files = ['42830', '42831', '43016'] hgrepos = [] issue_num = 27010 keywords = ['patch'] message_count = 17.0 messages = ['265413', '265416', '265417', '265419', '265420', '266372', '266382', '266438', '266440', '268558', '268909', '271882', '274878', '275012', '277114', '277136', '277140'] nosy_count = 5.0 nosy_names = ['barry', 'r.david.murray', 'maciej.szulik', 'Fedele Mantuano', 'adepasquale'] pr_nums = [] priority = 'normal' resolution = None stage = 'needs patch' status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue27010' versions = ['Python 3.7'] ```

3b9747ee-ec17-4fc8-9ae2-99f2a270fd20 commented 8 years ago

We are receiving a lot of mail with attachments not detected from email library. I also tested Tika parser and it have the same issue:

mail: http://pastebin.com/kSEJnzSa mail parsed: http://pastebin.com/7HaVPcTq

I can read only these content types: multipart/mixed multipart/alternative text/plain text/html

there isn't Content-Type: application/zip.

With a normal mail client I can read the attachment.

Where is the issue?

bitdancer commented 8 years ago

When you say the attachment is not detected, what do you mean? What call are you making to the email library that you are expecting to see the attachment in that it is not in? Your 'parsed' pastebin isn't something the library produces, so I assume that's the Tika output.

(By the way, pastbin links are problematic in tracker issues, since they may expire. Please paste directly in to the issue, or attach files to the issue.)

Oh, wait. Looking at the email I think I see the problem:

---------------------------------------- \</BODY> \</HTML>

--51a14337d8625bb8ce4a5b1667f--

--51a14337d8625bb8ce4a5b1667f \<attachment content> ----------------------------------------

That line that ends with '--' signals the end of the last MIME part in the message. So by RFC standards the remainder of the message is part of the 'epilogue'. If you check msg.epilogue I think you'll find that it contains the raw text of the remainder of the message.

It is interesting that your email client treats it as an actual attachment. It would be possible to have the email library recognize such out of place mime dividers and register it as an error. I would review a patch for that if someone wants to propose one.

--David

3b9747ee-ec17-4fc8-9ae2-99f2a270fd20 commented 8 years ago

Hi David,

I use email library to detect malicious attachments, so:

message = email.message_from_file(open('mail'))
for i in message.walk():
   do somethings

Not detected means that in for loop I can't see these attachments.

The same problem there is with tika parser (now I attached file).

I think that all automatics tools that using email library can't extract and post analyze these mails.

3b9747ee-ec17-4fc8-9ae2-99f2a270fd20 commented 8 years ago

I test your hypothesis:

for i in message.walk():
    print i.get_content_type()
    print("#################################################################")
    print i.epilogue

multipart/mixed #################################################################

--31a14337d8625bb8ce4a5b1667f Content-Type: application/zip; name="n.41056 0002 02 107413 del 11.05.2016.zip" Content-Transfer-Encoding: base64 Content-ID: \008601d1ac89$01f7f760$0d00a8c0@D25LND1N\

UEsDBBQAAAAIALNQrEi/ST/WbSsBAABAAgAtAAAAbi40MTA0NiAwMDA0IDAyIDEwNzIwMyBk ZWwgMTEuMDUuMjAxNi5wZGYuZXhl7FNnjExRGL1vDAZjZ1Zd0YbookeUIIgRYocdjBq9r766 GG2ZeJ7RrZroJXrvYtUhIUqIXhLEYMJisJJhnPPePjt6+CeZLzvn3nfv+c733XPvOjrNFVmE

And for me it's right.

bitdancer commented 8 years ago

I'm going to change the title of this and see if anyone wants to propose a patch. It'll probably end up getting closed as not a bug if no one does for a while, though.

dc23a991-99e7-40f9-8b7e-062908e583d9 commented 8 years ago

Isn't this covered by the following test case?

Lib/test/test_email/test_defect_handling.py:18

bitdancer commented 8 years ago

Yes. The current behavior is not a bug, the question is, do we want to deal with that XXX comment in the test by detecting the duplicate and reconizing the "extra" mime part? The defect detection would remain.

dc23a991-99e7-40f9-8b7e-062908e583d9 commented 8 years ago

How about the following patch? If it's different from what you had in mind, please let me know.

bitdancer commented 8 years ago

Thanks for the patch. I'll take a look at this during the PyCon sprints.

dc23a991-99e7-40f9-8b7e-062908e583d9 commented 8 years ago

Hello, did you have a chance to look at my patch?

bitdancer commented 8 years ago

Unfortunatley no, things were too busy. I'm hoping to have time to review email patches in the not too distant future, though.

dc23a991-99e7-40f9-8b7e-062908e583d9 commented 8 years ago

Ok thanks, please kindly let me know.

bitdancer commented 8 years ago

Andrea: yes, your patch is different from what I had in mind. The idea would be to recognize the "nested part with duplicate boundary", register the new defect, but produce a Message object with a structure that looked like this:

multipart/mixed multipart/alternative text/plain text/html image/gif

What your patch produces is:

multipart/mixed multipart/alternative text/plain text/html

which is not recognizing the nested multipart or the final MIME part (which is the OPs goal).

In principle it should be possible to parse the nesting despite the bad boundary (other MIME parsers do it, as documented here), but I'm not sure how hard it will be to modify Feedparser to do it. Looking at the code it seems like it shouldn't be that hard to make it work, but I haven't dug deeply enough to be sure.

dc23a991-99e7-40f9-8b7e-062908e583d9 commented 8 years ago

Yes you are right, my patch produces an RFC2046-compliant output and also registers the "not-unique-boundary" defect.

dc23a991-99e7-40f9-8b7e-062908e583d9 commented 8 years ago

To provide additional context, Microsoft has patched his Outlook client to be RFC2046-compliant. More details below:

http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-3366 https://technet.microsoft.com/library/security/MS16-107 http://www.certego.net/en/news/badepilogue-the-perfect-evasion/

bitdancer commented 8 years ago

Hmm. Thanks for the links. That[] implies that "fixing" this would be *introducing a security vulnerability...unless one was trying to implement a virus/spam scanner in Python. So perhaps this should be controlled by a policy switch.

[*] The third of those links is the most useful one to read.

3b9747ee-ec17-4fc8-9ae2-99f2a270fd20 commented 8 years ago

I developed a library that can get that malformed email part, but to get it I used the not correct type of defect "StartBoundaryNotFoundDefect" (https://github.com/SpamScope/mail-parser/blob/develop/mailparser/__init__.py#L44). With this patch, I could get malformed email part with the correct defect.

python / cpython

email library could "recover" from bad mime boundary like (some?) email clients do #71197