Open 3b9747ee-ec17-4fc8-9ae2-99f2a270fd20 opened 8 years ago
We are receiving a lot of mail with attachments that the email library does not detect. I also tested the Tika parser, and it has the same issue:
mail: http://pastebin.com/kSEJnzSa mail parsed: http://pastebin.com/7HaVPcTq
I can read only these content types: multipart/mixed, multipart/alternative, text/plain, text/html.
There is no Content-Type: application/zip part.
With a normal mail client I can read the attachment.
Where is the issue?
When you say the attachment is not detected, what do you mean? What call are you making to the email library that you are expecting to see the attachment in that it is not in? Your 'parsed' pastebin isn't something the library produces, so I assume that's the Tika output.
(By the way, pastebin links are problematic in tracker issues, since they may expire. Please paste directly into the issue, or attach files to the issue.)
Oh, wait. Looking at the email I think I see the problem:
----------------------------------------
\</BODY>
\</HTML>

--51a14337d8625bb8ce4a5b1667f--

--51a14337d8625bb8ce4a5b1667f
\<attachment content>
----------------------------------------
That line that ends with '--' signals the end of the last MIME part in the message. So by RFC standards the remainder of the message is part of the 'epilogue'. If you check msg.epilogue I think you'll find that it contains the raw text of the remainder of the message.
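A minimal sketch of that behavior, assuming Python 3 and a made-up message (the `BOUND` boundary, the file name, and the payload are illustrative and not taken from the reported mail). The part that follows the closing delimiter is not visited by `walk()`, but its raw text is available in `msg.epilogue`:

```python
import email

# Hypothetical message: a closing delimiter ("--BOUND--") followed by
# another part that reuses the same boundary.
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    "\n"
    "--BOUND\n"
    "Content-Type: text/plain\n"
    "\n"
    "visible body\n"
    "--BOUND--\n"
    "--BOUND\n"
    'Content-Type: application/zip; name="a.zip"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "UEsDBBQAAAAIAA==\n"
    "--BOUND--\n"
)

msg = email.message_from_string(RAW)

# Only the parts before the closing delimiter are parsed as MIME parts.
types = [part.get_content_type() for part in msg.walk()]
print(types)

# Everything after the closing delimiter is kept verbatim as the epilogue.
print(msg.epilogue)
```

Here the zip part never shows up in `types`; it only appears as text inside the epilogue of the outer multipart.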
It is interesting that your email client treats it as an actual attachment. It would be possible to have the email library recognize such out of place mime dividers and register it as an error. I would review a patch for that if someone wants to propose one.
--David
Hi David,
I use the email library to detect malicious attachments, like this:
import email

with open('mail') as f:
    message = email.message_from_file(f)
for i in message.walk():
    pass  # do something with each part
"Not detected" means that these attachments never show up in the for loop.
The same problem occurs with the Tika parser (I have now attached the file).
I think that no automated tool built on the email library can extract and analyze these mails.
I tested your hypothesis:
for i in message.walk():
    print(i.get_content_type())
    print("#################################################################")
    print(i.epilogue)
multipart/mixed
#################################################################
--31a14337d8625bb8ce4a5b1667f
Content-Type: application/zip; name="n.41056 0002 02 107413 del 11.05.2016.zip"
Content-Transfer-Encoding: base64
Content-ID: \008601d1ac89$01f7f760$0d00a8c0@D25LND1N\

UEsDBBQAAAAIALNQrEi/ST/WbSsBAABAAgAtAAAAbi40MTA0NiAwMDA0IDAyIDEwNzIwMyBk
ZWwgMTEuMDUuMjAxNi5wZGYuZXhl7FNnjExRGL1vDAZjZ1Zd0YbookeUIIgRYocdjBq9r766
GG2ZeJ7RrZroJXrvYtUhIUqIXhLEYMJisJJhnPPePjt6+CeZLzvn3nfv+c733XPvOjrNFVmE
So your hypothesis looks right to me.
I'm going to change the title of this and see if anyone wants to propose a patch. It'll probably end up getting closed as not a bug if no one does for a while, though.
Isn't this covered by the following test case?
Yes. The current behavior is not a bug. The question is, do we want to deal with that XXX comment in the test by detecting the duplicate and recognizing the "extra" mime part? The defect detection would remain.
How about the following patch? If it's different from what you had in mind, please let me know.
Thanks for the patch. I'll take a look at this during the PyCon sprints.
Hello, did you have a chance to look at my patch?
Unfortunately no, things were too busy. I'm hoping to have time to review email patches in the not too distant future, though.
Ok thanks, please kindly let me know.
Andrea: yes, your patch is different from what I had in mind. The idea would be to recognize the "nested part with duplicate boundary", register the new defect, but produce a Message object with a structure that looked like this:
multipart/mixed
    multipart/alternative
        text/plain
        text/html
    image/gif
What your patch produces is:
multipart/mixed
    multipart/alternative
    text/plain
    text/html
which is not recognizing the nested multipart or the final MIME part (which is the OP's goal).
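For anyone comparing structures like these, here is a small sketch that prints a message's MIME tree with `email.iterators._structure`, the debugging helper shipped with the email package. The message below is a well-formed stand-in built with the `email.mime` classes (its contents are made up); it is not the malformed test message and says nothing about what the patch produces:

```python
import io
from email.iterators import _structure
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Inner multipart/alternative with a plain-text and an HTML rendering.
alternative = MIMEMultipart("alternative")
alternative.attach(MIMEText("hello", "plain"))
alternative.attach(MIMEText("<p>hello</p>", "html"))

# Outer multipart/mixed holding the alternative part plus an image part.
outer = MIMEMultipart("mixed")
outer.attach(alternative)
outer.attach(MIMEImage(b"GIF89a", _subtype="gif"))  # placeholder image bytes

# _structure writes one content type per line, indented by nesting depth.
buf = io.StringIO()
_structure(outer, buf)
tree = buf.getvalue()
print(tree, end="")
```

Running the same helper on a parsed message (for example the result of `email.message_from_file`) makes it easy to see which parts the parser recognized and how they are nested.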
In principle it should be possible to parse the nesting despite the bad boundary (other MIME parsers do it, as documented here), but I'm not sure how hard it will be to modify Feedparser to do it. Looking at the code it seems like it shouldn't be that hard to make it work, but I haven't dug deeply enough to be sure.
Yes you are right, my patch produces an RFC2046-compliant output and also registers the "not-unique-boundary" defect.
To provide additional context, Microsoft has patched its Outlook client to be RFC2046-compliant. More details below:
http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-3366
https://technet.microsoft.com/library/security/MS16-107
http://www.certego.net/en/news/badepilogue-the-perfect-evasion/
Hmm. Thanks for the links. That[*] implies that "fixing" this would be *introducing* a security vulnerability... unless one was trying to implement a virus/spam scanner in Python. So perhaps this should be controlled by a policy switch.
[*] The third of those links is the most useful one to read.
I developed a library that can extract that malformed email part, but to do so I had to rely on the wrong defect type, "StartBoundaryNotFoundDefect" (https://github.com/SpamScope/mail-parser/blob/develop/mailparser/__init__.py#L44). With this patch, I could detect the malformed email part through the correct defect.
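One way a scanner could recover such a part today, without any change to the email library, is to re-parse the epilogue. This is only a sketch under stated assumptions: `recover_epilogue_parts` is a hypothetical helper written for this illustration, it is not how mail-parser or the proposed patch works, and the sample message is made up. It wraps the epilogue in a synthetic multipart header that reuses the original boundary and parses it a second time:

```python
import email


def recover_epilogue_parts(msg):
    """Return leaf MIME parts found in the epilogue of a multipart message.

    Hypothetical helper: returns an empty list when nothing can be recovered.
    """
    if not msg.is_multipart():
        return []
    boundary = msg.get_boundary()
    epilogue = msg.epilogue
    if not boundary or not epilogue or ("--" + boundary) not in epilogue:
        return []
    # Re-parse the epilogue as the body of a synthetic multipart message
    # that declares the same boundary as the original.
    wrapper = 'Content-Type: multipart/mixed; boundary="{}"\n\n{}'.format(
        boundary, epilogue
    )
    reparsed = email.message_from_string(wrapper)
    if not reparsed.is_multipart():
        return []
    return [part for part in reparsed.walk() if not part.is_multipart()]


# Made-up sample: a zip part placed after the closing delimiter.
SAMPLE = (
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    "\n"
    "--BOUND\n"
    "Content-Type: text/plain\n"
    "\n"
    "visible body\n"
    "--BOUND--\n"
    "--BOUND\n"
    'Content-Type: application/zip; name="a.zip"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "UEsDBBQAAAAIAA==\n"
    "--BOUND--\n"
)

hidden = recover_epilogue_parts(email.message_from_string(SAMPLE))
for part in hidden:
    print(part.get_content_type(), part.get_filename())
```

Because this deliberately goes beyond what RFC 2046 asks of a parser, it only makes sense for tools whose job is to see what a lenient mail client would see, which is the scanner use case discussed above.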
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', '3.7', 'expert-email']
title = 'email library could "recover" from bad mime boundary like (some?) email clients do'
updated_at =
user = 'https://bugs.python.org/FedeleMantuano'
```
bugs.python.org fields:
```python
activity =
actor = 'Fedele Mantuano'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['email']
creation =
creator = 'Fedele Mantuano'
dependencies = []
files = ['42830', '42831', '43016']
hgrepos = []
issue_num = 27010
keywords = ['patch']
message_count = 17.0
messages = ['265413', '265416', '265417', '265419', '265420', '266372', '266382', '266438', '266440', '268558', '268909', '271882', '274878', '275012', '277114', '277136', '277140']
nosy_count = 5.0
nosy_names = ['barry', 'r.david.murray', 'maciej.szulik', 'Fedele Mantuano', 'adepasquale']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue27010'
versions = ['Python 3.7']
```