jstedfast / MimeKit

A .NET MIME creation and parser library with support for S/MIME, PGP, DKIM, TNEF and Unix mbox spools.
http://www.mimekit.net
MIT License

Failed to extract TextBody/HtmlBody for a somewhat malformed Multipart/Alternative mail #692

Closed rboen closed 3 years ago

rboen commented 3 years ago

Describe the bug

I tried to get the HTML or text body of the email below, but both MimeMessage.TextBody and MimeMessage.HtmlBody return null.

After some digging into the mail structure, I noticed that the boundary Boundary_(ID_EFTfpjUnoKOKbUchO46n4w), declared in the first Content-Type header, appears nowhere in the message body. A second Content-Type: multipart/alternative header declares a different boundary, and that one is present.

The parser seems to use the first (invalid) Content-Type header and therefore never sees the internal structure.

I assume this is an incorrect mail format, but it does occur in the wild (I suspect the virus protection messes up the mail structure).

A mail client such as Outlook is somewhat forgiving and shows the content nonetheless.

I tried to create a workaround but I quickly discovered that some helper classes/methods are declared internal. So I am stuck.

Thank you very much for looking into this.

Expected behavior

MimeMessage.HtmlBody / MimeMessage.TextBody should return the body content of the mail.

Screenshots

From: John Doe <john@foobar.net.il>
Date: Thu, 29 Jul 2021 07:46:01 +0300
Subject: RE: subject
Message-Id: <017e01d78434$a18012d0$e4803870$@foobar.net.il>
To: info@someserver.de
Return-Path: <info@someserver>
Resent-From: info@someserver
Subject: RE: subject
In-reply-to: 
Message-id: <017e01d78434$a18012d0$e4803870$@foobar.il>
MIME-version: 1.0
X-Mailer: Microsoft Outlook 16.0
Thread-index: Add41XPhHzNfrvGkRJ2TYsozzI9LOQLXuPZA
X-Antivirus: Avast (VPS 210729-0, 7/29/2021), Outbound message
X-Antivirus-Status: Clean
References:
X-EOPAttributedMessage: 0
X-EOPTenantAttributedMessage: 44d81603-6a98-429d-8ab9-bb947b669639:0
X-MS-PublicTrafficType: Email
X-MS-Office365-Filtering-Correlation-Id: eea4ffa7-7fd9-42d0-4e7b-08d9524bc48c
X-MS-TrafficTypeDiagnostic: FRYP281MB0583:
X-LD-Processed: 44d81603-6a98-429d-8ab9-bb947b669639,ExtFwd
X-MS-Exchange-Transport-Forked: True
X-Microsoft-Antispam-PRVS: =?utf-8?q?=3CFRYP281MB058343080804D92271C3A1D8E6EB9=40FRYP281MB0583=2EDEUP281?=
 =?utf-8?q?=2EPROD=2EOUTLOOK=2ECOM=3E?=
X-MS-Oob-TLC-OOBClassifiers: OLM:6108;
X-MS-Exchange-SenderADCheck: 0
X-MS-Exchange-AntiSpam-Relay: 0
X-Microsoft-Antispam: BCL:0;
X-Microsoft-Antispam-Message-Info: =?utf-8?q?JOmZ=2FsCm2X+QgnzHElzIVUYuE71YCZKgZrH90cQGqenMs1zwCDY0k4V7mHNii?=
 =?utf-8?q?tAMIfrCv4eluiE1mqaS82GDEkhlGuFB18RFNecBMs6JyqJ+nofe4e19AFCnm3?=
 =?utf-8?q?PwruBn3VdUmpXB0vnlgu4u6wy1jpcrHz8H64gTSIhDjWquqem9+SWjqZ0QvCO?=
 =?utf-8?q?cOiM9s4jBLn4TKw1SYlb97MpihwAYuD6rIDdH3jQG5mtRxj9YNCgNFTo6KumD?=
 =?utf-8?q?13alldCfqkaJ25FEnNdlvDwal7pUz7CjA9AIiyD0SYjs4L9tpC48C=2F7zW+9mH?=
 =?utf-8?q?s+MJUyi5gYGdiueE1T8KKW73lLDAUmXY4vR+WP0c1KGmhxOVEahyORh29H6vp?=
 =?utf-8?q?i28USMBVQ5h0QdigEzrjI0goTORAOqA8RE6VNcH9F+ZhVNH4vmTsLT4FtvSbf?=
 =?utf-8?q?mlFCDxqPv2v9aiKwaf2DzOzp3+uoKxYiKfA3IKTKw7brVDz2WAAgL+HodRSGb?=
 =?utf-8?q?ULwDAf8Zb0GBtBWvZwclibsS8fzOACTe8+IXdW8X58WJ5fbeUIOXDzS9lbpJM?=
 =?utf-8?q?g1zIvb+9eBeSdeHKtm5P45giWOXO=2F41p=2FpfAjpATtmV9wp0NvdLfCMFP6=2FXJj?=
 =?utf-8?q?ZSzYu19GUJOP5DusRK0p7ejlBjdYmZeYbP6vD6glAzjTQGNquwgJBAwVr+Hek?=
 =?utf-8?q?DdYT84E4qnaU4WQuCe7jc47+vng9CmQpi?=
X-Forefront-Antispam-Report: =?utf-8?q?CIP=3A82=2E102=2E144=2E78=3BCTRY=3AIL=3BLANG=3Aen=3BSCL=3A1=3BSRV=3A=3BIPV=3ANLI=3BSFV=3ANSPM?=
 =?utf-8?q?=3BH=3Amtaout66=2E012=2Enet=2Eil=3BPTR=3Amtaout66=2E012=2Enet=2Eil=3BCAT=3ANONE=3BSFS=3A=28?=
 =?utf-8?q?396003=29=28136003=29=2839830400003=29=28376002=29=28346002=29=2853546011=29=28863620?=
 =?utf-8?q?01=29=282616005=29=2868406010=29=28956004=29=2870586007=29=285660300002=29=287596003=29?=
 =?utf-8?q?=282906002=29=288676002=29=28498600001=29=28356005=29=2844736005=29=28166002=29=2834206?=
 =?utf-8?q?002=29=2836756003=29=2866574015=29=2826005=29=28316002=29=28966005=29=2883380400001=29=28?=
 =?utf-8?q?1420700001=29=28336012=29=3BDIR=3AOUT=3BSFP=3A1102=3B?=
X-ExternalRecipientOutboundConnectors: 44d81603-6a98-429d-8ab9-bb947b669639
X-MS-Exchange-ForwardingLoop: info@gomolzig.de;44d81603-6a98-429d-8ab9-bb947b669639
X-OriginatorOrg: foobar.de
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 29 Jul 2021 04:46:02.1598 (UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: eea4ffa7-7fd9-42d0-4e7b-08d9524bc48c
X-MS-Exchange-CrossTenant-Id: 44d81603-6a98-429d-8ab9-bb947b669639
X-MS-Exchange-CrossTenant-AuthSource: FR2DEU01FT016.eop-deu01.prod.protection.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Anonymous
X-MS-Exchange-CrossTenant-FromEntityHeader: Internet
X-MS-Exchange-Transport-CrossTenantHeadersStamped: FRYP281MB0583
Content-type: multipart/alternative;
    boundary="Boundary_(ID_EFTfpjUnoKOKbUchO46n4w)"
Content-language: en-us
X-Spam-Score: 0.000
Content-Type: multipart/alternative; boundary="=-A94aVRFqe/wS4+SXRzIGBg=="

--=-A94aVRFqe/wS4+SXRzIGBg==
Content-Type: text/plain; charset=utf-8

Dear Sir,

xxx

-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

--=-A94aVRFqe/wS4+SXRzIGBg==
Content-Type: text/html; charset=utf-8

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif;}
span.EmailStyle19
    {mso-style-type:personal-reply;
    font-family:"Calibri",sans-serif;
    color:windowtext;}
.MsoChpDefault
    {mso-style-type:export-only;
    font-size:10.0pt;}
@page WordSection1
    {size:8.5in 11.0in;
    margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
    {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link="#0563C1" vlink="#954F72" style='word-wrap:break-word'><div class=WordSection1><p class=MsoNormal><span style='font-size:12.0pt'>Dear Sir,<o:p></o:p></span></p>
<table style="border-top: 1px solid #D3D4DE;">
    <tr>
        <td style="width: 55px; padding-top: 13px;"><a href="https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=emailclient&utm_term=icon" target="_blank"><img src="https://ipmcdn.avast.com/images/icons/icon-envelope-tick-round-orange-animated-no-repeat-v1.gif" alt="" width="46" height="29" style="width: 46px; height: 29px;" /></a></td>
        <td style="width: 470px; padding-top: 12px; color: #41424e; font-size: 13px; font-family: Arial, Helvetica, sans-serif; line-height: 18px;">Virus-free. <a href="https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=emailclient&utm_term=link" target="_blank" style="color: #4453ea;">www.avast.com</a>
        </td>
    </tr>
</table><a href="#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2" width="1" height="1"> </a></div></body></html>

--=-A94aVRFqe/wS4+SXRzIGBg==--


jstedfast commented 3 years ago

I'm not sure how to "fix" this...

What were you planning to do as a work-around?

rboen commented 3 years ago

My first thought was this: when the HtmlBody/TextBody property is null and there are no BodyParts, inspect the content of MimeMessage.Body.Preamble. If that content starts or ends with a boundary-like string, check whether that boundary matches one of the Content-Type headers. If it does, take the preamble content, try to create a multipart/alternative part from it, and use the MultipartAlternative.GetTextBody function. It felt rather "hacky", though. I now see that GetTextBody is public and not internal as I had thought, so there may be a way to build this workaround.
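
The boundary check described above can be sketched without touching any MimeKit internals. The following is a minimal illustration in plain C#; the `BoundaryPicker` class and its `TryPickUsableBoundary` method are hypothetical names invented for this sketch and are not part of MimeKit. It assumes the raw values of all Content-Type headers and the body text are already available as strings, and it only handles a simple quoted or unquoted `boundary` parameter.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of MimeKit.
public static class BoundaryPicker
{
    // Matches boundary="value" or boundary=value.
    // Simplified: RFC 2231 parameter encodings are not handled.
    static readonly Regex BoundaryParam = new Regex(
        "boundary\\s*=\\s*(?:\"(?<b>[^\"]+)\"|(?<b>[^;\\s]+))",
        RegexOptions.IgnoreCase);

    // Finds the first declared boundary that is actually used as a
    // delimiter line ("--" + boundary) somewhere in the body text.
    public static bool TryPickUsableBoundary(
        IEnumerable<string> contentTypeValues, string body, out string boundary)
    {
        // Normalize line endings and prepend a newline so a delimiter
        // on the very first line of the body is found as well.
        var text = "\n" + body.Replace("\r\n", "\n");

        foreach (var value in contentTypeValues)
        {
            var match = BoundaryParam.Match(value);
            if (!match.Success)
                continue;

            var candidate = match.Groups["b"].Value;

            // Simplification: this is a prefix match, so a declared boundary
            // that is a prefix of the real one would also be accepted.
            if (text.Contains("\n--" + candidate))
            {
                boundary = candidate;
                return true;
            }
        }

        boundary = string.Empty;
        return false;
    }
}
```

For the message in this issue, passing the two Content-Type values in header order would skip Boundary_(ID_EFTfpjUnoKOKbUchO46n4w), which never occurs in the body, and select =-A94aVRFqe/wS4+SXRzIGBg==. Unlike always taking the first or the last header, this does not depend on where a post-processor happened to put its header.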

A second idea is to change GetContentType in the MimeParser class. Instead of searching for the Content-Type header from the start of the header list, searching in reverse order might make sense. This is based on the assumption that post-processing modules such as antivirus scanners (which have a higher risk of "fiddling" with the message) append their header values. At least in this example, the last Content-Type header is the "right one".

For example:

ContentType GetContentType (ContentType parent)
        {
            for (int i = headers.Count-1; i >= 0; i--) {
                if (!headers[i].Field.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))
                    continue;
...

Just a guess: I can't say whether multiple Content-Type headers are allowed, whether they occur often, or whether their order matters...

jstedfast commented 3 years ago

Checking the Preamble might work as a workaround in your case...

As far as Content-Type headers go, there should only be 1. In your case, it was likely the antivirus software that generated a new boundary and, instead of replacing the old Content-Type header, just appended a new one. Oof.

I'm not sure if using the last Content-Type header is necessarily any more likely to work than the first when there are multiple Content-Type headers (in cases other than yours, I mean). I would need more data.

rboen commented 3 years ago

Unfortunately, I cannot provide more data. Here is the workaround for this kind of malformed MIME message.

public string HtmlBody
        {
            get
            {
                var htmlBody = _decodedMimeMessage.HtmlBody;

                // Workaround for malformed e-mails with multiple Content-Type headers of type multipart/alternative
                // where only the boundary given in the last Content-Type header is valid.
                try
                {
                    if (htmlBody == null && _decodedMimeMessage.BodyParts.FirstOrDefault() == null)
                    {
                        var lastContentType = _decodedMimeMessage.Body.Headers.LastOrDefault(h =>
                            h.Field.Equals("Content-Type", StringComparison.OrdinalIgnoreCase));
                        if (lastContentType != null && _decodedMimeMessage.Body is MultipartAlternative multipartAlternative)
                        {
                            var content = multipartAlternative.Preamble;
                            if (ContentType.TryParse(new ParserOptions(), lastContentType.RawValue, 0,
                                lastContentType.RawValue.Length, out var contentType))
                            {
                                var encoding = contentType.CharsetEncoding ?? Encoding.UTF8;
                                using (var bufferStream =
                                    new MemoryStream(encoding.GetBytes(content)))
                                {
                                    var mimeEntity = MimeEntity.Load(contentType, bufferStream);
                                    htmlBody = (mimeEntity as MultipartAlternative)?.HtmlBody;
                                }
                            }
                        }
                    }
                }
                catch
                {
                    // ignore
                }

                return htmlBody;
            }
        }

jstedfast commented 3 years ago

You might find this interesting: https://datatracker.ietf.org/doc/html/rfc7103#section-7.5

jstedfast commented 3 years ago

Unfortunately, they do not address what to do with multiple Content-Type headers.

jstedfast commented 3 years ago

I tried searching for multiple Content-Type headers wrt Avast. All I could find so far are these posts:

https://forum.avast.com/index.php?topic=42013.0
https://forum.avast.com/index.php?topic=57720.0
https://forum.avast.com/index.php?topic=64839.0
https://forum.avast.com/index.php?topic=68497.0

They all seem to indicate that Avast emits some sort of error for messages that it finds containing multiple Content-Type headers but nothing about Avast adding a second Content-Type header.

That said, if Avast decided this was Clean (as per the header), then that suggests it probably did add the second Content-Type header? Maybe?

Do you have any control over the Avast settings? Can you turn off any options that tell it to modify the message body?

rboen commented 3 years ago

Sorry, I cannot provide more information. We implemented a kind of collaboration tool in which MimeKit is used to analyze incoming mails from a multitude of different senders/organizations. The email above was part of a support call and was sent from outside our organization. Therefore we have no control over the Avast settings, nor do we have more emails with multiple Content-Type headers. But we will keep our eyes open. For now, the workaround solves the issue for this particular email.

jstedfast commented 3 years ago

@rboen Okay, thanks. I'll close this for now since you have a work-around that works, but if you find more issues like this, do feel free to reopen this or file a new issue.