Getting FormatException: Failed to parse message headers due to a buffering bug

olivier2 commented 7 months ago

Describe the bug I get a "Failed to parse message headers." exception when parsing my ~11GB gmail mbox. This happens reliably, always on the same email. When the exception is raised, MimeParser.MboxMarker contains the string 'From : Subject : Date : ', from the middle of the email's DKIM-Signature header (see attachements). Mimekit has no problem parsing the offending email when it's extracted to its own file. So far, extracting 1MB of data around the email in an attempt to reproduce the problem has not been successful. A previous complete gmail mbox (now lost), that contained the same email, also had no parsing issue. I'm guessing it's related to the ReadAheadSize somehow?

Platform:

OS: Windows 10
.NET Runtime: CoreCLR
.NET Framework: .NET 8.0
MimeKit Version: 4.3.0

To Reproduce

Run program with offending mbox file

Result

Failed to parse message headers.
   at MimeKit.MimeParser.ParseMessage(Byte* inbuf, CancellationToken cancellationToken) in C:\MimeKit\MimeKit\MimeParser.cs:line 1923
   at MimeKit.MimeParser.ParseMessage(CancellationToken cancellationToken) in C:\MimeKit\MimeParser.cs:line 2016

The location of the error state origin is already lost when the exception is raised, but in this particular case, it comes from line 965 in StepHeaders(...):

if (headers.Count == 0) {
    if (state == MimeParserState.MessageHeaders) {
        // ignore From-lines that might appear at the start of a message
        if (toplevel && (length < 5 || !IsMboxMarker (start, true))) {
            // not a From-line...
            inputIndex = (int) (start - inbuf);
  /* here -> */     state = MimeParserState.Error;
            headerIndex = 0;
            return false;
        }

Expected behavior No exception

Code Snippets

static void Main(string[] args)
{
    using(var fileStream = File.OpenRead(@"All mail Including Spam and Trash-002.mbox"))
    {
        var parser = new MimeParser(stream, MimeFormat.Mbox);
        while (!parser.IsEndOfStream)
        {
            parser.ParseMessage();
        }  
    }
}

Additional context

exception_complete_anon.txt anonymized complete offending email
exception_MboxMarkerOffset_anon.txt anonymized offending email starting from MboxMarkerOffset

Workaround If I modify IsMboxMarker() to allow 'From ', but ignore 'From :', the exception is not raised. However, I'm not familiar with RFC standards or the performance impact to suggest this as a permanent fix (also, my test case to prove the fix includes a very private ~11GB file):

... *inptr++ == (byte) 'm' && *inptr++ == (byte) ' ' && *inptr != (byte) ':'

jstedfast commented 7 months ago

The change to IsMboxMarker is wallpapering over the real problem. Somehow buffering state is getting messed up such that it is skipping over the real mbox marker and a ton of headers and getting confused, thinking that the mbox marker is in the middle of the DKIM header value.

This could be caused by a few things:

The MimeParser is set to RespectContentLength and the previous message has a Content-Length header value that specifies a length that is longer than the real contents
The MimeParser is set to treat the stream as persistent and your code is consuming the message inside the loop that is parsing the messages, thereby throwing off the stream's seek position out from under the MimeParser.
There's a buffer boundary issue bug in the parser.

Have you tried the ExperimentalMimeParser?

At the very least it should be faster.

olivier2 commented 7 months ago

Thanks for the quick feedback!

The MimeParser is set to RespectContentLength and the previous message has a Content-Length header value that specifies a length that is longer than the real contents

I'm using defaults options, so RespectContentLength is set to false. Also, I tried recreating the email context with several messages before and after, but that didn't reproduce the problem so far.

The MimeParser is set to treat the stream as persistent and your code is consuming the message inside the loop that is parsing the messages, thereby throwing off the stream's seek position out from under the MimeParser.

Persistent was the default (false) in the test (see posted code snippet) and I'm not messing with the stream.

There's a buffer boundary issue bug in the parser.

That's my assumption as well, but I'm not sure how to tweak the code to force a reproduction of the issue.

Have you tried the ExperimentalMimeParser?

I just did and it runs flawlessly!

jstedfast commented 7 months ago

I'm using defaults options, so RespectContentLength is set to false. Also, I tried recreating the email context with several messages before and after, but that didn't reproduce the problem so far.

Yea, if it's a buffering issue (and based on the fact that you are using the defaults and not doing anything with the stream, it sounds like it is), it'll be difficult to construct an mbox that will reproduce the issue because it'll likely be a corner case where some token or another needs to cross a read boundary at some stage in parsing, causing things to get out of whack.

IIRC, the ExperimentalMimeParser is a little more robust in that sense (if my memory is correct, it's been a while) because it tends to consume everything as it goes whereas the old MimeParser implementation has a few areas where it will abort an attempt to consume something if it can't get the whole token, call ReadAhead to refill the buffer (which moves any remaining buffer to the "prebuffer") and then tries again. This is most likely the cause of the issue, but I obviously can't be 100% certain (and there are several places that do that, so narrowing that down would be a pain).

I think the most likely culprit would be the StepHeaders() (or related) logic in MimeParser.cs. That code has gotten a bit scary, even for me, which is one of the reasons it got rewritten in ExperimentalMimeParser.

jstedfast commented 7 months ago

FWIW, my plan is to make ExperimentalMimeParser the default in v5.0 and rename the current MimeParser to LegacyMimeParser and more-or-less [Obsolete] it.

I meant to do that for v4.0 but forgot, haha.

jstedfast / MimeKit

Getting FormatException: Failed to parse message headers due to a buffering bug #991