Open olivier2 opened 7 months ago
The change to IsMboxMarker is wallpapering over the real problem. Somehow buffering state is getting messed up such that it is skipping over the real mbox marker and a ton of headers and getting confused, thinking that the mbox marker is in the middle of the DKIM header value.
This could be caused by a few things:
Have you tried the ExperimentalMimeParser?
At the very least it should be faster.
Thanks for the quick feedback!
- The MimeParser is set to RespectContentLength and the previous message has a Content-Length header value that specifies a length that is longer than the real contents
I'm using defaults options, so RespectContentLength is set to false. Also, I tried recreating the email context with several messages before and after, but that didn't reproduce the problem so far.
- The MimeParser is set to treat the stream as persistent and your code is consuming the message inside the loop that is parsing the messages, thereby throwing off the stream's seek position out from under the MimeParser.
Persistent was the default (false) in the test (see posted code snippet) and I'm not messing with the stream.
- There's a buffer boundary issue bug in the parser.
That's my assumption as well, but I'm not sure how to tweak the code to force a reproduction of the issue.
Have you tried the ExperimentalMimeParser?
I just did and it runs flawlessly!
I'm using defaults options, so RespectContentLength is set to false. Also, I tried recreating the email context with several messages before and after, but that didn't reproduce the problem so far.
Yea, if it's a buffering issue (and based on the fact that you are using the defaults and not doing anything with the stream, it sounds like it is), it'll be difficult to construct an mbox that will reproduce the issue because it'll likely be a corner case where some token or another needs to cross a read boundary at some stage in parsing, causing things to get out of whack.
IIRC, the ExperimentalMimeParser is a little more robust in that sense (if my memory is correct, it's been a while) because it tends to consume everything as it goes whereas the old MimeParser implementation has a few areas where it will abort an attempt to consume something if it can't get the whole token, call ReadAhead to refill the buffer (which moves any remaining buffer to the "prebuffer") and then tries again. This is most likely the cause of the issue, but I obviously can't be 100% certain (and there are several places that do that, so narrowing that down would be a pain).
I think the most likely culprit would be the StepHeaders() (or related) logic in MimeParser.cs. That code has gotten a bit scary, even for me, which is one of the reasons it got rewritten in ExperimentalMimeParser.
FWIW, my plan is to make ExperimentalMimeParser the default in v5.0 and rename the current MimeParser to LegacyMimeParser and more-or-less [Obsolete]
it.
I meant to do that for v4.0 but forgot, haha.
Describe the bug I get a "Failed to parse message headers." exception when parsing my ~11GB gmail mbox. This happens reliably, always on the same email. When the exception is raised, MimeParser.MboxMarker contains the string 'From : Subject : Date : ', from the middle of the email's DKIM-Signature header (see attachements). Mimekit has no problem parsing the offending email when it's extracted to its own file. So far, extracting 1MB of data around the email in an attempt to reproduce the problem has not been successful. A previous complete gmail mbox (now lost), that contained the same email, also had no parsing issue. I'm guessing it's related to the ReadAheadSize somehow?
Platform:
To Reproduce
Result
The location of the error state origin is already lost when the exception is raised, but in this particular case, it comes from line 965 in StepHeaders(...):
Expected behavior No exception
Code Snippets
Additional context
Workaround If I modify IsMboxMarker() to allow 'From ', but ignore 'From :', the exception is not raised. However, I'm not familiar with RFC standards or the performance impact to suggest this as a permanent fix (also, my test case to prove the fix includes a very private ~11GB file):