Multi-line recipient is parsed incorrectly

matteocontrini commented 4 days ago

Describe the bug

Hello, I'm parsing an email that has a very weird formatting of the To header. It looks like this:

To: <hello@example.com>, <hello@example.emai

 l>, <hello@example.net>

MimeKit is able to parse it but the address that is split between the lines is malformed:

hello@example.com
hello@example.emai
hello@example.net

There's a missing l as you can see.

Seeing that the library already handles the empty line and is able to continue the parsing of the To field, it probably should also recognize that the above line isn't finished.

Expected behavior

hello@example.com
hello@example.email
hello@example.net

Code Snippets

using MimeKit;

var ms = new MemoryStream();

ms.Write("""
         To: <hello@example.com>, <hello@example.emai

          l>, <hello@example.net>
         """u8);

ms.Position = 0;

var msg = MimeMessage.Load(ms);

Console.WriteLine(msg.To.ToString());

Thanks!

jstedfast commented 3 days ago

The issue is that it's syntactically illegal to put a line break in that location, that's why MimeKit is not handling it.

Clearly, whatever email program(?) generated those lines did not follow the specifications correctly.

Do you know how it was generated?

matteocontrini commented 3 days ago

> The issue is that it's syntactically illegal to put a line break in that location, that's why MimeKit is not handling it.

Yeah, I imagined that... MimeKit is however already capable of understanding that the From field continues, so I thought it may be possible to support these non-compliant emails without much effort.

A couple of things that probably weren't immediately clear from my message above:

- the empty line isn't actually empty: it contains a space character (otherwise it would indicate that the body has started, I guess?)
- the line starting with the l has a space before the l
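
To make those two points concrete, here is a minimal sketch of the header folding rule in plain C#. It is not MimeKit's parser; `UnfoldHeaders` and the sample input are hypothetical names made up for illustration. A line that starts with a space or tab is treated as a continuation of the previous field, and only a truly empty line ends the header block:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: split a raw header block into logical fields.
// A line starting with a space or tab continues the previous field;
// a truly empty line marks the end of the headers.
static List<string> UnfoldHeaders(string headerBlock)
{
    var fields = new List<string>();

    foreach (var line in headerBlock.Replace("\r\n", "\n").Split('\n'))
    {
        if (line.Length == 0)
            break; // empty line: the body starts here

        if ((line[0] == ' ' || line[0] == '\t') && fields.Count > 0)
            fields[fields.Count - 1] += line; // continuation: whitespace is kept
        else
            fields.Add(line);
    }

    return fields;
}

// The middle line holds a single space, so it continues the To field.
var sample = "To: <hello@example.com>, <hello@example.emai\r\n \r\n l>, <hello@example.net>\r\n\r\nbody";

foreach (var field in UnfoldHeaders(sample))
    Console.WriteLine(field);
```

With this input the sketch yields a single To field, and the whitespace from the folded lines ends up between `emai` and `l`.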

> Do you know how it was generated?

I don't. The email contains a DMARC report from Mimecast (which is ironic, since their main product is email-focused). I received it a few hours ago, and I use MimeKit to parse the email and extract the XML report.

I can certainly try to contact them and ask them to fix this, but I wouldn't be surprised if I get stuck at the sales layer.

jstedfast commented 3 days ago

Yea, you probably won't make much headway in calling them up to report a bug like this. I was just curious.

> MimeKit is however already capable of understanding that the From field continues

It's actually the same parser for From/To/Cc/Bcc/etc. It's not that line breaks in those headers aren't allowed; what matters is where in the text those line breaks fall.

For example, this would be syntactically legal:

To: <hello@example.com>, <hello@example.email

 >, <hello@example.net>

And so would this:

To: <hello@example.com>, <hello@example.

 email>, <hello@example.net>

The address parser is designed to be a token parser as per the email spec, and so these would all be considered tokens:

<
hello
@
example
.
email
>

Because of that, a line break or space character between any of those tokens is allowed, but not in the middle of any of those tokens, if that makes sense (hopefully I am explaining it well).
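
As a rough illustration of that idea (not MimeKit's actual tokenizer), here is a simplified sketch in plain C#. `Tokenize` is a hypothetical helper, and the set of special characters is an assumption that covers only what appears in these examples; quoted strings and comments are ignored entirely:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical, simplified tokenizer: each special character is its own token,
// whitespace (including a folded line break) separates tokens, and any other
// run of characters forms an atom.
static List<string> Tokenize(string text)
{
    const string specials = "<>@.,:;";
    var tokens = new List<string>();
    var atom = new StringBuilder();

    foreach (var c in text)
    {
        bool isSpecial = specials.IndexOf(c) >= 0;

        if (!char.IsWhiteSpace(c) && !isSpecial)
        {
            atom.Append(c); // still inside an atom
            continue;
        }

        if (atom.Length > 0)
        {
            tokens.Add(atom.ToString()); // whitespace or a special ends the atom
            atom.Clear();
        }

        if (isSpecial)
            tokens.Add(c.ToString());
    }

    if (atom.Length > 0)
        tokens.Add(atom.ToString());

    return tokens;
}

// Break between tokens: prints "< | hello | @ | example | . | email | >"
Console.WriteLine(string.Join(" | ", Tokenize("<hello@example.\r\n email>")));

// Break inside a token: prints "< | hello | @ | example | . | emai | l | >"
Console.WriteLine(string.Join(" | ", Tokenize("<hello@example.emai\r\n l>")));
```

In the second call the break falls inside the atom, so a token-based parser sees `emai` and `l` as two separate tokens rather than one domain label.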

Technically, the older versions of the specs are the only ones that allow that level of fine-grained breaking. The newer versions of those specifications allow much less.

For more detailed reading, you can check out RFC 2822, section 2.2.2 and section 3.4, which contains the address syntax. Then there's section 4.4, which covers obsolete syntax (which MimeKit also supports). Basically, RFC 2822 is an update of the syntax defined in RFC 822, and there are newer versions than 2822 as well: for example, 5322 and some 65XX specs that add UTF-8 headers (prior to 65XX, only ASCII was allowed, so any non-ASCII text had to be encoded somehow).

I'll leave this open until I have a chance to look into the address parser code and see what would be needed to support this scenario. If it isn't too difficult to support, I'll add some workaround logic.

matteocontrini commented 3 days ago

Thanks a lot for the details, yes the tokenization explanation makes sense.

I'm wondering how Mimecast creates their email messages... I'll try to show them this issue and I'll update you if I get somewhere.

In the meantime I'll try to implement some workaround on my end if these emails start to come in more frequently.
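
A rough sketch of what such a workaround might look like, in plain C#: strip any whitespace found between `<` and `>` before the header value is handed to the address parser. `RepairAngleAddresses` is a hypothetical helper, and it assumes that angle brackets never appear inside quoted display names or comments, which won't hold for every message:

```csharp
using System;
using System.Text;

// Hypothetical workaround: drop whitespace found between '<' and '>' so an
// address that was split mid-token is glued back together.
// Assumption: no quoted string or comment in the header contains '<' or '>'.
static string RepairAngleAddresses(string headerValue)
{
    var result = new StringBuilder(headerValue.Length);
    bool insideAngle = false;

    foreach (var c in headerValue)
    {
        if (c == '<')
            insideAngle = true;
        else if (c == '>')
            insideAngle = false;

        if (insideAngle && char.IsWhiteSpace(c))
            continue; // skip the stray fold inside the address

        result.Append(c);
    }

    return result.ToString();
}

var broken = "<hello@example.com>, <hello@example.emai\r\n \r\n l>, <hello@example.net>";

// prints "<hello@example.com>, <hello@example.email>, <hello@example.net>"
Console.WriteLine(RepairAngleAddresses(broken));
```

The repaired string could then be passed to something like `InternetAddressList.Parse`. How the raw header text is obtained in the first place is left out of this sketch.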

Thanks for your time!

jstedfast / MimeKit

Multi-line recipient is parsed incorrectly #1076