jstedfast / MailKit

A cross-platform .NET library for IMAP, POP3, and SMTP.
http://www.mimekit.net
MIT License

Headers with multiple encoding fail to be decoded by major email clients #139

Closed ThomasCadiou closed 9 years ago

ThomasCadiou commented 9 years ago

I am trying to send emails using MailKit, but some subjects fail to be encoded properly and some accented characters appear corrupted (shown as a question mark). This may be random, but it happens for all accented characters after the 60th in my test subject. I am simply using MimeMessage and setting the subject like this:

message.Subject = "Retrouvez-nous à la Chaux-Neuve à l’occasion de la Coupe du Monde de Combiné Nordique";

I am not setting any encoding in the headers, as the body is sent properly with all its accented characters.

Here are my samples, encoded using MailKit, to show where the problem occurs:

Retrouvez-nous à la Chaux-Neuve à l’occasion de la Coupe dué Monde de Combiné Nordique

The first 'é' (60th character) works, the last one doesn't.

=?utf-8?B?UmV0cm91dmV6LW5vdXMgw6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2Fz?=
=?utf-8?B?aW9uIGRlIGxhIENvdXBlIGR1IE1vbmRlIGRlIENvbWJpbu+/vSBOb3JkaXE=?=
=?utf-8?B?dWU=?=

Retrouvez-nous à la Chaux-Neuve à l’occasion de la Coupe du éMonde de Combiné Nordique

The first (61st character) and last 'é' don't work.

=?utf-8?B?UmV0cm91dmV6LW5vdXMgw6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2Fz?=
=?utf-8?B?aW9uIGRlIGxhIENvdXBlIGR1IO+/vU1vbmRlIGRlIENvbWJpbu+/vSBOb3Jk?=
=?utf-8?B?aXF1ZQ==?=

The initial subject is:

Retrouvez-nous à la Chaux-Neuve à l’occasion de la Coupe du Monde de Combiné Nordique

Encoded using MailKit (with error):

=?utf-8?B?UmV0cm91dmV6LW5vdXMgw6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2Fz?=
=?utf-8?B?aW9uIGRlIGxhIENvdXBlIGR1IE1vbmRlIGRlIENvbWJpbu+/vSBOb3JkaXE=?=
=?utf-8?B?dWU=?=

Encoded using System.Web.Mail (properly encoded):

=?utf-8?B?UmV0cm91dmV6LW5vdXMgw6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2Fz?=
=?utf-8?B?aW9uIGRlIGxhIENvdXBlIGR1IE1vbmRlIGRlIENvbWJpbsOpIE5vcmRpcXU=?=
=?utf-8?B?ZQ==?=
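
As a quick sanity check on the two encodings of the initial subject, the second encoded-word of each can be decoded by hand. This is only a sketch using the .NET base class library (not MimeKit), written as a C# 9 top-level program; the variable names are illustrative. The payload labelled as MailKit output carries U+FFFD, the Unicode replacement character, where the 'é' should be, which suggests the character was already lost before the header was base64-encoded:

```csharp
using System;
using System.Text;

// Base64 payloads of the second encoded-word, copied from the two samples.
string broken = "aW9uIGRlIGxhIENvdXBlIGR1IE1vbmRlIGRlIENvbWJpbu+/vSBOb3JkaXE=";
string correct = "aW9uIGRlIGxhIENvdXBlIGR1IE1vbmRlIGRlIENvbWJpbsOpIE5vcmRpcXU=";

// Both encoded-words declare utf-8, so decode the bytes as UTF-8.
string brokenText = Encoding.UTF8.GetString (Convert.FromBase64String (broken));
string correctText = Encoding.UTF8.GetString (Convert.FromBase64String (correct));

// U+FFFD is the Unicode replacement character; U+00E9 is 'é'.
Console.WriteLine (brokenText.IndexOf ('\uFFFD') >= 0);  // prints True
Console.WriteLine (correctText.IndexOf ('\u00E9') >= 0); // prints True
```

In other words, the base64 layer itself is intact in both cases; only the bytes that were encoded differ.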

jstedfast commented 9 years ago

MailKit doesn't do the encoding or decoding, so this really belongs under MimeKit.

That said, I've tried to reproduce the problem and I can't.

The results I get when encoding the subject are different than what you appear to be getting. For example, when I use the last subject string, I get:

Retrouvez-nous =?utf-8?b?w6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2FzaW9u?= de la Coupe du Monde de =?iso-8859-1?q?Combin=E9?= Nordique

Notice that MimeKit attempts to encode as little as possible and prefers to use iso-8859-1 instead of utf-8 when it can (as well as uses lower-case b rather than upper-case B).

So I gotta ask: are you SURE that the encoded strings are coming from MimeKit? And if so, what version of MimeKit are you using? Could you try running the latest version (the version released on NuGet is fine if that is easier than using the latest code on GitHub)?

jstedfast commented 9 years ago

This is just a theory, so it could be completely wrong, but is it possible that the SMTP or IMAP/POP3 server is re-encoding the subject and producing a broken result?

The encoded strings that you said were encoded using MimeKit are definitely broken, so that's why, when they are decoded by MimeKit, they have unicode ?'s in the string (that's .NET's fallback character for illegal byte sequences in the input).
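
To illustrate the re-encoding theory, here is a minimal sketch (plain .NET, no MimeKit, C# 9 top-level statements; all names are illustrative) of how a step that decodes iso-8859-1 bytes as if they were UTF-8 would turn 'é' into U+FFFD, whose UTF-8 form is the byte sequence EF BF BD. That a server in the delivery path behaves this way is an assumption; nothing in this thread confirms its internals.

```csharp
using System;
using System.Text;

// "Combin=E9" after quoted-printable decoding: 0xE9 is 'é' in iso-8859-1.
byte[] latin1Bytes = { 0x43, 0x6F, 0x6D, 0x62, 0x69, 0x6E, 0xE9 };

// Honoring the declared charset recovers the original text.
string right = Encoding.GetEncoding ("iso-8859-1").GetString (latin1Bytes);

// Treating the same bytes as UTF-8 does not: 0xE9 opens a multi-byte
// sequence that never completes, so .NET substitutes U+FFFD.
string wrong = Encoding.UTF8.GetString (latin1Bytes);

// Re-encoding the damaged string as UTF-8 bakes the replacement character in.
byte[] reencoded = Encoding.UTF8.GetBytes (wrong);

Console.WriteLine (right == "Combin\u00E9");                              // prints True
Console.WriteLine (wrong == "Combin\uFFFD");                              // prints True
Console.WriteLine (BitConverter.ToString (reencoded, reencoded.Length - 3)); // prints EF-BF-BD
```

Once the replacement character has been written out, no downstream decoder can recover the original 'é'.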

ThomasCadiou commented 9 years ago

Sorry for posting in the wrong repository, I actually hesitated between the two and ended up picking the wrong one. After further checking, it appears that during the delivery process, the subject gets re-encoded as you suggested. I just captured the sent email using Smtp4Dev (the raw sent message) and get the subject just like you:

Retrouvez-nous =?utf-8?b?w6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2FzaW9u?= de la Coupe du Monde de =?iso-8859-1?q?Combin=E9?= Nordique

If I open the captured email in Outlook 2013 though, I still get the same corrupted 'é' character. This is weird as the formatting is correct, so there may be an issue with Outlook or Smtp4Dev. I'll have to use some other tools to double-check.

Thanks for your quick reply, I will try to investigate it further when I get time. Maybe you'd prefer me to post this elsewhere (like SO) as this is likely not really an issue with this project?

jstedfast commented 9 years ago

Don't worry about posting the bug report to the wrong project (MimeKit vs MailKit), it's not the end of the world ;-)

FWIW, in my attempts to recreate this problem earlier, I wrote an NUnit test case and compared the original subject string with the decoded subject string and they matched, so MimeKit's decoder is getting it right.

It's weird that Outlook is getting it wrong, though.

As you suggested, perhaps someone on StackOverflow will have some ideas as to why that is.

I'll close this for now since it seems like it's not a bug in MimeKit or MailKit.

Have a merry Christmas!

jstedfast commented 9 years ago

Thinking about this, I think I have a pretty good idea of what the bug in Outlook (and the server software as well that is re-encoding the subject) is:

Retrouvez-nous =?utf-8?b?w6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2FzaW9u?= de la Coupe du Monde de =?iso-8859-1?q?Combin=E9?= Nordique

I bet that when Outlook parses/decodes this value, it assumes that since the first encoded-word has a charset of UTF-8, every other encoded-word does too, and it doesn't even bother to examine the charset used in later encoded-words.

In other words, I think that this is Outlook's pseudo code that decodes the Subject:

var decoded = GetBytes ("Retrouvez-nous ")
    .Append (Base64Decode ("w6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2FzaW9u"))
    .Append (GetBytes (" de la Coupe du Monde de "))
    .Append (QuotedPrintableDecode ("Combin=E9"))
    .Append (GetBytes (" Nordique"));
var subject = Encoding.GetEncoding ("utf-8").GetString (decoded);

When the pseudo-logic should look more like this:

var subject = "Retrouvez-nous " +
    Encoding.GetEncoding ("utf-8").GetString (Base64Decode ("w6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2FzaW9u")) +
    " de la Coupe du Monde de " +
    Encoding.GetEncoding ("iso-8859-1").GetString (QuotedPrintableDecode ("Combin=E9")) +
    " Nordique";
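
The two snippets above are pseudocode. A runnable approximation using only the .NET base class library might look like the following (C# 9 top-level statements). It hard-codes the quoted-printable result for "Combin=E9", and the "suspected" branch is only a guess at the failure mode, not Outlook's actual code; all variable names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Raw bytes of the two encoded-words after transfer decoding.
byte[] word1 = Convert.FromBase64String ("w6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2FzaW9u");
byte[] word2 = { 0x43, 0x6F, 0x6D, 0x62, 0x69, 0x6E, 0xE9 }; // "Combin=E9"

// Suspected behavior: concatenate every byte, then decode once as UTF-8.
var all = new List<byte> ();
all.AddRange (Encoding.ASCII.GetBytes ("Retrouvez-nous "));
all.AddRange (word1);
all.AddRange (Encoding.ASCII.GetBytes (" de la Coupe du Monde de "));
all.AddRange (word2);
all.AddRange (Encoding.ASCII.GetBytes (" Nordique"));
string suspected = Encoding.UTF8.GetString (all.ToArray ());

// Expected behavior: decode each encoded-word with its own declared charset.
string expected = "Retrouvez-nous " +
    Encoding.UTF8.GetString (word1) +
    " de la Coupe du Monde de " +
    Encoding.GetEncoding ("iso-8859-1").GetString (word2) +
    " Nordique";

// The single-charset decode loses the 'é' (U+00E9) and yields U+FFFD instead.
Console.WriteLine (suspected.IndexOf ('\uFFFD') >= 0); // prints True
Console.WriteLine (expected.IndexOf ('\u00E9') >= 0);  // prints True
```

The only difference between the two results is the final 'é', which matches the corruption reported for Outlook.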

ThomasCadiou commented 9 years ago

That is an interesting analysis of the issue, I didn't go that far. If this is what's happening, I guess the only thing that could be done would be to force utf-8 (or another single encoding) for the whole subject header, right? My guess was to use the International field from the FormatOptions passed to the Send method of SmtpClient. Unfortunately Smtp4Dev doesn't support that format and I get:

The SMTP server does not support the SMTPUTF8 extension

I didn't try any further with that option since I'm supposed to use the application with unpredictable SMTP servers and therefore can't assume this extension will work on any/all of them.

Then again I don't have much time on my hands for this specific issue and don't really want to waste your time so don't feel the need to investigate any further. I'll have a thorough look at it and will come back here when I figure out the solution.

Have a merry Christmas too!

jstedfast commented 9 years ago

Looks like Thunderbird has exactly this problem as well: https://bugzilla.mozilla.org/show_bug.cgi?id=317263

Clearly everyone needs to use my code because it handles all this correctly. I think it's time that Mozilla and Microsoft contract out to me to write their MIME encoders/decoders ;-)

It's depressing to me that these 2 popular mail clients utterly fail to properly handle the sample included with RFC 2047 (which is where the header encoding rules are defined):

Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
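
For reference, a minimal per-word decoder can be sketched with the .NET base class library alone. This is not MimeKit's implementation; DecodeHeader and DecodeQ are hypothetical helper names, the regular expressions are a simplification of the RFC 2047 grammar, and the code assumes .NET Core 3.0 or later (C# 9 top-level statements), where CodePagesEncodingProvider must be registered before iso-8859-2 can be resolved.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Makes legacy charsets such as iso-8859-2 available on .NET Core / .NET 5+.
Encoding.RegisterProvider (CodePagesEncodingProvider.Instance);

string rfcSample = "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\r\n" +
    "    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=";
string subject = "Retrouvez-nous =?utf-8?b?w6AgbGEgQ2hhdXgtTmV1dmUgw6AgbOKAmW9jY2FzaW9u?=" +
    " de la Coupe du Monde de =?iso-8859-1?q?Combin=E9?= Nordique";

Console.WriteLine (DecodeHeader (rfcSample)); // prints If you can read this you understand the example.
Console.WriteLine (DecodeHeader (subject) == "Retrouvez-nous \u00E0 la Chaux-Neuve \u00E0 l\u2019occasion de la Coupe du Monde de Combin\u00E9 Nordique"); // prints True

// Hypothetical helper: decodes each encoded-word with its own charset.
string DecodeHeader (string value)
{
    // Whitespace that only separates two encoded-words is dropped.
    value = Regex.Replace (value, @"(?<=\?=)\s+(?==\?)", "");

    return Regex.Replace (value, @"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=", m => {
        Encoding charset = Encoding.GetEncoding (m.Groups[1].Value);
        string payload = m.Groups[3].Value;
        byte[] bytes = m.Groups[2].Value.ToUpperInvariant () == "B"
            ? Convert.FromBase64String (payload)
            : DecodeQ (payload);
        return charset.GetString (bytes);
    });
}

// Hypothetical helper: the "Q" encoding, where '_' is a space and =XX is a hex byte.
byte[] DecodeQ (string text)
{
    var bytes = new List<byte> ();
    for (int i = 0; i < text.Length; i++) {
        if (text[i] == '_') {
            bytes.Add (0x20);
        } else if (text[i] == '=' && i + 2 < text.Length) {
            bytes.Add (Convert.ToByte (text.Substring (i + 1, 2), 16));
            i += 2;
        } else {
            bytes.Add ((byte) text[i]);
        }
    }
    return bytes.ToArray ();
}
```

The key point is that the charset is looked up inside the per-match callback, so each encoded-word is decoded independently of its neighbors.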

I've gone ahead and added FormatOptions.AllowMixedHeaderCharsets (defaults to true) which can be disabled to work around this issue. Just add the following snippet to your code to change the global default:

FormatOptions.Default.AllowMixedHeaderCharsets = false;

I'm not sure I like the name of that property, so I may change it, but for now I guess it's good enough.

Fixed in https://github.com/jstedfast/MimeKit/commit/54d9f87d044bca849338ab02e8428099b83b85cc

ThomasCadiou commented 9 years ago

From what I've learnt developing an emailing solution, the RFCs are to be taken lightly and you have to expect everything to work poorly, sadly... Thank you for doing all the investigation and fixing, I'll look forward to the nuget update to push that solution in my main branch. Keep up the great work, this is so much better than the .NET solution :-)

By the way, wow... 2005 and still there.

jstedfast commented 9 years ago

Yea, I'm more used to having to deal with broken inputs though, and not so much having to carefully craft outputs for another client to be able to parse and/or decode them properly.

jstedfast commented 9 years ago

I just released MimeKit 1.0.4 to NuGet this morning with this fix.

ThomasCadiou commented 9 years ago

Thank you!

ghost commented 8 years ago

This issue has not been resolved yet. I had to go into FormatOptions.cs and set the value to false in the default constructor. Am I missing something here?

jstedfast commented 8 years ago

The fix was to add the option in the first place. That said, I've just changed the default value to false.