jstedfast / MimeKit

A .NET MIME creation and parser library with support for S/MIME, PGP, DKIM, TNEF and Unix mbox spools.
http://www.mimekit.net
MIT License

Header Value Encoding #883

Status: Closed (gungora closed this issue 1 year ago)

gungora commented 1 year ago

Hello,

Let's say I have a MIME message as follows:

From: John Doe <jdoe@machine.example>
To: Mary Smith <mary@example.net>
Subject: =?GB18030?B?1qTD9w==?=
Date: Fri, 21 Nov 1997 09:55:06 -0600
Message-ID: <1234@local.machine.example>

This is a message just to say hello.
So, "Hello".

If I access the subject of the message as follows:

var message = MimeMessage.Load(@"Z:\test.eml");
var subject = message.Subject;

I get the following result: Ö¤Ã÷, which only displays correctly if its bytes are reinterpreted using the GB18030 charset.

On the other hand, if I access it as follows:

var subject = message.Headers.First(x => x.Id == HeaderId.Subject).GetValue(Encoding.GetEncoding("GB18030"));

... then I get 证明, which appears to be the UTF-8 encoded version of the subject.

A couple of questions:

  1. Is the above discrepancy expected? Since the charset is specified in the RFC 2047-encoded subject, I expected that calling the GetValue() method with that same encoding would not yield a different result.

  2. Getting the header in UTF-8 form works better for me. If I need to call the GetValue() method with the corresponding encoding to do that, how would I go about determining the charset of the RFC 2047-encoded header? I do not see that information attached to the header itself.

Many thanks!
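On the second question, the charset name is part of the RFC 2047 encoded-word itself (=?charset?encoding?text?=), so it can be read from the raw, undecoded header text. The following is only a rough sketch using the .NET base class library: GetDeclaredCharsets is a hypothetical helper, not a MimeKit API, and obtaining the raw header text is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (NOT part of MimeKit): returns the charset names declared by
// RFC 2047 encoded-words (=?charset?encoding?text?=) found in a raw header value.
static List<string> GetDeclaredCharsets (string rawHeaderValue)
{
    // Group 1 is the charset; an optional "*language" suffix (RFC 2231) is skipped,
    // followed by the B or Q encoding marker and the encoded text.
    var encodedWord = new Regex (@"=\?([^?*\s]+)(?:\*[^?\s]*)?\?[BbQq]\?[^?\s]*\?=");
    var found = new List<string> ();

    foreach (Match match in encodedWord.Matches (rawHeaderValue))
        found.Add (match.Groups[1].Value);

    return found;
}

// The Subject value from the message above declares a single charset.
var declared = GetDeclaredCharsets ("=?GB18030?B?1qTD9w==?=");
Console.WriteLine (string.Join (", ", declared));
```

A name found this way could then be passed to Encoding.GetEncoding() before calling GetValue(). A header may contain several encoded-words with different charsets, which is why the sketch returns a list rather than a single name.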

jstedfast commented 1 year ago

The discrepancy you describe is not at all expected.

I just added the following test case to MimeKit's unit tests to verify that MimeKit does the right thing (and it does):

[Test]
public void TestIssue883 ()
{
    const string rawMessageText = @"From: John Doe <jdoe@machine.example>
To: Mary Smith <mary@example.net>
Subject: =?GB18030?B?1qTD9w==?=
Date: Fri, 21 Nov 1997 09:55:06 -0600
Message-ID: <1234@local.machine.example>

This is a message just to say hello.
So, ""Hello"".";

    using (var source = new MemoryStream (Encoding.UTF8.GetBytes (rawMessageText))) {
        var message = MimeMessage.Load (source);

        Assert.AreEqual ("证明", message.Subject);
    }
}

The test passes.

jstedfast commented 1 year ago

Ah, I bet I know what problem you are hitting.

You probably forgot to call:

System.Text.Encoding.RegisterProvider (CodePagesEncodingProvider.Instance);

You need to call that before making any calls to MimeKit or MailKit.

MimeKit initializes its charset mapping when the first call to MimeKit is made. If it can't find a charset at that point, it maps the charset name to iso-8859-1 (because that is always available), which is why the GB18030 bytes of your subject came back as Ö¤Ã÷.
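The effect of that fallback can be reproduced without MimeKit. The sketch below is illustrative only: it assumes a modern .NET runtime (.NET Core or .NET 5+) where CodePagesEncodingProvider is available, and it decodes the Base64 payload of the Subject's encoded-word by hand instead of going through MimeKit's parser. The variable names are made up for the example.

```csharp
using System;
using System.Text;

// Register the code-pages provider so that GB18030 can be resolved on .NET Core / .NET 5+.
Encoding.RegisterProvider (CodePagesEncodingProvider.Instance);

// Base64 payload of the encoded-word =?GB18030?B?1qTD9w==?= (bytes D6 A4 C3 F7).
byte[] payload = Convert.FromBase64String ("1qTD9w==");

// Decoding with the declared charset gives the intended subject.
string asGb18030 = Encoding.GetEncoding ("GB18030").GetString (payload);

// Decoding the same bytes as iso-8859-1 gives the garbled text from the report.
string asLatin1 = Encoding.GetEncoding ("iso-8859-1").GetString (payload);

Console.WriteLine (asGb18030 == "证明");
Console.WriteLine (asLatin1 == "Ö¤Ã÷");
```

Both comparisons evaluate to true: the same four bytes read as 证明 under GB18030 and as Ö¤Ã÷ under iso-8859-1. Without the RegisterProvider() call, Encoding.GetEncoding ("GB18030") would be expected to throw on .NET Core, since that charset is not among the built-in encodings there.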

gungora commented 1 year ago

That was it 😊 I believe MimeKit used to call Encoding.RegisterProvider() itself in CharsetUtils, but it looks like this changed back in July. Thanks for bringing this up—we will call it ourselves now.