Issue decoding headers not properly rfc2047 encoded

jstedfast / MimeKit

A .NET MIME creation and parser library with support for S/MIME, PGP, DKIM, TNEF and Unix mbox spools.

http://www.mimekit.net

MIT License


Issue decoding headers not properly rfc2047 encoded #31

Closed: mol closed this issue 10 years ago

mol commented 10 years ago

Probably can't be considered a bug as such, but I wanted to hear your thoughts on it.

Consider an email with these headers:

Content-type: text/plain; charset=koi8-r
Subject: äÅÓÑÔËÁ áÎÅËÄÏÔÏ× äÎÑ

If subject is decoded using MimeKit, it's not "properly" decoded to: Десятка Анекдотов Дня. From what I understand this is because the subject header is not properly rfc2047 encoded. Decoding using koi8-r decodes it properly.

I'm thinking one could decode the subject (or other headers) using the content-type charset, if set, if the header doesn't explicitly say what charset it's using? But I suppose that's more of an IMAP client issue (that I'm incidentally working on) than a MimeKit issue? :)
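To make the mismatch concrete, here is a minimal Python sketch (standard library only, not MimeKit code) that takes the bytes a KOI8-R sender would emit for this subject and decodes them two ways. The variable names are illustrative.

```python
# The subject from the example, as the raw bytes a KOI8-R sender puts on the wire.
raw_subject = "Десятка Анекдотов Дня".encode("koi8-r")

# Decoding as ISO-8859-1 (Latin-1) never fails, because every byte maps to
# some character, but the result is mojibake.
as_latin1 = raw_subject.decode("latin-1")

# Decoding with the charset named in the Content-Type header recovers the text.
as_koi8r = raw_subject.decode("koi8-r")

print(as_latin1)
print(as_koi8r)
```

Both decodes succeed without an error, which is the core of the problem: nothing in the bytes themselves says which result is the right one.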

jstedfast commented 10 years ago

This is a hard problem to solve because you literally have no idea what charset that header is in. It might be in the charset specified for the message body (the Content-Type charset parameter) or it could be UTF-8 or it could be the sender's locale charset, or any number of other possible charsets.

Keep in mind that there can be multiple text parts within a message (encapsulated in a multipart). When that happens, each one might have a different charset... or the message might just contain a PDF file or a JPEG, neither of which would have a charset parameter... or the top-level MIME part might be an S/MIME encrypted part and you have to decrypt the part before you can parse the Content-Type header with the charset parameter.

So, the way I've tried to support this is to allow developers to set a fallback charset on the ParserOptions that they pass to the parser (or MimeMessage.Load). Generally what you want to do is set this to the user's preferred charset (typically his/her system's Locale charset, but your client could also provide a way for the user to override that default).

Obviously, a user may get emails from someone who uses a different locale charset than the user and so that option won't always work. The solution to this (and you'll notice mail clients like Outlook, Evolution, Thunderbird, etc all do this) is to allow the user to override the charset and then you simply re-parse the message with the user-specified charset as the fallback.

This kinda sucks in that you have to re-parse the message, so I've tried to make it possible to override the charset w/o re-parsing everything in MimeKit.

For example, TextPart has a GetText() method that takes a charset override.

For headers, you'll notice that each Header has a RawValue property that contains the raw byte[] which you can then pass into Rfc2047.DecodeText() (and then Header.Unfold() if you want to unfold it).

Since you brought this up, though, I figured I'd add a convenience method: Header.GetValue(Encoding charset)

The problem with handling this at the parser level is that there's no way for the parser to know that it got it right. If it sees a bunch of 8-bit characters, it has no idea whether iso-8859-1 or koi8-r is the correct charset. Plus, the header decoding happens as the message is parsed, so any header that comes before the Content-Type header won't have access to the charset parameter as a fallback...
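The ambiguity described above can be sketched in a few lines of Python. This is only an illustration of a fallback-charset strategy, not MimeKit's implementation; `decode_raw_header` and its default fallback are hypothetical.

```python
def decode_raw_header(raw: bytes, fallback_charset: str = "iso-8859-1") -> str:
    """Decode a header value that is NOT RFC 2047 encoded (illustrative only)."""
    try:
        # ASCII and well-formed UTF-8 can be recognized reliably.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Anything else is a guess: use the caller-supplied fallback charset.
        return raw.decode(fallback_charset, errors="replace")


raw = "Десятка Анекдотов Дня".encode("koi8-r")
print(decode_raw_header(raw, "koi8-r"))  # fallback chosen by the user
print(decode_raw_header(raw))            # default fallback: decodes, but wrongly
```

The wrong fallback does not raise an error; it silently produces different text, which is why a parser cannot tell whether it guessed right.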

Anyway, hopefully the convenience method I've added is helpful.

mol commented 10 years ago

Thanks for the fast and very detailed answer. I was hoping to avoid giving the user the ability to change charsets (by having it not be necessary), since we're aiming to keep a very clean email client (www.getmailbird.com), but it might turn out to be futile - we'll see :)

This seems promising though for the actual detection part: https://code.google.com/p/ude/

jstedfast commented 10 years ago

Interesting and thanks for sharing. I had written a charset detector back in 2001 or thereabouts but the masking tables needed were massive, and then the web server I was storing my project on went down and I lost the code :-(

Ironically, I wrote it precisely for the reason brought up in this issue back when I was working on the Evolution mail client.

I'll have to take a closer look at this Mozilla Charset Detection library at some point and maybe take advantage of it.

Good luck on your mail client, I watched the video and it seems pretty elegant. Very nicely done, and it certainly seems like the cleanest interface I've seen.

jstedfast commented 10 years ago

Actually, since you are writing a mail client with performance as one of your high-priority criteria - what SMTP, POP3, and/or IMAP backend are you guys using? Did you write your own?

I was looking at a bunch of the open source libraries and they made me cringe. For me, correctness is more important than performance, but performance is still pretty important to me and most of these libraries were pretty bad in both regards.

I ended up writing my own SmtpClient and Pop3Client (they are in my MailKit project here on GitHub). Not sure if you are interested in them or not (if you guys have written your own clients, you might not be interested), but thought I'd mention it in case you guys might find it useful. I'm sort of playing around with the idea of implementing an ImapClient on the "imap" branch of my MailKit project, but I haven't gotten very far yet - mostly just sketching out API design.

mol commented 10 years ago

Really? So basically the standards are no better followed now than they were in 2001 :)

I tried the detector briefly and it correctly identified the charset in the example, but I'll have to test it with some more messages. I'm thinking of comparing the result to the Content-Type charset, if set, and only assuming it's correct if they're the same. Also, only if the header is not RFC 2047 encoded. It also has a confidence rating to use when deciding whether to trust it.
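The rule proposed here could look roughly like the following Python sketch. The detector's output is passed in as plain parameters so that no particular detection library is assumed; `pick_header_charset` and the 0.9 confidence threshold are hypothetical choices for illustration.

```python
import codecs
import re
from typing import Optional

# Matches an RFC 2047 encoded-word such as =?koi8-r?B?...?=
ENCODED_WORD = re.compile(rb"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def same_charset(a: str, b: str) -> bool:
    """Compare charset names after normalizing aliases (e.g. KOI8-R vs koi8_r)."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return False


def pick_header_charset(raw_value: bytes,
                        detected: Optional[str],
                        confidence: float,
                        content_type_charset: Optional[str],
                        min_confidence: float = 0.9) -> Optional[str]:
    """Return a charset to decode with, or None if the guess is not trusted."""
    if ENCODED_WORD.search(raw_value):
        return None  # properly encoded headers already name their charset
    if detected is None or content_type_charset is None:
        return None
    if confidence < min_confidence:
        return None
    if not same_charset(detected, content_type_charset):
        return None  # detector and Content-Type disagree: do not trust either
    return codecs.lookup(detected).name
```

Returning `None` leaves the decision to whatever default the client already uses, so the heuristic only ever overrides that default when both signals agree.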

Thanks. I must admit I didn't think it would be this complicated making an email client. Now I know why some people said we were brave to attempt it ;)

We've yet to implement POP3 support actually, but we're using Mail.dll: http://www.limilabs.com/mail. Not free but works very well, and they're very responsive and fast to respond to issues/add improvements. Only thing is they're a bit strict following the standards, so I'm evaluating using MimeKit for the address and subject parsing part instead. Standards are great and I love that it follows them to the letter, in a way, but you just can't have a client without a "quirks mode" because of all the malformed messages around :)

jstedfast commented 10 years ago

Yea, writing a mail client can be really hard. Before working on MimeKit, I had written another MIME parser library called GMime (a very fast C library but unfortunately tied to GLib and not easily buildable on anything but Linux and other Unixes... not that it can't be done, just difficult). That library was based on my experience working on the Evolution mail client on Linux for ~6 years or so, so it had some decent "quirks mode"-type hacks already, but I continued adding more and more as the GMime library matured over the past 10 years or so.

In many ways, MimeKit has benefited a lot from my experience figuring out ways around various mail client header encoding quirks. I ranted about some of them here: http://jeffreystedfast.blogspot.com/2013/08/why-decoding-rfc2047-encoded-headers-is.html

Mail.dll was one of the libraries I was interested in comparing the performance of MimeKit against. I did performance comparisons with a bunch of the open source libraries and MimeKit was orders of magnitude faster (no joke - see the second half of this blog post: http://jeffreystedfast.blogspot.com/2013/10/optimization-tips-tricks-used-by.html ). If you get bored and are interested in writing up a simple test program like the ones I listed in my blog post, I'd love to know how MimeKit compares.

But anyway, yea, having a quirks mode (MimeKit's ParserOptions.EnableRfc2047Workarounds property is basically MimeKit's "quirks mode" and is enabled by default, although it's got other workarounds that you can't disable) is really useful when trying to deal with real-world mail - having a super strict parser can be a major pain when writing a mail client for real people ;-)

mol commented 10 years ago

Your blog post (excellent btw) on why decoding RFC 2047 headers is hard is actually what brought me to MimeKit, while looking for a parser that could "magically" fix the charset issue a few of our Russian users are having :)

Sure, if you send me a sample message to parse using Mail.dll, or where exactly to split the jwz.mbox.txt file, I'll do a quick test to match the tests on your blog. Once in Mail.dll and once with MimeKit. Then we'll see which is faster. I'd assume MimeKit by far though, as you've done some pretty cool optimizations and I don't think Mail.dll is built for speed :)

We're actually having some high CPU usage issues in Mailbird at the moment, while downloading, parsing and indexing messages, because we're doing a lot at the same time, but we'll optimize it eventually. There's just so much to do :)

jstedfast commented 10 years ago

Ah, cool, I didn't know my blog had that much reach!

Here's the startrek.msg that I used in the performance testing that I blogged about:

https://gist.github.com/jstedfast/8419032 - there should be a "Download Gist" button on the left that will download the file (or a zipped version of the file). It's about ~177k or so.

mol commented 10 years ago

Yeah it's all over Google ;)

Thanks, I downloaded the file and created a small test program to try both out. MimeKit was done after 13.4937347 seconds. Mail.dll is still running :) For....something like 10 minutes now. Stopping...

Let me see. Ok, 100 messages takes 4.6257416, so it should take about 925.14832 seconds, or 15.42 minutes. I'd say MimeKit won that one :)
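For reference, the extrapolation above can be reproduced with a few lines of Python. The factor of 200 is not stated in the thread; it is inferred from the reported numbers (925.14832 / 4.6257416), which presumably corresponds to a 20,000-message run.

```python
mimekit_seconds = 13.4937347          # full run, as reported
maildll_seconds_per_100 = 4.6257416   # 100 messages, as reported

# Inferred scale factor: 200 batches of 100 messages.
scale = 200
maildll_estimate = maildll_seconds_per_100 * scale

print(f"Mail.dll estimate: {maildll_estimate:.5f} s "
      f"({maildll_estimate / 60:.2f} min)")
print(f"Estimated ratio: {maildll_estimate / mimekit_seconds:.1f}x")
```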

Can I pick your brain about an IMAP issue that I've encountered? Maybe you know something about it. Let's say I upload a message to a folder and afterwards fetch the uids of that folder. If I do it immediately afterwards, sometimes the uid of the uploaded message isn't there. Even after nooping, deselecting the folder and selecting it again or even creating a new connection to the server. Only after 5-7 seconds does the server "realize" the message is there. Have you noticed this in your experiences with Evolution?

jstedfast commented 10 years ago

Awesome, thanks for sharing the results - I've been curious as to how the commercial libraries fared against MimeKit.

The IMAP thing sounds like it might be server-specific. There's an IMAP extension called UIDPLUS that enhances commands like APPEND so that the server's response to the append command returns the UID that gets assigned to the message.
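With UIDPLUS (RFC 4315), a successful APPEND carries an `APPENDUID` response code containing the mailbox's UIDVALIDITY and the new UID. A minimal Python sketch of pulling those two numbers out of the tagged response follows; `parse_appenduid` is a hypothetical helper, and it only handles the single-message case.

```python
import re
from typing import Optional, Tuple

# Response code defined by UIDPLUS: [APPENDUID <uidvalidity> <uid>]
APPENDUID = re.compile(r"\[APPENDUID (\d+) (\d+)\]")


def parse_appenduid(tagged_response: str) -> Optional[Tuple[int, int]]:
    """Return (uidvalidity, uid) or None if the server sent no APPENDUID."""
    match = APPENDUID.search(tagged_response)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_appenduid("A003 OK [APPENDUID 38505 3955] APPEND completed"))
print(parse_appenduid("A003 OK APPEND completed"))
```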

If you know what the UIDNEXT property of the mailbox is (it's one of the values that gets returned when you SELECT a mailbox), you can generally assume that it will be the UID assigned to any message that you append. Of course, if you append multiple messages, you need to update your local "UIDNEXT" state (by incrementing the UIDNEXT value after each append).

There are a few caveats to that assumption that you need to be careful of, however:

1. The mailbox is the INBOX and so it is possible that the user is receiving new emails while you have the mailbox SELECTED.
2. Some IMAP servers allow multiple clients to SELECT the same mailbox concurrently (most don't, which is why you can sometimes get a READ-ONLY mode response to a SELECT, which effectively makes it equivalent to an EXAMINE). If this is the case, there may be multiple clients appending messages to that mailbox (not likely, but possible).
3. The server has server-side filtering running that may deliver messages to the mailbox while you have it SELECTED.

These gotchas are probably why the UIDPLUS extension exists, because without it, keeping state locally (for cached and/or offline support) is nearly impossible.

In Evolution, what I remember doing, is if the server didn't support UIDPLUS, I used the UIDNEXT assumption, but I marked the message in my database as being a throwaway record, suggesting that the next time we FETCHed ENVELOPE/FLAGS/UID/etc info from the server, we would replace those records with real data.
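The "throwaway record" approach can be sketched as a small local cache in Python. This is an illustration of the idea as described, not Evolution's actual code; `MailboxCache`, `CachedMessage` and their fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class CachedMessage:
    uid: int
    provisional: bool  # True when the UID is only a UIDNEXT-based guess


class MailboxCache:
    def __init__(self, uidnext: int) -> None:
        self.uidnext = uidnext  # value reported by SELECT
        self.messages: List[CachedMessage] = []

    def record_append(self, server_uid: Optional[int] = None) -> CachedMessage:
        """Record an appended message; server_uid comes from UIDPLUS if available."""
        if server_uid is not None:
            entry = CachedMessage(server_uid, provisional=False)
            self.uidnext = max(self.uidnext, server_uid + 1)
        else:
            # No UIDPLUS: assume the message received UIDNEXT and mark it throwaway.
            entry = CachedMessage(self.uidnext, provisional=True)
            self.uidnext += 1
        self.messages.append(entry)
        return entry

    def resync(self, server_uids: Iterable[int]) -> None:
        """Replace all records, including guesses, with what the server reports."""
        self.messages = [CachedMessage(uid, False) for uid in sorted(server_uids)]
        if self.messages:
            self.uidnext = max(self.uidnext, self.messages[-1].uid + 1)
```

Any record still marked `provisional` should be treated as unverified until the next `resync`, since the caveats listed above can make the guess wrong.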

Hopefully that helps. But yea, IMAP is a total PITA - it's great in some ways, but a number of the extensions should really be REQUIRED features (e.g. UIDPLUS and CHANGEDSINCE are invaluable when keeping a local cache).

mol commented 10 years ago

You're welcome. Mail.dll's parser creates a much more "complete" message object though, from what I can see of the MimeKit message. MimeKit doesn't parse attachments, calendar appointments and different body structures (for multipart messages), does it? http://www.limilabs.com/static/mail/documentation/html/AllMembers_T_Limilabs_Mail_IMail.htm

Thanks :) Yeah we use UIDPLUS to get the uid for the servers that support it - which is not many unfortunately. Outlook.com for instance doesn't support it and it's the worst server with regards to the issue I mentioned. If you upload a message to the drafts folder and then check if it's there, it won't be, for about 8 seconds :)

I've just added some retry functionality to wait up to 18 seconds for the uid. Seems to work well with Outlook.com. It happens in the background so is not visible to the user.
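A retry loop of this kind might look like the Python sketch below. Only the 18-second limit comes from the description above; the 1-second polling interval, the `wait_for_uid` name and the injectable `clock`/`sleep` parameters (there to make the loop testable) are assumptions.

```python
import time
from typing import Callable, Optional


def wait_for_uid(find_uid: Callable[[], Optional[int]],
                 timeout: float = 18.0,
                 interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Poll find_uid() until it returns a UID or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        uid = find_uid()
        if uid is not None:
            return uid
        if clock() >= deadline:
            return None  # give up; the caller decides what to do next
        sleep(interval)
```

Here `find_uid` would wrap whatever search the client runs against the server, returning `None` while the message is not yet visible.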

I was thinking about using the UIDNEXT value as the uid instead, like you say, as a throwaway value to update, but I was thinking it's a little risky. What if the user were to create another draft through the web interface and then decide to delete the draft created by Mailbird, before that uid was checked as being valid? Then Mailbird might have the wrong uid referring to the other message and would actually delete the wrong draft :)

I suppose we could double-check the message is what we think it is before performing any action on it however, but then we could just as easily search for the correct uid at that point, as it would not work in either case if not after the 8 seconds or so :) I mean - we can neither check it's the correct message nor get the uid for the message until after the 8 seconds.

Most of the time another action would likely not be performed until much later though, so yeah... might be worth it.

I'm happy to hear I'm not missing some important feature though and that IMAP is the horror I've grown to know :) I didn't know most servers didn't allow multiple SELECTs to the same folder though...that's interesting. I wonder if that could account for some connection issues we've been having. I've built a pretty elaborate (but simple) connection framework for Mailbird, so we're actively reusing connections, making it hard to debug, but very efficient. We're seeing lots of situations where we don't get a response to a request within 30 seconds though. I'm wondering if that timeout is just too low or there is something wrong somewhere...

jstedfast commented 10 years ago

> You're welcome. Mail.dll's parser creates a much more "complete" message object though, from what I can see of the MimeKit message. MimeKit doesn't parse attachments, calendar appointments and different body structures (for multipart messages), does it?

It parses multiparts and "attachments" (attachments in MIME are just MIME parts with headers and content), but it doesn't parse the content of attachments (and so doesn't parse calendar appointments in text/calendar parts or HTML in text/html parts, etc).

> Then Mailbird might have the wrong uid referring to the other message and would actually delete the wrong draft :)

Yea, it's not an easy problem to solve. You might even be able to use the UID + INTERNALDATE (APPEND allows you to specify a date string which I think is normally used as the INTERNALDATE value by the server).

> I didn't know most servers didn't allow multiple SELECTs to the same folder though...that's interesting.

Keep in mind that my working knowledge of this is like 10 years old at this point (that and I'm basing this on my memory), so I could be wrong and/or things could have changed in the meantime.

> We're seeing lots of situations where we don't get a response to a request within 30 seconds though. I'm wondering if that timeout is just too low or there is something wrong somewhere...

30 seconds is a long time. Is that before you get a complete response? Or before the first (untagged?) response arrives after sending the request? If it's the latter, wow, that's a pretty big latency. I could easily see FETCH requests for lots of data taking 30 seconds to get the complete response, though. For example, 30 seconds to FETCH the UID and ENVELOPE data for every message in a mailbox might not be unreasonable if the mailbox is large.

mol commented 10 years ago

> It parses multiparts and "attachments" (attachments in MIME are just MIME parts with headers and content), but it doesn't parse the content of attachments (and so doesn't parse calendar appointments).

Oh yeah, I investigated and see this now: "The MimeMessage.Body is the top-level MIME entity of the message. Generally, it will either be a TextPart or a Multipart.". Cool.

> Yea, it's not an easy problem to solve. You might even be able to use the UID + INTERNALDATE (APPEND allows you to specify a date string which I think is normally used as the INTERNALDATE value by the server).

Yeah but unfortunately some servers annoyingly change it :) I'm actually using a few different methods depending on the server's capability when searching for the message uid (to make sure it's the right one). Some servers say they support searching for the (not guaranteed unique) message-id header, but never return anything for instance. You gotta love IMAP :) We do use that date as one of the fall back methods though, as a way to limit the uids to compare with as a sort of process of elimination technique :)
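The process-of-elimination idea could be sketched like this in Python: narrow the candidates by INTERNALDATE first, then compare Message-ID values, and refuse to answer when the result is ambiguous. `Candidate`, `find_appended_uid` and the 5-minute tolerance are hypothetical; the candidate data would come from a FETCH of each message's UID, INTERNALDATE and Message-ID header.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Candidate:
    uid: int
    internal_date: datetime
    message_id: Optional[str]


def find_appended_uid(candidates: List[Candidate],
                      appended_at: datetime,
                      message_id: str,
                      tolerance: timedelta = timedelta(minutes=5)) -> Optional[int]:
    """Return the UID of the appended message, or None if not uniquely identified."""
    # Step 1: keep only messages whose INTERNALDATE is close to the append time.
    near = [c for c in candidates
            if abs(c.internal_date - appended_at) <= tolerance]
    # Step 2: among those, compare the (not guaranteed unique) Message-ID.
    matches = [c for c in near if c.message_id == message_id]
    return matches[0].uid if len(matches) == 1 else None
```

Because some servers rewrite the date, a `None` result here would need a further fallback rather than being treated as proof that the message is missing.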

> 30 seconds is a long time. Is that before you get a complete response? Or before the first (untagged?) response arrives after sending the request? If it's the latter, wow, that's a pretty big latency. I could easily see FETCH requests for lots of data taking 30 seconds to get the complete response, though. For example, 30 seconds to FETCH the UID and ENVELOPE data for every message in a mailbox might not be unreasonable if the mailbox is large.

The time from sending a request to getting the first bit of the response. I'd assume it was high enough too, but I'm not sure. Just wondering if you had encountered it with Evolution perhaps :) I'll figure it out.

jstedfast commented 10 years ago

Been on a 36-hour hacking marathon and have made awesome progress on an ImapClient implementation for MailKit.

At this point, I mostly just need to implement the metric ton of methods on ImapFolder and work out any kinks in my ImapEngine and ImapStream.

All I can say is holy crap a ton of IMAP extensions were drafted up and published since I last looked at IMAP. I'm using a bitfield of ImapCapabilities and am almost out of bits! Might have to switch to a 64-bit enum if I don't stop finding more extensions...
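For readers unfamiliar with the pattern, a capability bitfield can be sketched in Python with `enum.IntFlag`. The member names below are real IMAP capability names, but the selection and bit assignments are illustrative and are not MailKit's actual ImapCapabilities enum.

```python
from enum import IntFlag


class ImapCapabilities(IntFlag):
    NONE = 0
    IMAP4REV1 = 1 << 0
    IDLE = 1 << 1
    UIDPLUS = 1 << 2
    CONDSTORE = 1 << 3
    QRESYNC = 1 << 4
    LITERAL_PLUS = 1 << 5  # advertised by servers as "LITERAL+"


# Map the tokens of a CAPABILITY response to flags.
_BY_TOKEN = {
    "IMAP4REV1": ImapCapabilities.IMAP4REV1,
    "IDLE": ImapCapabilities.IDLE,
    "UIDPLUS": ImapCapabilities.UIDPLUS,
    "CONDSTORE": ImapCapabilities.CONDSTORE,
    "QRESYNC": ImapCapabilities.QRESYNC,
    "LITERAL+": ImapCapabilities.LITERAL_PLUS,
}


def parse_capabilities(line: str) -> ImapCapabilities:
    """Fold a '* CAPABILITY ...' line into one integer; unknown tokens are ignored."""
    caps = ImapCapabilities.NONE
    for token in line.upper().split():
        caps |= _BY_TOKEN.get(token, ImapCapabilities.NONE)
    return caps


caps = parse_capabilities("* CAPABILITY IMAP4rev1 IDLE UIDPLUS")
print(ImapCapabilities.UIDPLUS in caps)
```

Each capability costs one bit, so a fixed-width enum runs out once the number of known extensions exceeds its width, which is the situation described above.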

jstedfast commented 10 years ago

I had a thought this morning... have you tried using the CHECK command after your APPEND?

The CHECK command is supposed to force the IMAP server to flush its cache of pending writes to the mailbox, so perhaps that will work?

mol commented 10 years ago

> Been on a 36-hour hacking marathon and have made awesome progress on an ImapClient implementation for MailKit.
>
> At this point, I mostly just need to implement the metric ton of methods on ImapFolder and work out any kinks in my ImapEngine and ImapStream.
>
> All I can say is holy crap a ton of IMAP extensions were drafted up and published since I last looked at IMAP. I'm using a bitfield of ImapCapabilities and am almost out of bits! Might have to switch to a 64-bit enum if I don't stop finding more extensions...

Cool. I might give it a try at some point - although the thought of having to migrate IMAP component is a little daunting :)

I haven't looked at the code, so I don't know if you've already implemented it, but how about proxy server support? And one thing that I find missing from Mail.dll is an event(s) to listen to, to see all requests and responses going to and from the server, to measure how many bytes have been downloaded for instance if wanting to implement a progress bar for attachment download (I actually had to create my own method for attachment download because of this, but have had some issues with it. At the moment Lotus Domino servers seem to send a few too many bytes. Not sure if it's my code or the server yet).
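The kind of progress reporting asked for here can be sketched as a chunked read with a callback. This Python example is generic and not tied to Mail.dll or MailKit; `download_with_progress` and its parameters are hypothetical, and `total_bytes` is assumed to be the size announced by the server.

```python
import io
from typing import BinaryIO, Callable


def download_with_progress(source: BinaryIO,
                           total_bytes: int,
                           on_progress: Callable[[int, int], None],
                           chunk_size: int = 8192) -> bytes:
    """Read exactly total_bytes (or until EOF), reporting progress per chunk."""
    received = bytearray()
    while len(received) < total_bytes:
        # Never request more than the announced size, even if more is available.
        chunk = source.read(min(chunk_size, total_bytes - len(received)))
        if not chunk:
            break  # stream ended early
        received.extend(chunk)
        on_progress(len(received), total_bytes)
    return bytes(received)


data = download_with_progress(
    io.BytesIO(b"x" * 20000), 20000,
    lambda done, total: print(f"{done}/{total} bytes"))
```

Capping each read at the remaining byte count means that any surplus bytes stay unread in the stream instead of ending up in the attachment.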

> I had a thought this morning... have you tried using the CHECK command after your APPEND?
>
> The CHECK command is supposed to force the IMAP server to flush its cache of pending writes to the mailbox, so perhaps that will work?

Thanks for the suggestion. I'm currently running NOOP before each operation to ensure I have the latest data (trial and error indicated that would work), so I tried switching to CHECK instead, and that actually did "the opposite". After copying a message to a folder and getting all uids, there were none in the folder again and again and again running CHECK, but the second I ran a NOOP, there were uids in the folder. Strange... :)

jstedfast commented 10 years ago

MailKit's ImapClient is still very early stages, no proxy support yet - most of the ImapFolder methods aren't implemented. Mostly what's done is the core command pipeline state machine.

I haven't figured out how I'm going to do it yet, but what I want to provide is an IProgress API - not sure if I want to have that be an argument to all of the bigger "fetch message" and "append message"-type methods or what.

From my brief reading of the RFC, it seems that what you probably want to do is send CHECK only once after appending, and then send a NOOP perhaps? The CHECK command doesn't return any untagged responses like NOOP does, but NOOP is supposed to return immediately.
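The suggested sequence (APPEND, one CHECK, then NOOP, then list the UIDs) might be sketched with Python's standard `imaplib` as below. `append_then_list_uids` is a hypothetical helper; it assumes `conn` is an authenticated `imaplib.IMAP4` (or compatible) connection that already has `mailbox` selected, and whether the sequence actually helps is server-specific.

```python
import imaplib
import time
from typing import List


def append_then_list_uids(conn, mailbox: str, message: bytes) -> List[int]:
    """APPEND a message, then CHECK once, NOOP, and return the mailbox UIDs."""
    typ, _ = conn.append(mailbox, None,
                         imaplib.Time2Internaldate(time.time()), message)
    if typ != "OK":
        raise RuntimeError("APPEND failed")

    conn.check()  # ask the server to checkpoint the selected mailbox
    conn.noop()   # gives the server a chance to send pending updates

    typ, data = conn.uid("SEARCH", "ALL")
    if typ != "OK":
        raise RuntimeError("UID SEARCH failed")
    return [int(uid) for uid in data[0].split()] if data and data[0] else []
```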

Just guessing

mol commented 10 years ago

I've been testing some more, and it seems the issue is with new connections. If checking using the same connection that uploaded or copied (when using COPY), it seems the changes are reflected immediately, while another connection a second later will not find anything - until 9 or so seconds later still...

I'm pretty sure I've seen it not work though, with the same connection, at some point while working on it a long time ago, but I might be wrong.

So maybe keeping the same connection could be the key. I'll run some more tests :)