mikel / tmail

TMail is a Ruby Email Handler.
http://tmail.rubyforge.org/

Iconv exception when parsing incorrectly encoded multibyte strings in TMail::Unquoter.unquote_and_convert_to #3

Open Soph opened 15 years ago

Soph commented 15 years ago

This isn't really a bug in TMail, since the string being passed is incorrectly encoded. But since a lot of sites seem to send subjects encoded this way, it would be worth a fix or workaround.

For example, this subject from a Facebook notification email:

Subject: =?UTF-8?Q?   Stefan_Haubold_hat_dich_als_FreundIn_auf_Facebook_hinzugef=C3?= 
=?UTF-8?Q?=BCgt_...?=

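The problem is that Facebook's header encoder isn't multibyte-safe. In UTF-8, the German umlaut "ü" in the word "hinzugefügt" is the two-byte sequence C3 BC. Facebook encodes its subject as encoded-words with quoted-printable ("Q") encoding. According to the encoded-word RFC (RFC 2047), an encoded-word may not be longer than 75 characters, including 'charset', 'encoding', 'encoded-text', and delimiters. Facebook follows this rule but splits in the middle of the multibyte character, so the first encoded-word ends with C3 and the second begins with BC. The same RFC also requires each encoded-word to contain a whole number of characters, and that is the rule being broken here.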
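To make the split concrete, here is a minimal sketch in modern Ruby (String encodings rather than Iconv) that decodes the tail of the first encoded-word and the second encoded-word separately, then together. The helper `q_decode` is hypothetical and written only for this illustration; it is not part of TMail.

```ruby
# Decode the Q-encoded payload of one encoded-word into raw bytes.
# Hypothetical helper for illustration only; it is not part of TMail.
def q_decode(payload)
  # In Q encoding "_" stands for a space; the rest is quoted-printable.
  payload.tr("_", " ").unpack("M").first
end

# Payload fragments taken from the Facebook subject above.
first_bytes  = q_decode("hinzugef=C3")
second_bytes = q_decode("=BCgt_...")

# Decoded separately, each fragment holds only half of the two-byte "ü".
first_alone  = first_bytes.dup.force_encoding("UTF-8")
second_alone = second_bytes.dup.force_encoding("UTF-8")
puts first_alone.valid_encoding?
puts second_alone.valid_encoding?

# Joining the raw bytes before interpreting them as UTF-8 restores the character.
joined = (first_bytes + second_bytes).force_encoding("UTF-8")
puts joined.valid_encoding?
puts joined
```

Neither fragment is valid UTF-8 on its own, which is why converting each encoded-word individually fails, while the concatenated bytes decode cleanly.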

When TMail tries to parse this subject, and passes the string to iconv, an exception is raised:

Iconv::InvalidCharacter: "\303"

We fixed this on our side like this:

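begin
  subject = TMail::Unquoter.unquote_and_convert_to(envelope.subject, 'utf-8')
rescue Iconv::InvalidCharacter
  # Retry after merging adjacent UTF-8 Q-encoded words into one encoded-word,
  # so a multibyte character split across two words is rejoined before conversion.
  # \s* tolerates folding whitespace of any length between the words.
  merged = envelope.subject.gsub(/\?=\s*=\?utf-8\?q\?/i, "")
  subject = TMail::Unquoter.unquote_and_convert_to(merged, 'utf-8')
end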

Just make one large encoded-word out of all the parts and call unquote_and_convert_to again. This fix may need more work for other multibyte encodings (UTF-16?), since the pattern only matches UTF-8 words with Q encoding. But it works for us so far.
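One possible way to extend the workaround above beyond UTF-8 is to merge any two adjacent Q-encoded words that declare the same charset. The following is only a sketch under that assumption: `merge_adjacent_q_words` and `ADJACENT_Q_PAIR` are hypothetical names, not TMail API, and B-encoded (base64) words are deliberately left alone because their payloads cannot always be joined by plain concatenation.

```ruby
# Hypothetical helper, not part of TMail.
# Matches one Q-encoded word followed by whitespace and the header of another
# Q-encoded word with the same charset (backreference \2).
ADJACENT_Q_PAIR = /(=\?([^?\s]+)\?q\?[^?]*)\?=\s+=\?\2\?q\?/i

def merge_adjacent_q_words(header)
  merged = header.dup
  # Each pass joins one pair; repeat until no mergeable pair remains,
  # so runs of three or more words collapse into a single word.
  loop do
    break unless merged.sub!(ADJACENT_Q_PAIR, '\1')
  end
  merged
end

split_subject = "=?UTF-8?Q?hinzugef=C3?=\r\n =?UTF-8?Q?=BCgt_...?="
puts merge_adjacent_q_words(split_subject)
```

The merged word can exceed the 75-character limit, so the result is only suitable as input to a lenient decoder, not for writing back into a message. Words with different charsets are left untouched.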

Here is a diff patch for test_quote.rb, along with the fixture needed for that test: http://gist.github.com/184582