rjbs / Email-MIME

perl library for parsing MIME messages

MIME messes with unicode #37

Closed andreas-p closed 1 year ago

andreas-p commented 7 years ago

When using msgconvert, HTML message bodies get garbled: the text part is a Unicode string, and concatenating the two parts implicitly converts the HTML body. A working patch that fixes the issue was supplied; don't ask for more information (not subscribed).
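For readers unfamiliar with the effect described here, the following is a minimal sketch of how joining a character string with a byte string can garble the byte part in Perl. It uses only the core Encode module; the sample strings and variable names are made up for illustration and are not taken from msgconvert or from the supplied patch.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample data, not from the original report.
my $text = "price: \x{20AC}5\n";   # character string (contains U+20AC)
my $html = "<p>caf\xC3\xA9</p>";   # raw UTF-8 bytes of "<p>cafe-acute</p>"

# Joining the two treats every byte of $html as a Latin-1 character,
# so encoding the result encodes the HTML bytes a second time.
my $broken = encode('UTF-8', $text . $html);

# Decoding the bytes first keeps both halves as characters.
my $fixed = encode('UTF-8', $text . decode('UTF-8', $html));

my $double_encoded = index($broken, "caf\xC3\x83\xC2\xA9") >= 0 ? 1 : 0;
my $intact         = index($fixed,  "caf\xC3\xA9")         >= 0 ? 1 : 0;

print "broken output is double-encoded: $double_encoded\n";
print "fixed output is intact: $intact\n";
```

The general rule this sketch relies on is to decode bytes to characters before combining them with other character strings, and to encode once at the output boundary.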

choroba commented 7 years ago

https://github.com/rjbs/Email-MIME/pull/36

rjbs commented 7 years ago

Is this the msgconvert in question? http://www.matijs.net/software/msgconv/

choroba commented 7 years ago

I guess so.

pali commented 7 years ago

Please provide an example email or some other test case. Without one, and without proper reproduction steps, it is impossible to locate the problem.

pali commented 7 years ago

As @andreas-p still has not provided any input that triggers this problem, I think this bug can be closed.

I guess it is either related to UTF-8 Email::MIME::ContentType attributes or to messing with wide Unicode strings in Email::Simple.

Long and UTF-8 Content-Type attributes were fixed by https://github.com/rjbs/Email-MIME-ContentType/pull/5 in Email::MIME::ContentType version 1.020.

Safety checks that prevent messing with non-byte characters (above U+FF) in Email::Simple are in this pull request: https://github.com/rjbs/Email-Simple/pull/17

mvz commented 4 years ago

I think this may be related to https://github.com/mvz/email-outlook-message-perl/issues/14 which finally includes instructions for reproduction.

I'm not yet sure where things go wrong, since several things interact here: The Email::MIME library, my code in Email::Outlook::Message, Perl's output encoding behavior, and finally the relevant email standards.

mvz commented 4 years ago

I think I've found the root cause of the problem: msg files can store text either as 'unicode' (UTF-16LE) or as 8-bit strings. Email::Outlook::Message would decode the former into a Perl string but keep the latter as its original sequence of bytes. Later on, both would be assigned to body when creating mail parts. Since body expects a sequence of bytes, this would lead to breakage in the case of the unicode data. Assigning to body_str would of course lead to breakage in the other case, which would be caught sooner due to the encoding check.
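A rough sketch of one way to avoid the mismatch is to normalize both storage forms to a Perl character string first, and only then decide whether to hand over characters or encoded bytes. The helper property_to_chars and its parameters are hypothetical and are not part of Email::Outlook::Message; the cp1252 default for 8-bit data is an assumption, since the real code page depends on the message.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn a raw .msg text property into a character
# string, whichever way it was stored in the file.
sub property_to_chars {
    my ($raw, $is_unicode, $codepage) = @_;
    return decode('UTF-16LE', $raw) if $is_unicode;
    # Assumption: 8-bit text defaults to cp1252 when no code page is given.
    return decode($codepage // 'cp1252', $raw);
}

# The same four characters, stored in the two possible ways.
my $raw_unicode = encode('UTF-16LE', "caf\x{E9}");
my $raw_8bit    = "caf\xE9";

my $from_unicode = property_to_chars($raw_unicode, 1);
my $from_8bit    = property_to_chars($raw_8bit, 0, 'cp1252');

# Both are now character strings; encode once to get bytes.
my $body_bytes = encode('UTF-8', $from_unicode);

print "same characters: ", ($from_unicode eq $from_8bit ? "yes" : "no"), "\n";
print "encoded length: ", length($body_bytes), " bytes\n";
```

With everything in one form, a character string such as $from_unicode would be the kind of value meant for body_str, while encoded bytes such as $body_bytes would be the kind of value meant for body.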

TL;DR: I'm convinced the problem is in Email::Outlook::Message and this ticket can be closed.

rjbs commented 1 year ago

Thanks @mvz