rjbs / Email-MIME

perl library for parsing MIME messages
20 stars 30 forks source link

Fix as_string returning unicode #36

Closed andreas-p closed 7 years ago

andreas-p commented 7 years ago

When msgconvert assembled the email, I found the html message part to be corrupted, having the utf8 characters replaced by utf8-reencoded bytes. I found that parts_add will append the $part->as_string to $body. When the first part is a unicode text/plain body, the subsequent text/html (8bit encoded) part will be implicitely converted to unicode as well, which is wrong.

as_string should always return a string, never unicode. The PR fixes this.

pali commented 7 years ago

This patch is totally wrong!

You should never use utf8::is_utf8 function. It just tells internal representation how is perl string stored in C/XS code. And is fully invisible in pure perl code. E.g. my $str = "\x{A9}" and my $str = "\N{U+A9}" has same content, but one has internal utf8 flag and one not.

So I would close this pull request as wrong/invalid/incorrect.

Rather please report bug and provide example with data which triggers it.

andreas-p commented 7 years ago

If as_string returns a unicode-flagged string, the concatenation of $body will convert subsequent parts to unicode as well, which may not happen. as_string needs to return a sequence of bytes, not to be interpreted/converted in any way. parts_add needs to append the parts opaquely, never convert them, and as_string returning unicode-flagged is to blame here.

pali commented 7 years ago

If as_string returns a unicode-flagged string ...

It is UTF8 flag, not Unicode flag. and UTF8 flag is invisible for (pure) perl code. As wrote, this patch is incorrect and break Unicode support as it encode string to UTF-8 only and only if invisible internal UTF8 flag is set.

This my request still remains:

Rather please report bug and provide example with data which triggers it.

andreas-p commented 7 years ago

While that utf8-flag (which is a specific type of unicode, FWIW...) may be invisible, it forces implicite conversion when concatenating. That's the backdraw of perl, makes it really easy to mix up bytes and characters...

I cant produce test data myself, only have production .msg data I can't provide which have Content-Type: text/plain; charset="UTF-8", Content-Transfer-Encoding: 8bit which makes up the utf8 flagged part, and Content-Type: text/html; charset="UTF-8",Content-Transfer-Encoding: 8bit which is stored without utf8, but converted when added. The multipart data as shown above is probably the standard (together with a rtf part) created by Outlook/Exchange.

pali commented 7 years ago

While that utf8-flag (which is a specific type of unicode, FWIW...) may be invisible ...

That flag just say how are data internally in memory stored. It has nothing to do if they are UTF-8 encoded or not! You can have UTF-8 data with SVf_UTF8 flag. And also you can have UTF-8 without SVf_UTF8 flag!

... it forces implicite conversion when concatenating.

It just forces change of internal representation of string in memory. But perl string is same and not changed.

As wrote in first post, try that example yourself. E.g. with this code:

my $str1 = "\x{A9}";
my $str2 = "\N{U+A9}";
printf "str1 SVf_UTF8: %d\n", utf8::is_utf8($str1) ? 1 : 0;
printf "str2 SVf_UTF8: %d\n", utf8::is_utf8($str2) ? 1 : 0;
print "strings are same\n" if $str1 eq $str2;
str1 SVf_UTF8: 0
str2 SVf_UTF8: 1
strings are same

Now your patch run utf8::encode on $str2 and then those two strings are not same anymore and break whole Unicode and this module. Therefore your patch is incorrect and so unacceptable.

I cant produce test data myself...

Sorry, but I still do not understand your problem. So really please try to generate data, which you think trigger some bug, and provide output from Email::MIME and also output which you think is correct. Then I can look at it.

andreas-p commented 7 years ago

The flag certainly needs to be removed, and downgrade didn't do the job. perl does SOMETHING when concatenating one utf8-flagged and one non-utf8-flagged , resulting utf8 in total.

Don't use print trying to prove anything; it probably does conversion itself! I debugged by writing all intermediate data, and found that encoding the first multipart from utf8 kept the second multipart intact.

I encounter the problem on any outlook mail converted using msgconvert (which as a but itself, it writes utf8-encoded where it should write raw; fixed in latest git). Any mail with Umlauts or other accents should do.

pali commented 7 years ago

The flag certainly needs to be removed...

I will repeat it again: You should not care about SVf_UTF8 flag unless you are writing XS code. It is fully invisible in non-XS code. And Email::MIME is non-XS.

So no flag is going to be removed.

Don't use print trying to prove anything...

Hm? To print if SVf_UTF8 flag is set? Seems you fundamentally did not understood it. I'm stopping conversation here as all needed information are there. This patch is NACK from my side as it is total nonsense. If there is bug in Email::MIME, please report it with information how to reproduce. Thanks!

andreas-p commented 7 years ago

There's a reason why perl is my most hated language. From my side, every info is there too. Take it or leave it, took me 6 hours to trace this down.

pali commented 7 years ago

every info is there too.

So what info is there?

And what have you did?

  1. Sent patch without any useful information in commit message
  2. Commit message "Fix as_string returning unicode" does not say anything relevant as perl strings in 5.8+ are Unicode, so as_string already returns Unicode string.
  3. Your patch looks like total nonse as it is trying to do something based on internal SVf_UTF8 flag which is absolutely not needed in perl (non-XS) code
  4. Your patch is breaking Unicode support and breaking as_string function

Any point above is very good reason to reject patch.

pali commented 7 years ago

Please provide what and where is problem fogether with example input + code which cause it.

How should other people understand and reproduce problem and then verify that some patch really fixes it?

andreas-p commented 7 years ago

Already spent too much time on this. I don't appreciate being teached not to use is_utf8 if I have found this to be the very reason for text/plain messing up the text/html part.

pali commented 7 years ago

I don't appreciate being teached not to use is_utf8

Then please study yourself what is is_utf8 doing and do it before you try to break other modules.

You still have not written what is the problem you are tring to fix. How to reproduce it. And how to verify that it is fixed. How should other people know that in Email-MIME is some problem (and not in other modules!) if nobody is able to reproduce it? And you still have not provided code for Email-MIME which trigger this problem?

andreas-p commented 7 years ago

The code triggering it is msgconvert. PR is retracted, I'm generally vastly uninterested in perl, case closed.