rjbs / Email-MIME

perl library for parsing MIME messages

Decoding UTF-8 header weird behavior / bug #78

Closed bbkr closed 2 years ago

bbkr commented 2 years ago

This bug happens when decoding Quoted-Printable headers, and the behavior is really weird:

Single ć as name:

my $parsed = Email::MIME->new(q{From: =?UTF-8?Q?=C4=87?= <x@example.com>}."\r\n\r\n");
say $parsed->header_str('from');
ć <x@example.com>

(perfect, works)

Single é as name:

my $parsed = Email::MIME->new(q{From: =?UTF-8?Q?=C3=A9?= <x@example.com>}."\r\n\r\n");
say $parsed->header_str('from');
� <x@example.com>

(broken, decoding replaced valid UTF-8 character with replacement character fffd)

Combined éć as name:

my $parsed = Email::MIME->new(q{From: =?UTF-8?Q?=C3=A9=C4=87?= <x@example.com>}."\r\n\r\n");
say $parsed->header_str('from');
éć <x@example.com>

(é suddenly works?)

Email::MIME 1.949 Email::Address::XS 1.04

pali commented 2 years ago

Did you forget to call binmode *STDOUT, ':utf8';? Without it, say cannot print a Unicode string to STDOUT as UTF-8.
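A minimal sketch of what the output layer changes, assuming the decoded header is an ordinary Perl character string. The string is built by hand with Encode so the example does not depend on Email::MIME, and an in-memory handle stands in for STDOUT so the written bytes can be inspected. It uses ':encoding(UTF-8)', the stricter variant of the ':utf8' layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the value header_str returns: a decoded character string.
# (Assumption: built by hand here so the sketch does not need Email::MIME.)
my $from = decode('UTF-8', "\xC3\xA9") . ' <x@example.com>';

# Write through an explicit UTF-8 layer. An in-memory handle is used so the
# bytes can be inspected; for a terminal the equivalent call would be
# binmode(STDOUT, ':encoding(UTF-8)').
my $written = '';
open(my $out, '>:encoding(UTF-8)', \$written) or die "cannot open: $!";
print {$out} $from;
close($out) or die "cannot close: $!";

# Show the raw bytes that reached the handle, as hex.
print join(' ', map { sprintf '%02X', ord } split //, $written), "\n";
```

The first two bytes should be C3 A9, the UTF-8 encoding of é, which is what a UTF-8 terminal expects.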

bbkr commented 2 years ago

Thanks, that was it!

pali commented 2 years ago

Single ć as name:

my $parsed = Email::MIME->new(q{From: =?UTF-8?Q?=C4=87?= <x@example.com>}."\r\n\r\n");
say $parsed->header_str('from');
ć <x@example.com>

(perfect, works)

Nope, it does not work. It produces the warning Wide character in say, which you probably disabled or ignored. This warning is important here because it tells you that Perl could not print the string in the handle's encoding (probably ISO-8859-1) and printed it as UTF-8 instead.

$ perl -W -Mstrict -MEmail::MIME -Mfeature=say -e 'my $parsed = Email::MIME->new(q{From: =?UTF-8?Q?=C4=87?= <x@example.com>}."\r\n\r\n"); say $parsed->header_str("from");'
Wide character in say at -e line 1.
ć <x@example.com>

Single é as name:

my $parsed = Email::MIME->new(q{From: =?UTF-8?Q?=C3=A9?= <x@example.com>}."\r\n\r\n");
say $parsed->header_str('from');
� <x@example.com>

(broken, decoding replaced valid UTF-8 character with replacement character fffd)

This is not broken; it works correctly. You have not explicitly configured *STDOUT to print in UTF-8, so the default encoding (ISO-8859-1) was used. Your terminal is probably not set to Perl's default encoding (ISO-8859-1) and therefore shows garbage.

You can verify that the output is correct ISO-8859-1 by piping perl's output to iconv, which converts it from ISO-8859-1 to UTF-8 (I guess your terminal is in UTF-8):

$ perl -W -Mstrict -MEmail::MIME -Mfeature=say -e 'my $parsed = Email::MIME->new(q{From: =?UTF-8?Q?=C3=A9?= <x@example.com>}."\r\n\r\n"); say $parsed->header_str("from");' | iconv -f latin1 -t utf-8
é <x@example.com>

(my terminal is UTF-8, perl printed in latin1 = ISO-8859-1, and iconv converted the output from perl's encoding to my terminal's encoding)
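The same check can be sketched in Perl itself with the core Encode module. This is an illustration under the assumption that the terminal behaves like a lenient UTF-8 decoder; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $e_acute = decode('UTF-8', "\xC3\xA9");   # one character, U+00E9

# Without an output layer Perl writes this character as the single
# Latin-1 byte 0xE9.
my $latin1 = encode('ISO-8859-1', $e_acute);

# A UTF-8 terminal tries to read that lone byte as UTF-8. It is not a
# valid sequence, so a lenient decoder substitutes U+FFFD.
my $as_terminal = decode('UTF-8', $latin1);

# What iconv -f latin1 -t utf-8 does: read as Latin-1, emit UTF-8.
my $via_iconv = encode('UTF-8', decode('ISO-8859-1', $latin1));

printf "latin1 bytes:           %s\n", unpack('H*', $latin1);
printf "seen by UTF-8 terminal: U+%04X\n", ord $as_terminal;
printf "after iconv:            %s\n", unpack('H*', $via_iconv);
```

The lone byte e9 decodes to U+FFFD, while the converted bytes c3a9 are the valid UTF-8 form of é.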

Combined éć as name:

my $parsed = Email::MIME->new(q{From: =?UTF-8?Q?=C3=A9=C4=87?= <x@example.com>}."\r\n\r\n");
say $parsed->header_str('from');
éć <x@example.com>

(é suddenly works?)

It also produces a warning:

$ perl -W -Mstrict -MEmail::MIME -Mfeature=say -e 'my $parsed = Email::MIME->new(q{From: =?UTF-8?Q?=C3=A9=C4=87?= <x@example.com>}."\r\n\r\n"); say $parsed->header_str("from");'
Wide character in say at -e line 1.
éć <x@example.com>

But what happens here? Why did Perl automatically do something different (with a warning) in the third case? This has nothing to do with Email::MIME or Email::Address::XS. Unfortunately, this is standard Perl behavior. It is a bug which, for backward compatibility, will never be fixed, and it is documented as The "Unicode Bug".
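The three cases can be reproduced without Email::MIME. The sketch below prints each name to a handle that has no encoding layer and records the bytes written and whether Perl warned. raw_print is a hypothetical helper, and an in-memory handle is assumed to behave like an unconfigured STDOUT for this purpose.

```perl
use strict;
use warnings;

# Print a character string to a handle with no encoding layer and return
# the bytes written (as hex) plus any warning raised. Hypothetical helper.
sub raw_print {
    my ($text) = @_;
    my ($bytes, $warning) = ('', '');
    local $SIG{__WARN__} = sub { $warning .= $_[0] };
    open(my $out, '>', \$bytes) or die "cannot open: $!";
    print {$out} $text;
    close($out) or die "cannot close: $!";
    return (unpack('H*', $bytes), $warning);
}

# U+0107 alone, U+00E9 alone, then both together.
for my $name ("\x{107}", "\x{E9}", "\x{E9}\x{107}") {
    my ($hex, $warning) = raw_print($name);
    printf "%-8s %s\n", $hex, ($warning ? 'warned' : 'no warning');
}
```

A string whose characters all fit in one byte (é alone) goes out as Latin-1, e9, with no warning. As soon as the string contains a character above 0xFF (ć), Perl cannot do that, so it writes the whole string as UTF-8 and warns. That is why é appears to work only when ć is next to it.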

If you have not read about it, see the perlunicode documentation: https://metacpan.org/pod/perlunicode#The-%22Unicode-Bug%22

It is important to understand how Perl handles Unicode, because it works differently from other programming languages, and misunderstanding it may lead to other bugs...