pali / Email-Address-XS

Parse and format RFC 5322 email addresses and groups
https://metacpan.org/pod/Email::Address::XS

non-ASCII addresses should not be considered valid #7

Closed jwilk closed 2 years ago

jwilk commented 2 years ago

This code:

use feature 'say';
use Email::Address::XS;
say Email::Address::XS->parse("\xFF\@jwilk.net")->is_valid;

prints 1.

But RFC 5322 addresses are ASCII-only.

pali commented 2 years ago

Hello! Non-7-bit characters are allowed to fully support Internationalized Email Headers as defined in RFC 6532.

Email::Address::XS is also used to store Unicode email addresses, which a MIME encoder then converts into a pure 7-bit ASCII object (the same Unicode text in a different representation, per RFC 2047).
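As a rough illustration of the RFC 2047 step mentioned above, the sketch below uses the `MIME-Header` encoding from Perl's core Encode module on a display name. The name is a made-up example, and this is only an assumption about how a caller might do the conversion, not necessarily what users of Email::Address::XS do. RFC 2047 encoded-words apply to display names and similar header text, not to the local part of an address.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical display name containing one non-ASCII character (U+00FC).
my $phrase = "J\x{FC}rgen";

# Convert the character string into an RFC 2047 encoded-word (ASCII only).
my $encoded = encode('MIME-Header', $phrase);

# Decoding the encoded-word should give back the original characters.
my $decoded = decode('MIME-Header', $encoded);

print "$encoded\n";
print $decoded eq $phrase ? "round trip ok\n" : "round trip failed\n";
```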

jwilk commented 2 years ago

Fair enough (although the address in my example is neither valid per RFC 6532 nor could it be MIME-encoded). But if this is intentional, it should be documented.

pali commented 2 years ago

> although the address in my example is neither valid per RFC 6532

I do not see a reason why. U+00FF is a fully valid Unicode code point: ÿ, LATIN SMALL LETTER Y WITH DIAERESIS. In UTF-8 it is encoded as 0xC3 0xBF.

Let's see:

$ perl -MEncode -e 'my $unicode = "\xFF"; my $utf8 = encode("UTF-8", $unicode); print $utf8;' | xxd -g 1
00000000: c3 bf                                            ..

jwilk commented 2 years ago

Oh, sure, I could have encoded U+00FF as UTF-8; but that's not what I did.

$ perl -MEmail::Address::XS -E 'say Email::Address::XS->parse("\xFF\@jwilk.net")->format' > addr

$ xxd < addr
00000000: ff40 6a77 696c 6b2e 6e65 740a            .@jwilk.net.

$ isutf8 < addr
(standard input): line 1, char 0, byte 0: Expecting bytes in the following ranges: 00..7F C2..F4.
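The same check can be sketched in Perl itself with the core Encode module. `is_valid_utf8` is a hypothetical helper name introduced only for this illustration; it relies on `decode` with `FB_CROAK` dying on malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # decode() may modify its argument when a CHECK value is given, so use a copy.
    my $copy = $bytes;
    # FB_CROAK makes decode() die on malformed input; eval catches that.
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# A lone 0xFF byte, as written to the file above.
print is_valid_utf8("\xFF\@jwilk.net") ? "valid\n" : "invalid\n";      # prints "invalid"

# U+00FF encoded as UTF-8 (0xC3 0xBF).
print is_valid_utf8("\xC3\xBF\@jwilk.net") ? "valid\n" : "invalid\n";  # prints "valid"
```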

pali commented 2 years ago

Yes, and this is the infamous problem: is the API input in Unicode or in UTF-8? Thankfully, this XS module is written so that all non-7-bit characters are passed through as-is, and the internal Perl utf8 flag is respected and correctly propagated. So not checking for characters >= 0x80 (non-7-bit ASCII) makes things work correctly without having to define whether the API takes Unicode, UTF-8, or any other encoding backward compatible with 7-bit ASCII.

pali commented 2 years ago

So... I do not see any issue here. The user just has to know how to use Unicode in Perl correctly.

pali commented 2 years ago

You cannot print a Unicode string directly to stdout or to a file. A Unicode string is just a sequence of ordinals (code points, numbers), with no specific format for how those numbers are encoded into a byte stream. UTF-8 is one specific encoding of Unicode strings into a byte stream (there are many others). So if you have a Perl Unicode string (a sequence of ordinals) and want to save it to a file, you first need to convert it to a byte stream.

If you try to print or store something that is not a byte stream, the result is garbage in, garbage out.
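A minimal sketch of that conversion, assuming the core Encode module and UTF-8 as the target encoding (any other encoding could be substituted):

```perl
use strict;
use warnings;
use Encode qw(encode);

# A Perl character string: the code point U+00FF followed by ASCII characters.
my $chars = "\xFF\@jwilk.net";

# Encode the code points into UTF-8 bytes before writing them anywhere.
my $bytes = encode('UTF-8', $chars);

# U+00FF becomes two bytes, so the byte string is one longer than the character string.
printf "%d characters, %d bytes\n", length($chars), length($bytes);
# prints: 11 characters, 12 bytes

# Hex dump of the encoded bytes.
print join(' ', map { sprintf '%02x', ord($_) } split //, $bytes), "\n";
# prints: c3 bf 40 6a 77 69 6c 6b 2e 6e 65 74
```

An alternative with the same effect is to set an encoding layer on the output handle, for example `binmode(STDOUT, ':encoding(UTF-8)')`, and print the character string itself.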

pali commented 2 years ago

I updated the documentation in commit a844a70ff96d62a9dff1db66a4a9622ff24f870b to address Internationalized Email Headers and Unicode.