rurban / Cpanel-JSON-XS

Improved fork of JSON-XS
http://search.cpan.org/dist/Cpanel-JSON-XS/
Other
39 stars 45 forks source link

Inconsistent behavior with decoding escaped non-characters vs non-escaped non-characters #227

Open G4Vi opened 2 months ago

G4Vi commented 2 months ago

Decoding an escaped non-character works as expected:

my $scalar = qq|"\\ufffe"|;
Cpanel::JSON::XS->new->allow_nonref->decode($scalar);

and produces a warning as described at https://metacpan.org/pod/Cpanel::JSON::XS#6.-Unicode-noncharacters-only-warn,-as-in-core.:

Unicode non-character U+FFFE is not recommended for open interchange at test.pl line 8.

Decoding a regular non-character does not:

my $scalar = qq|"\x{FFFE}"|;
Cpanel::JSON::XS->new->allow_nonref->decode($scalar);

No warning is produced.

Neither is a warning is produced when UTF-8 decoding is also done:

my $scalar = qq|"\x{FFFE}"|;
utf8::encode($scalar);
Cpanel::JSON::XS->new->allow_nonref->utf8->decode($scalar);

I'm undecided on whether it should warn when decoding without UTF-8 decoding as on one hand, these characters should have been discovered when they entered a Perl string (and may have warned already), but on the other hand it means during JSON decoding, escaped and non-escaped non-chars are handled differently. However, when UTF-8 decoding is done during JSON decoding it seems pretty clear they should be handled the same.

Under https://metacpan.org/pod/Cpanel::JSON::XS#JSON-and-ECMAscript is this paragraph outdated? "Unicode non-characters between U+FFFD and U+10FFFF are decoded either to the recommended U+FFFD REPLACEMENT CHARACTER (see Unicode PR #121: Recommended Practice for Replacement Characters), or in the binary or relaxed mode left as is, keeping the illegal non-characters as before." In my testing I never saw the non-characters be converted to the replacement character (nor do I think they should be).

I'm happy to make a PR after it's clarified how this should be handled.

rurban commented 2 months ago

Such clarifications need to be decided upstream

Grinnz commented 2 months ago

Given the corrigendum, an exception or conversion to the replacement character is certainly incorrect. Warning behavior would be up to preference. Several other tools handle this incorrectly as well.

G4Vi commented 2 months ago

JSON:PP does not warn for either case nor does it do any replacement when decoding non-chars. Should we still ask them? I wouldn't mind stripping out the non-char warnings to match JSON:PP.

rurban commented 2 months ago

I would be in favor to match JSON::PP, even if Unicode recommends otherwise.