dump_string must not care about Perl's internal representation of a variable

2shortplanks commented 5 years ago

As of 0.0016 dump string actually cares what the internal UTF8 flag of a scalar is set to and behaves differently depending on what state it is in.

Perl's current internal representation should have no user visible effects. A library should be able to return a scalar containing the bytes \x{C3}\x{A9} with the UTF8 flag off, or a scalar containing \x{C3}\x{83}\x{C2}\x{A9} with the UTF8 flag on, and both be considered the byte sequence for é when decoded as UTF-8.

perlpunk commented 5 years ago

Could you give me a small example code, and what you expect? Are you talking only about the result from dump or also about the reloaded data?

YAML::PP is now behaving like YAML::XS (with the exception that YAML::XS::Dump() returns utf8 encoded data).

YAML 1.2 says that YAML stream consists of unicode characters, and the contents of scalar nodes too. That means that if I get input that does not have the utf8 flag, and has the content \x{C3}\x{A9} for example, it is considered binary data, which YAML does not support, so I need to upgrade it to utf8.

Consider the input string \303\300 (\x{c3}\x{c0}). This is binary data, and it is not valid unicode.

my $yaml = $yp->dump_string("\303\300");

If I didn't upgrade the input string, the resulting YAML would contain: --- \303\300\n which is invalid unicode and therefor invalid YAML.

Note that I'm not a unicode expert, so please correct me if I'm saying something wrong.

I can think of adding options for YAML::PP to control behaviour, but first I would like to understand your use case.

Edit: removed YAML.pm and YAML::Tiny Edit: fix typo

perlpunk commented 5 years ago

Here is a comparison. The behaviour for YAML.pm and YAML::Tiny changes, if the input also contains a decode_utf8("ü"):

% perl -wE'
use Encode;
use Devel::Peek;
use YAML ();
use YAML::XS ();
use YAML::PP ();
use YAML::Tiny ();
my $utf8 = decode_utf8("ü");
my $binary = "\303";
Dump(YAML::Dump($binary, $utf8));
Dump(YAML::PP::Dump($binary, $utf8));
Dump(YAML::Tiny::Dump($binary, $utf8));
Dump(YAML::XS::Dump($binary, $utf8));
Dump(YAML::Dump($binary));
Dump(YAML::PP::Dump($binary));
Dump(YAML::Tiny::Dump($binary));
Dump(YAML::XS::Dump($binary));'
SV = PV(0x562fa0b76b70) at 0x562fa0baac88
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x562fa0568280 "--- \303\203\n--- \303\274\n"\0 [UTF8 "--- \x{c3}\n--- \x{fc}\n"]
  CUR = 14
  LEN = 24
  COW_REFCNT = 1
SV = PV(0x562fa05675f0) at 0x562fa0bb4770
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x562fa0bb58f0 "--- \303\203\n--- \303\274\n"\0 [UTF8 "--- \x{c3}\n--- \x{fc}\n"]
  CUR = 14
  LEN = 24
  COW_REFCNT = 0
SV = PV(0x562fa0567670) at 0x562fa09f03f8
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x562fa0bb8b50 "--- \303\203\n--- \303\274\n"\0 [UTF8 "--- \x{c3}\n--- \x{fc}\n"]
  CUR = 14
  LEN = 17
  COW_REFCNT = 1
SV = PV(0x562fa0567670) at 0x562fa09f03f8
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x562fa0bb55b0 "--- \303\203\n--- \303\274\n"\0
  CUR = 14
  LEN = 24
SV = PV(0x562fa0567670) at 0x562fa0bb4320
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK)
  PV = 0x562fa0aae840 "--- \303\n"\0
  CUR = 6
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x562fa05676a0) at 0x562fa0a9ed70
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x562fa0aaeef0 "--- \303\203\n"\0 [UTF8 "--- \x{c3}\n"]
  CUR = 7
  LEN = 10
  COW_REFCNT = 0
SV = PV(0x562fa0b76ab0) at 0x562fa0bb4338
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x562fa0bb57f0 "--- \303\n"\0
  CUR = 6
  LEN = 10
SV = PV(0x562fa0b76ab0) at 0x562fa0bb4338
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x562fa0bb57f0 "--- \303\203\n"\0
  CUR = 7
  LEN = 10

So to me YAML::PP and YAML::XS behave more consistent.

pali commented 4 years ago

@2shortplanks is fully right!

UTF8 flag is relevant only for XS code and says if returned C char* buffer is encoded in UTF-8 or in Latin1. UTF8 flag does not (or rather should not) expose to pure Perl code, which YAML::PP is.

About YAML::XS, it has bugs in handling Unicode. And YAML::PP should not try to emulate these bugs, but be rather correct Unicode implementation.

I have somewhere tests for YAML::Syck (not for YAML::XS) where it shows that YAML::Syck is broken in handling of Unicode. If you want these tests I could try to find them...

pali commented 4 years ago

https://metacpan.org/pod/Encode#is_utf8

Typically only necessary for debugging and testing
Don't use this flag as a marker to distinguish character and binary data
If STRING has UTF8 flag set, it does NOT mean that STRING is UTF-8 encoded and vice-versa
utf8 also has the utf8::is_utf8 function

So usage of is_utf8 function is in YAML::PP wrong. It does not say anything about binary data or UTF-8 data.

As @2shortplanks said, YAML::PP must not care about UTF8 flag exposed by is_utf8 function.

perlpunk commented 4 years ago

I'm not saying anyone of you is wrong. But I asked for a use case. A code example.

I would like to do it right, and as far as I can see for detecting invalid characters or characters that need to be quoted I need a string with unicode characters. I have two choices: do a decode_utf8 or utf8::upgrade. Right?

I can leave out the check for is_utf8 and call utf8::upgrade in every case. But I guess that results in the same behaviour anyway.

choroba commented 4 years ago

Another option would be to fail with invalid input.

pali commented 4 years ago

utf8::upgrade in pure perl code basically do nothing (it is just optimization for following string operations) and can be fully avoided. It does not do any validation nor check.

decode_utf8 takes sequence of octets (scalar with values from range 0x00...0xFF), treats them as UTF-8 sequence and returns sequence of Unicode code points (scalar value with values from range 0x000000..0x10FFFF). So it is doing conversion from UTF-8 to Unicode.

And now question is what you need to. It is needed to know:

1) What is the user input? UTF-8 (octets) or Unicode (code points)? 2) What is expected on output? UTF-8 (octets) or Unicode (code points)?

To make it more easier, I would take an example from Cpanel::JSON::XS module which is JSON encoder / decoder.

Cpanel::JSON::XS's decode_json function is: expecting on its input UTF-8 (octets) string and returns structure (hash/array/...) with string values in Unicode. encode_json function takes as it input perl structure with string values in Unicode and returns one scalar in UTF-8.

So I think that YAML encoder/decoder could do same thing as JSON encoder/decoder.

perlpunk commented 4 years ago

Another option would be to fail with invalid input.

Well, what do you mean with invalid input? How can I detect it? :)

perlpunk commented 4 years ago

@pali YAML::PP dump is supossed to take a data structure with Unicode characters and returns a string with Unicode characters. load works the same the other way around.

If I get latin1 and operate on it, checking for characters that need to be quoted, that can get wrong results. That's why I do utf8::upgrade.

Like I said in my first comment, I could add an option to control behaviour. But I would like to know the use case(s) first.

2shortplanks commented 4 years ago

I sorry Tina, I don’t understand what you’re saying.

It’s my understanding that YAML::PP is meant to take in a data structure that can contain strings and return a scalar that contains a bytes.

In either case you don’t need to know how Perl is storing the data internally. It could store the characters as one character per byte. It could use multiple bytes to represent a character. It could paint symbols on the side of elephants. None of it matters. Please pretend the utf8 flag doesn’t exist.

I do not understand what you mean when you say “ If I get latin1 and operate on it, checking for characters that need to be quoted, that can get wrong results.”

The characters that Perl has in its data structure aren’t stored in Latin-1 or UTF-8. They’re characters not bytes. You don’t have to “check” for characters to be encoded - all characters need to be encoded with Encode::encode (because they’re characters not bytes).

If I try and encode [“L\x{e9}on”] then I expect the result to work. “L\x{e9}on” Is a perfectly good string that can be represented in UTF-8.

Mark

On Wed, Jan 22, 2020 at 3:25 PM Tina Müller (tinita) < notifications@github.com> wrote:

@pali https://github.com/pali YAML::PP dump is supossed to take a data structure with Unicode characters and returns a string with Unicode characters. load works the same the other way around.

If I get latin1 and operate on it, checking for characters that need to be quoted, that can get wrong results. That's why I do utf8::upgrade.

Like I said in my first comment, I could add an option to control behaviour. But I would like to know the use case(s) first.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/perlpunk/YAML-PP-p5/issues/16?email_source=notifications&email_token=AAALNYWBD6WEFAMBXR5EDNLQ7BQPHA5CNFSM4HOBO4W2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJT7RIQ#issuecomment-577239202, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAALNYTSENYXQTF4ERKUFV3Q7BQPHANCNFSM4HOBO4WQ .

pali commented 4 years ago

@pali YAML::PP dump is supossed to take a data structure with Unicode characters and returns a string with Unicode characters. load works the same the other way around.

In this case it is simple: You do not have to (or rather should not) use any utf8 function.

Everything in pure perl is character related, one character is one Unicode code point.

Latin1/UTF-8 is needed to handle only in XS/C code which works with char* C buffers.

perlpunk commented 4 years ago

@2shortplanks ok, I think I understand now what result you expect. I just have to clarify a couple of things. So if you encode the perl data ["L\x{e9}on"] (L\351on), you want to get back the exact same value "---\n- L\351on\n".

Module	Result
Cpanel::JSON::XS	`["L\303\251on"]`
JSON::PP	`["L\303\251on"]`
YAML::PP (with `utf8::upgrade`)	`---\n- L\303\251on\n` (`[UTF8 "---\n- L\x{e9}on\n`)
YAML::PP (without `utf8::upgrade`)	`---\n- L\351on\n`
YAML::XS	`---\n- L\303\251on\n`
YAML	`---\n- L\351on\n`
YAML::Syck	`--- \n- L\351on\n`

So currently the YAML::PP (and YAML::XS) result is the same as Cpanel::JSON::XS & JSON::PP, but YAML::PP returns unicode characters.

It gets interesting as soon as I add another value to encode: ["L\x{e9}on", decode_utf8("ä")] then I get the following:

Module	Result
Cpanel::JSON::XS	`["L\303\251on","\303\244"]`
JSON::PP	`["L\303\251on","\303\244"]`
YAML::PP (with `utf8::upgrade`)	`---\n- L\303\251on\n- \303\244\n"` (`[UTF8 "---\n- L\x{e9}on\n- \x{e4}\n"]`)
YAML::PP (without `utf8::upgrade`)	`---\n- L\303\251on\n- \303\244\n"` (`[UTF8 "---\n- L\x{e9}on\n- \x{e4}\n"]`)
YAML::XS	`---\n- L\303\251on\n- \303\244\n`
YAML	`---\n- L\303\251on\n- \303\244\n"`
YAML::Syck	`--- \n- L\351on\n- \303\244\n`

(YAML::Syck is a bit weird, and YAML::XS::Load doesn't accept it.)

That means that as soon as I add other items to the data, the first value also changes. That's because due to the decoded "ä" the utf8::upgrade happens implicitly.

What I don't like about it, is that adding other data changes the kind of the result.

But like I said, I'm still not very good in unicode things. So my documentation says that Dump returns unicode characters. I wonder if that is still true if I remove the upgrade. Because if I Dump the string "\344", I get back "\344", but this is a latin1 character? On the other hand, comparing the result to decode_utf8("---\n- ä\n") returns 1.

What I'm also wondering, why is the current behaviour a problem for you? The internal result string might look different, but the string is the same.

Can someone help removing my confusion? :)

pali commented 4 years ago

Instead of Devel::Peek::Dump, you should look at output from: print join ' ', map { sprintf "U+%04x", ord($_) } split //, $string This will print sequence of ordinals stored in $string. And this is important.

Cpanel::JSON::XS::encode_json is automatically doing conversion from Unicode to UTF-8 -- equivalent of passing $output = Encode::encode("UTF-8", $input). If you want to prevent doing conversion from Unicode to UTF-8 in Cpanel::JSON::XS, you need to use its decorator oop API: Cpanel::JSON::XS->new->encode($input) (instead of API Cpanel::JSON::XS::encode_json($input)).

pali commented 4 years ago

Reading output from Dump is sometimes hard as you need to know what to read: You should always look at UTF8 part if is available and ignores other. Therefore I suggest to read output from my above form from previous post.

Can someone help removing my confusion? :)

I will try it. In string perl scalar you have always stored sequence of ordinals (numbers from 0 to 0x10FFFF). When comparing two strings via eq Perl is doing comparision on ordinals.

There are two internal representations of ordinals in scalar, but in pure perl code you do not have (easy) access to it (and you should not care about it). When needed Perl itself automatically convert from one representation to another if something requires it (explicit conversion can be done by those utf8::upgrade and downgrade function; but it is not necessary as Perl do it automatically when needed). So you can call eq without any problem on two scalars which are in different internal representation. Perl always correctly compares them (e.g. convert them to same representation and then do comparision).

I would suggest you to look at output from code which I posted in previous post on your tested modules, so see what is there really stored.

pali commented 4 years ago

@perlpunk let me know if it is more clear for you, or you need to explain some specific part of Unicode. I will try if there are still some unclear parts.

perlpunk commented 4 years ago

@pali thanks!

From what I see as the output from Devel::Peek::Dump and your snippet, it looks to me that both results actually return the correct string, no matter if I use upgrade or not, just with a different internal representation.

So if I leave it out, it won't be upgraded, unless necessary if it gets concatenated with another string.

If I remeber correctly, I made this change because someone wanted to dump binary data and there was a problem. I can't find the issue, though. But the test suceeds without the upgrade.

I'm wondering about the documentation

Input strings should be Unicode characters. If not, they will be upgraded with utf8::upgrade. Output will return Unicode characters (decoded).

What would be the best description, if I take out the upgrade?

I'm also wondering if the current code is really broken, in a way that the result would behave differently? When the internal representation is important for the result? Or it just about removing an unnecessary upgrade?

While looking at the Emitter code, I also think I just fixed some control-character escaping. I will also look in to your issue about YAML::PP::Schema::Binary!

pali commented 4 years ago

From what I see as the output from Devel::Peek::Dump and your snippet, it looks to me that both results actually return the correct string, no matter if I use upgrade or not, just with a different internal representation.

Well, utf8::upgrade does not change logical value in string, only internal representation. So calling utf8::upgrade does not change result and you always get correct output (with and also utf8::upgrade call).

Normally you need to call utf8::upgrade just for debbuging or testing purposes. E.g. test that your C/XS module works correctly when it get string scalar with any internal representation. Or another purpose of using utf8::upgrade is when using buggy/broken C/XS module which does not work correctly with one of internal string representation.

But for pure perl module there is basically no need to use it.

pali commented 4 years ago

What @2shortplanks wrote in his report is to not look at UTF8 flag. In pure perl it can be accessed by Encode::is_utf8 or utf8::is_utf8 functions. And on more places in YAML::PP code is is_utf8 used. This is what should be eliminated. utf8::upgrade is harmless, but not needed.

pali commented 4 years ago

What would be the best description, if I take out the upgrade?

Just say: Input must be Unicode. And output will be in Unicode.

I'm also wondering if the current code is really broken, in a way that the result would behave differently?

Looking at code and is_utf8 is called on these places:

YAML::PP::Emitter::scalar_event

    unless (utf8::is_utf8($value)) {
        utf8::upgrade($value);
    }

utf8::upgrade do nothing on already-upgraded-string, so upgrade can be called unconditionally.

t/45.binary.t

            if (utf8::is_utf8($reload)) {
                utf8::downgrade($reload);
            }

utf8::downgrade do nothing on already-downgraded-string, so downgrade can be called unconditionally.

So both these cases does not lead to broken code, but it is suspicious that is_utf8 is used.

Last usage of is_utf8 is in YAML::PP::Schema::Binary which is problematic, but I for it I created separate issue.

perlpunk commented 4 years ago

Thanks @pali! I have pushed my changes so far to the escape branch (I started with escaping control character issues, see #17 )

perlpunk commented 4 years ago

I just uploaded version 0.018_001 with my fixes

pali commented 4 years ago

In documentation is utf8::upgrade still mentioned, even it was removed: https://metacpan.org/pod/release/TINITA/YAML-PP-0.019/lib/YAML/PP.pm#dump_string

pali commented 4 years ago

So if you read from a file, you should decode it, for example with Encode::decode_utf8($bytes).

Encode::decode_utf8 documentations says that this function should not be used for data exchange: https://metacpan.org/pod/Encode#decode_utf8

So I think it is not a good idea to suggest using this function for decoding external data.

pali commented 4 years ago

Anyway, original problem about internal representation as described in this ticket seems to be already fixed. Thanks!

perlpunk commented 4 years ago

@pali thanks, fixed in git

pali commented 4 years ago

Thank you! I think that this issue is fixed now and can be closed.

Btw, Encode::encode_utf8 has same warning: https://metacpan.org/pod/Encode#encode_utf8

perlpunk / YAML-PP-p5

dump_string must not care about Perl's internal representation of a variable #16