ingydotnet / yaml-libyaml-pm

Perl Binding to libyaml
http://search.cpan.org/dist/YAML-LibYAML/
33 stars 37 forks source link

Parsing strings that look like numbers changed in release 0.87 #111

Open veryrusty opened 1 year ago

veryrusty commented 1 year ago

The change to turn off the internal POK flag in release 0.87 has impacted how strings that look like numbers are parsed.

Using a variation of the code from #68:

use v5.16;
use YAML::XS qw/ Load /;
use Devel::Peek qw/ Dump /;
use Data::Dumper qw/ Dumper /;
my $yaml = "foo: 0003\nbar: 03.30";
my $data = Load $yaml;
say Dumper $data;
Dump $data->{foo};
Dump $data->{bar};

YAML::XS@0.86 gives strings

$VAR1 = {
          'bar' => '03.30',
          'foo' => '0003'
        };

SV = PVIV(0x5648b041dde0) at 0x5648b04145f8
  REFCNT = 1
  FLAGS = (IOK,POK,pIOK,pPOK)
  IV = 3
  PV = 0x5648b0422f60 "0003"\0
  CUR = 4
  LEN = 10
SV = PVNV(0x5648b03fb300) at 0x5648b0414688
  REFCNT = 1
  FLAGS = (NOK,POK,pIOK,pNOK,pPOK)
  IV = 3
  NV = 3.3
  PV = 0x5648b05e4000 "03.30"\0
  CUR = 5
  LEN = 10

While YAML::XS@0.87 has munged them, numified 0003 and dropped the zeros from 03.30

$VAR1 = {
          'bar' => '3.3',
          'foo' => 3
        };

SV = PVNV(0x55b424e98300) at 0x55b424eb15f8
  REFCNT = 1
  FLAGS = (IOK,pIOK)
  IV = 3
  NV = 3
  PV = 0x55b424ec00a0 "0003"\0
  CUR = 4
  LEN = 10
SV = PVNV(0x55b424e98320) at 0x55b424eb1688
  REFCNT = 1
  FLAGS = (NOK,pNOK)
  IV = 3
  NV = 3.3
  PV = 0x55b424fb2c10 "3.3"\0
  CUR = 3
  LEN = 32

With $YAML::XS::QuoteNumericStrings defaulting to true, I'd expect those leading zeros to be preserved (as the 0.86 release does).

perlpunk commented 1 year ago

First of all, $YAML::XS::QuoteNumericStrings is an option for dumping. It will quote strings that look like numbers, to make sure they are loaded as strings again.

The behaviour you are seeing is an incompatible change, I can see that. However, it's more correct. An unquoted scalar 03.30 or 0003 is a number according to the YAML spec. The zeroes are presentation details that should not be preserved. Can you tell us more about the use case that relies on this behaviour?

veryrusty commented 1 year ago

An unquoted scalar is a number according to the YAML spec

Double checking I've understood the spec correctly; that arrises from the extended list of regexs from the core schema?

Can you tell us more about the use case that relies on this behaviour

We have 18+ years of data collection specifications that use YAML docs as their primary source. They get processed into several forms, some of which are publicly available. eg: https://docs.validator.com.au/nocc/02.10/appendix-a.html#file-header-record . In many places there are zeroes that need to be (and were being) preserved.

All the specification have a version string that is \d\d.\d\d, like 03.30 which when parsed to 3.3 and doesn't match its intended format. Another field must be \d{4} with its missing value defined as missing: 0003; similarly 3 doesn't match the intended format.

For now, we've pinned YAML::XS@0.86 in our builds as a short term fix.

perlpunk commented 1 year ago

An unquoted scalar is a number according to the YAML spec

Double checking I've understood the spec correctly; that arrises from the extended list of regexs from the core schema?

That's a not so trivial question. YAML::XS doesn't implement the YAML 1.2 Core schema, but also not the 1.1 types. It only loads things as numbers for which the perl function looks_like_number returns true.

And so far it used SvIV_please on those, resulting in scalars which have both the IOK or NOK and the POK flag on.

Can you tell us more about the use case that relies on this behaviour

We have 18+ years of data collection specifications that use YAML docs as their primary source. They get processed into several forms, some of which are publicly available. eg: https://docs.validator.com.au/nocc/02.10/appendix-a.html#file-header-record . In many places there are zeroes that need to be (and were being) preserved.

All the specification have a version string that is \d\d.\d\d, like 03.30 which when parsed to 3.3 and doesn't match its intended format. Another field must be \d{4} with its missing value defined as missing: 0003; similarly 3 doesn't match the intended format.

So in general if you want to have leading or trailing zeroes preserved, the correct thing to do is quote it in your YAML files. If you are ever using a different YAML parser (in perl or a different language) then you will lose those zeroes.

I can imagine that it would be possible to still preserve the PV with the zeroes and turn off POK at the same time, but that is something I have to try out, and will cost me time.

Reverting the change will result in others not using YAML::XS, and making it configurable is also quite some work.

@ingydotnet any comments from you?

ingydotnet commented 1 year ago

The only thing that comes to mind immediately is that we add an option to do it either way.

perlpunk commented 1 year ago

btw, here are several YAML modules listed with their behaviour on various strings: https://perlpunk.github.io/YAML-PP-p5/schema-examples.html

perlpunk commented 1 year ago

For the record, I tested:

SvIV_please(scalar);
SvPOK_off(scalar);

instead of using SvIOK_only/SvNOK_only and it had the same effect of losing the zeroes.

veryrusty commented 1 year ago

Thanks for the comments and further investigation @perlpunk & @ingydotnet .

After kicking it over with the team here, we will work back through our yaml docs and quote the values that should be treated as strings. As much of a PITA that will be, it'll be less than burdening you with ongoing maintenance of a option/switch to allow either behaviour.

As such, this can be closed.

perlpunk commented 1 year ago

@veryrusty Apparently there are others with similar problems, so we are probably going to revert the change and figure out what to do next.

ingydotnet commented 1 year ago

112

karenetheridge commented 1 year ago

Note that you can use single quotes in the YAML data to indicate that the value should be interpreted as a string, not a number. e.g. compare these two:

$ perl -MData::Dumper -MYAML::XS -wle'print Dumper(Load("---\n- 0.3\n- 0.30\n- 03.30\n"))'
$VAR1 = [
          '0.3',
          '0.3',
          '3.3'
        ];

$ perl -MData::Dumper -MYAML::XS -wle'print Dumper(Load("---\n- 0.3\n- \x{27}0.30\x{27}\n- \x{27}03.30\x{27}\n"))'
$VAR1 = [
          '0.3',
          '0.30',
          '03.30'
        ];

(\x{27} used here rather than a literal ' only because it's a shell oneliner)

karenetheridge commented 1 year ago

As an alternative solution, perhaps a "interpret all numbers as strings" config value could be added, but it should not default to on, because it contradicts the yaml spec (or do whatever with it, if what the spec says is not important).

Just please, don't produce values that have both POK and NOK set :)

karenetheridge commented 8 months ago

Are there any additional thoughts for what to do about this?

perlpunk commented 8 months ago

I can imagine adding an option that offers a way around this behaviour, but not as a default. The current behaviour must stay the default, otherwise it would break too much; that is what we know now. But I have a hundred small things on my plate; and when I think of doing something for YAML::XS I always hesitate because I have to workaround Zilla::Dist, and over the years I got really tired of it. It doubles the amount of work for an XS module, because you cannot just call make && prove -l. If we had Dist::Zilla here, I would do it. So if @ingydotnet would allow such an option, PRs are welcome, but I don't feel like doing it.

karenetheridge commented 8 months ago

@ingydotnet would you consider allowing a small development-tailored Makefile.PL in the repository for XS dists? Moose (etc) does this to allow a developer to still do perl Makefile.PL; make test in the repository.

perlpunk commented 3 months ago

@ingydotnet would you consider allowing a small development-tailored Makefile.PL in the repository for XS dists? Moose (etc) does this to allow a developer to still do perl Makefile.PL; make test in the repository.

That will be solved by #117