php / pecl-file_formats-yaml

YAML-1.1 parser and emitter
https://pecl.php.net/package/yaml
MIT License
72 stars 32 forks source link

yaml_parse is not treating all tags as strings #69

Closed vaheen closed 1 year ago

vaheen commented 2 years ago

yaml_parse is not treating tags as strings and is trying to treat any integer (eg 1) or float (eg 1.1) as an integer before storing it as a tag. This results in 1:, 1.2: and 1.3: all being treated as 1: and overwriting each other. All other possible tags are treated as strings (eg 1.0beta: and 1.5.1:).

tested with versions:

<?php

$yaml = <<<YAML

---
  version:
    1.0beta:
      comment: will get loaded correctly
    1:
      comment: will get ignored - overwritten by the last 1.x
    1.2:
      comment: will get ignored - overwritten by the last 1.x
    1.3:
      comment: will get loaded as version 1
    1.5.1:
      comment: will get loaded correctly

YAML;

echo '<pre>';
var_dump(yaml_parse($yaml));
echo '</pre>';

output:

array(1) {
  ["version"]=>
  array(3) {
    ["1.0beta"]=>
    array(1) {
      ["comment"]=>
      string(25) "will get loaded correctly"
    }
    [1]=>
    array(1) {
      ["comment"]=>
      string(28) "will get loaded as version 1"
    }
    ["1.5.1"]=>
    array(1) {
      ["comment"]=>
      string(25) "will get loaded correctly"
    }
  }
}
vaheen commented 2 years ago

Deprecation warnings:

Deprecated:  Implicit conversion from float 1.2 to int loses precision in D:\--SOME-PATH--\test-yaml.php on line 21
Deprecated:  Implicit conversion from float 1.3 to int loses precision in D:\--SOME-PATH--\test-yaml.php on line 21

I see no need for any tags to ever be treated at integers.

bd808 commented 2 years ago

These "tags" are mapping keys. YAML mappings are implemented in PHP as arrays. The behavior you dislike is from PHP itself. PHP coerces numeric array keys to integers. See this 3v4l demo as an example: https://3v4l.org/0EMCM

vaheen commented 2 years ago

yes @bd808, that is PHP array behaviour, but the yaml_parse() does not need to inherit this problem.

Could the code replace this behaviour (please excuse the php pseudo code)

$array[$tag] = $value;

with a conversion of the tag to a string

$array[strval($tag)] = $value;

This would ensure that the library is capable of processing standard YAML files and has been without falling over a known issue.

PHP will still convert integers in string back to integers , but floats will remain as strings:

<?php

echo '<pre>';
var_dump([ 'version' => [ '1.0beta' => 'comment: will get loaded correctly'
            , '1' => 'comment: will get ignored - overwritten by the last 1.x'
            , '1.2' => 'comment: will get ignored - overwritten by the last 1.x'
            , '1.3' => 'comment: will get loaded as version 1'
            , '1.5.1' => 'comment: will get loaded correctly'
            ]]);
echo '</pre>';

outputs as:

array(1) {
  ["version"]=>
  array(5) {
    ["1.0beta"]=>
    string(34) "comment: will get loaded correctly"
    [1]=>
    string(55) "comment: will get ignored - overwritten by the last 1.x"
    ["1.2"]=>
    string(55) "comment: will get ignored - overwritten by the last 1.x"
    ["1.3"]=>
    string(37) "comment: will get loaded as version 1"
    ["1.5.1"]=>
    string(34) "comment: will get loaded correctly"
  }
}
tomterl commented 1 year ago

As far as I understand, this is not a php issue, because the YAML 1.1 spec states

YAML places no further restrictions on the nodes. In particular, keys may be arbitrary nodes see here

That means forcing keys to be strings would be violating the YAML-Spec And that means in short: If you want to ensure something is interpreted as a string, quote it.

Edit: For a real php treat, consider this yaml document N: true. As per yaml 1.1 spec, N is parsed as boolean false, for php equivalent to 0, thus resulting in:

>>> $yaml = "N: true";
=> "N: true"

>>> yaml_parse($yaml);
=> [
     true,
   ]

To get the probably intended structure, the N has to quoted:

>>> $yaml = "'N': true";
=> "'N': true"

>>> yaml_parse($yaml);
=> [
     "N" => true,
   ]
vaheen commented 1 year ago

That means forcing keys to be strings would be violating the YAML-Spec And that means in short: If you want to ensure something is interpreted as a string, quote it.

changing the format of someone else's yaml file is not an option !

These entries are equally valid in yaml. The first in not supported by yaml_parse() as it mangles the interpreted float (1.1), which the code casts as an integer (of 1) for the key.

    1.1: x
    1.1.1: y

the second entry can't be interpreted as a number and therefore has no issue being read as intended

tomterl commented 1 year ago

I assumed you had control over the file - my bad.

In any case, for the same reason you are giving (changing other peoples yaml files is not an option), the extension cannot really force all keys to be strings.

Handling IS_DOUBLE differently (line 448 in parse.c) could be an option, but would be a breaking change I think.

(I still think projects using version-strings as keys without quoting are at fault, because, mostly unintentionally IMHO, they mix data types (some versions being numbers, other's being strings))

[...]These entries are equally valid in yaml.

Yes, YAML allows floating point numbers as keys, but PHP does not - the extension could work around that, as mentioned above, but not for all incompatabilities between YAML and php.

take boolean values: php does not allow true boolean keys; a yaml document with true and 1 or false and 0 keys would loose one entry:

> $y = '{true: "true", 1: "one", false: "false", 0: "zero"}';
= "{true: "true", 1: "one", false: "false", 0: "zero"}"

> yaml_parse($y);
= [
    1 => "one",
    0 => "zero",
  ]

Quoting the keys does not help, because while the entries are preserved:

> $y = '{"true": "true", 1: "one", "false": "false", 0: "zero"}';
= "{"true": "true", 1: "one", "false": "false", 0: "zero"}"

> yaml_parse($y);
= [
    "true" => "true",
    1 => "one",
    "false" => "false",
    0 => "zero",
  ]

the semantics of the keys get lost:

> "false" == true;
= true
vaheen commented 1 year ago

not a question, this is a bug, the library can not read simple yaml files correctly

tomterl commented 1 year ago

not a question, this is a bug, the library can not read simple yaml files correctly

I tried to explain: This is not as black and white as you perceive it. YAML as a format sucks if used with inconsistencies like using unquoted version strings or unquoted boolean values as keys in mappings, because it clashes with language capabilities - php is not the only language with problems in that regard, C has no real boolean values, too and falls prey to the same problems as php in this regard.

And again: Solving your problem by implicit conversion to string can lead to subtle bugs elsewhere. Please read https://ruudvanasseldonk.com/2023/01/11/the-yaml-document-from-hell to get a feeling for the complexity of the situation. Your problem stems from a clash of capabilities of data-representation in YAML and php, one allows for floating-point values as keys, the other not. Implicit casts are not a solution to that problem, but a mere workaround.

You find the version-keys 'simple', because you know their semantics, a yaml-parser cannot share into this knowledge, and another document type might have very different needs. The code base isn't that complex, if you have an idea how to solve your problem without throwing everyone else under the bus, I'm sure a patch would be discussed - but failing to understand the underlying problem, because it clashes with your wrong assumptions about yaml isn't very helpful to the process.

Edit: To me this version as key usage is a classic case of BIBO (Bullshit In, Bullshit Out).