mustangostang / spyc

A simple YAML loader/dumper class for PHP
MIT License
701 stars 207 forks source link

Convert quoted \n characters to new lines. #28

Closed quicksketch closed 10 years ago

quicksketch commented 10 years ago

Fixes https://github.com/mustangostang/spyc/issues/27, quoted "\n" characters should be converted to new lines when parsing. Likewise, unquoted \n characters should not be converted to new lines.

ryanuber commented 10 years ago

Hey @quicksketch, this looks good. It seems like there is some deviation in almost every YAML implementation, but I haven't found a clear answer to the correct behaviour.

Spyc now loads this:

php > var_dump(Spyc::YAMLLoad('- "string with literal \n"'));
array(1) {
  [0]=>
  string(21) "string with literal 
"
}
php > var_dump(Spyc::YAMLLoad('- "string with literal \\n"'));
array(1) {
  [0]=>
  string(21) "string with literal 
"
}

Ruby's YAML parser does pretty much the same thing:

irb(main):002:0> YAML.load '- "string with literal \\n"'
=> ["string with literal \n"]
irb(main):003:0> YAML.load '- "string with literal \n"'
=> ["string with literal \n"]

PyYAML gets a little funky, but its using the libyaml C library:

>>> yaml.load('- "string with literal \n"')
['string with literal ']
>>> yaml.load('- "string with literal \\n"')
['string with literal \n']

The YAML spec does mention that two backslashes \\ represent a literal backslash. So I'm confused on which behaviour is right.

I think what we have in Spyc is actually fine now with this pull - I am wondering what anyone else's thoughts are on what the most correct behaviour is.

quicksketch commented 10 years ago

Thanks for both the merge @mustangostang and the comments @ryanuber. I'm pretty sure we're treating new lines correctly now, but I think @ryanuber is right that backslashes are not interpreted correctly.

From looking at the spec, my understanding is that escaping with backslashes is only needed when inside quotes. If unquoted, a single backslash does not need to be escaped. I think this summary covers it very explicitly: http://yaml.org/spec/1.0/#id2566934

So in short, we have a new issue with backslashes. The first of @ryanuber's examples is correct, but the second one is not. From the second example, this YAML:

- "string with literal \\n"'

Is equal to this YAML:

- string with literal \n

Both should be parsed as:

array(
  0 => 'string with literal \n',
)
mustangostang commented 10 years ago

Well, we need a way to have to have \n characters in a quoted string, so \\n actually seems fine. I'll draft some support for that.

Does YAML specification say something about \r and \t, or is \n the only case we should support?

quicksketch commented 10 years ago

Does YAML specification say something about \r and \t, or is \n the only case we should support?

There's a whole bucket of special characters listed here: http://yaml.org/spec/1.0/#id2568446. So it includes not only \r, \t, and \n but also \a, \b, \e, \f... there are 20 total. I'm not sure if PHP interprets all of those characters when inside of double quotes, but it's Unicode standard, so I would assume it would do a good enough job.

For all of these characters, they should be able to be escaped with double-backslashes in quoted text, or printed as-is for unquoted text:

- "string with new lines \\r\\n and \\t tabs"
- string with new lines \r\n and \t tabs

Should both be parsed as:

array(
  0 => 'string with new lines \r\n and \t tabs',
)