raw string literal testing disagrees with documentation

gibson042 commented 1 year ago

jmespath.site Grammar defines the grammar for raw string literals like

raw-string        = "'" *raw-string-char "'" 
raw-string-char   = (%x20-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape 
preserved-escape  = escape (%x20-26 / %x28-5B / %x5D-10FFFF) 
raw-string-escape = escape ("'" / escape)

and Raw String Literals includes a search('\\', "") -> "\\" example implying that \\ is an escape sequence representing a single U+005C REVERSE SOLIDUS just like \' is an escape sequence representing a single U+0027 APOSTROPHE.

However, tests/literal.json in this repository includes test cases contradicting the above, such as { "expression": "'\n'", "result": "\n" } (a raw string literal including a U+000A LINE FEED, which is not covered by raw-string-char) and { "comment": "Backslash not followed by single quote is treated as any other character", "expression": "'\\\\'", "result": "\\\\" } (added in 2016 by c0f7923dd1a091b209bd9b4247a89de7442b0a60).

Both disagreements ultimately affect whether or not raw string literals are capable of representing all possible sequences of code points (respectively those including C0 control characters and those including U+005C REVERSE SOLIDUS), and the latter is especially odd—I've never heard of any other escaping approach in which the escape prefix itself is unrepresentable.

springcomp commented 1 year ago

While the example from the specification you are referring to is indeed incorrect, the spirit of the raw-string literal is basically to accept any character as is. It only processes the following two escape sequences:

backslash followed by simple quote -> simple quote.
backslash followed by itself -> backslash.

I think you have addressed this point in the following pull-request.

However I do not see any conflicting examples in the tests/literal.json you linked to.

May I suggest that you are incorrectly interpreting the example you are referring to which in fact, does not contain any backslash character at all, but indeed contains a linefeed? 🤔

Expression	Expressed as JSON	Result	Comment
`'␊'` \| `'\n'` \| `"\n"`	U+000A LINE FEED
`'\u03a6'` \| `'\\u03a6'` \| `"\\u03a6"`	Do not interpret escaped Unicode
`'foo\'bar'` \| `'foo\\'bar'` \| `"foo'bar"`	Can escape the single quote
`'\z'` \| `'\\z'` \| `"\\z"`	Backslash not followed by single quote is treated as any other character",
`'\\'` \| `'\\\\'` \| `"\\\\"`	Backslash not followed by single quote is treated as any other character",

gibson042 commented 1 year ago

While the example from the specification you are referring to is indeed incorrect, the spirit of the raw-string literal is basically to accept any character as is. It only processes the following two escape sequences:

backslash followed by simple quote -> simple quote.

backslash followed by itself -> backslash.

That makes sense to me, but the second bullet point is contradicted by the last test case at https://github.com/jmespath-community/jmespath.test/blob/aa6fb5f942e852af1781d1f208c4069a15794a80/tests/literal.json#L183-L197 . If backslash followed by backslash were an escape sequence representing a single backslash, then that case would instead look more like "Can escape the single quote":

         {
-          "comment": "Backslash not followed by single quote is treated as any other character",
+          "comment": "Can escape backslash",
           "expression": "'\\\\'",
-          "result": "\\\\"
+          "result": "\\"
         }

May I suggest that you are incorrectly interpreting the example you are referring to which in fact, does not contain any backslash character at all, but indeed contains a linefeed?

I don't think so, and I'll demonstrate using your table:

Expression	Expressed as JSON	Result	Comment	Response
`'␊'`	`'\n'`	`"\n"`	U+000A LINE FEED	not currently allowed by `raw-string-char = (%x20-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape`
`'\u03a6'`	`'\\u03a6'`	`"\\u03a6"`	Do not interpret escaped Unicode	✔
`'foo\'bar'`	`'foo\\'bar'`	`"foo'bar"`	Can escape the single quote	✔
`'\z'`	`'\\z'`	`"\\z"`	Backslash not followed by single quote is treated as any other character	✔
`'\\'`	`'\\\\'`	`"\\\\"`	Backslash not followed by single quote is treated as any other character	JSON text `"\\\\"` expresses a string consisting of two consecutive backslashes rather than a string consisting of just a single backslash, so this result is analogous to the preceding `'\z'` ↔︀︁︎︎︎ `"\\z"` rather than to the `'\''` ↔️︎ `"'"` that I think we both consider preferable

springcomp commented 1 year ago

I realize that control characters are not currently specified as valid for raw-string literals. Thanks for pointing this out. Indeed it seems we need to update the grammar rule accordingly. (#16)

Now I understand the last example as well. The JMESPath expression is a raw-string that contains two consecutive backslashes, which represents how a single backslash is escaped. Thus the resulting string should be a single backslash, which in JSON is represented by a sequence of two backslashes 🤔😏 (#15)

I agree with your assessment.

springcomp commented 1 year ago

Should we really allow control characters in raw-string literals 🤔 ?

Like:

raw-string        = "'" *raw-string-char "'" 
raw-string-char   = (%x00-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape 
preserved-escape  = escape (%x00-26 / %x28-5B / %x5D-10FFFF) 
raw-string-escape = escape ("'" / escape)

That feels somehow... Uncanny 😲

springcomp commented 1 year ago

Or maybe just a subset of those supported as valid escape sequences in JSON:

raw-string        = "'" *raw-string-char "'" 
raw-string-char   = (%x08-09 / %x0A / %x0C-0D / %x20-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape 
preserved-escape  = escape (%x08-09 / %x0A / %x0C-0D / %x20-26 / %x28-5B / %x5D-10FFFF) 
raw-string-escape = escape ("'" / escape)

gibson042 commented 1 year ago

I would prefer to prohibit C0 control characters in raw strings, but if any are allowed then they all should be (and I don't know your comfort level with backwards-incompatible changes to the test suite, even if they are already supported by the formal grammar).

Alternatively, if it is a goal for all possible strings to be representable as raw literals (which also implies interpretation of \\ as an escape sequence for U+005C REVERSE SOLIDUS \, making '\\' equivalent to `"\\"` and `"\u005C"`), then C0 control characters should be allowed.

springcomp commented 1 year ago

I agree to forbid usage of control character in JMESPath expressions. Seems that was never the intended spirit when introduced in JEP-12

gibson042 commented 1 year ago

Seems that was never the intended spirit when introduced in JEP-12

Oh, thanks for digging that up! Its examples actually contain an explicit rebuttal:

A raw string literal that contains new lines:
'foo
bar
baz!'
The above expression would be parsed as a string that contains new lines:
foo
baz
bar!

And even more usefully, the "Disallow single quotes in a raw string" alternative documents its intent to fully subsume JSON string literals:

An alternative approach could be to not allow single quotes in a raw string literal. While this would simplify the raw-string grammar rule, it would severely limit the usability of the raw-string rule, forcing users to use the literal rule.

The JEP-12 proposed grammar was missing an escape for \, but it also lacked support for \ appearing inside a raw string literal at all other than as part of a \' escape sequence, which was not fixed until over a year after adoption when https://github.com/jmespath/jmespath.site/commit/52b91d9e53904c37f059547799130252fde0ec3f introduced preserved-escape to align it with a '\u03a6' example in the then-current jmespath.site[^1] introduced by the JEP-12 spec commit and test in the then-current jmespath.test introduced by the corresponding JEP-12 test commit. And the https://github.com/jmespath/jmespath.test/commit/c0f7923dd1a091b209bd9b4247a89de7442b0a60 commit associated with that introduction of preserved-escape is the source of the incorrect "expression": "'\\\\'" test which doesn't even agree with the search('\\', "") -> "\\" example introduced in the main repository by the same author at the same time.

[^1]: the example I just corrected in https://github.com/jmespath-community/jmespath.spec/pull/132

All things considered, I think we have a clear picture that raw string literals were intended to fully subsume the expressiveness of JSON string backtick literals and treat backslashes literally except for the two specific cases of \' and \\, which are respectively escape sequences for ' and \. And since supporting control characters such as line feed and even NULL is technically harmless, I think this community fork might as well honor that original intent. Concretely, it would mean

expanding %x20-26 to %x00-26 in both raw-string-char and preserved-escape productions, and
updating tests/literal.json to correct the "expression": "'\\\\'" test to "result": "\\" (i.e., #15) and add new tests for raw U+0000 through U+001F.

springcomp commented 1 year ago

This makes me realize that JEP-12 while having been accepted is actually incomplete. Thanks for pointing out all the subequent commits that collectively allude to the thought process at the time.

I think an addendum is required for JEP-12

springcomp commented 1 year ago

However, allowing \\ as a valid escape sequence does lend itself to the leaning toothpick syndrome that JEP-12 was explicitely trying to guard against… 🤔

So I’m starting to be convinced that the "expression": "\\\\" test is not incorrect after all. I think the intended grammar should have been:

raw-string        = "'" *raw-string-char "'"
; The first grouping matches any character other than "'" and "\"
raw-string-char   = (%x00-26 / %x28-5B / %x5D-10FFFF) / preserved-escape /raw-string-escape
; The second grouping matches any character other than "'"
preserved-escape  = escape (%x20-26 / %x28-10FFFF)
raw-string-escape = escape "'"

Making those expressions valid raw-string literals:

'␊Hello␊World␊'  -> "\nHello\nWorld\n"
'Null Terminated String␀'-> "Null Terminated String\u0000"
'Awesome, isn\'t it?'-> "Awesome, isn't it?"
'\p\r\e\s\e\r\v\e\d' -> "\\p\\r\\e\\s\\e\\r\\v\\e\\d" (emphasis on '\r' -> "\\r" instead of "\r")
'preserving reverse \\ solidus' -> "preserving reverse \\ solidus"

That said, I’m not comfortable allowing C0 control characters in a raw-string as there would not be any easy way to write them in the first place…

gibson042 commented 1 year ago

What would be the point of a \' escape to support representation of literal ' without a corresponding \\ escape to support literal representation of \? Your last example even presumes it:

'preserving reverse \\ solidus' -> "preserving reverse \\ solidus"

(which in the absence of special treatment for \\ like raw-string-escape = escape ("'" / escape) would instead have a result like "preserving reverse \\\\ solidus")

That said, I’m not comfortable allowing C0 control characters in a raw-string as there would not be any easy way to write them in the first place…

Maybe not all of them, but remember that tab and line feed are included in that set. And it's also not too hard to get text containing ANSI escape sequences (which start with 0x1B).

springcomp commented 1 year ago

OK, I agree that the current grammar is consistent with the last example from that page:

search( '\\', "" ) -> "\\"

That example clearly illustrates usage of two consecutive backslash characters to escape a single resulting backslash character in the output string.

springcomp commented 1 year ago

That said, I’m not comfortable allowing C0 control characters in a raw-string as there would not be any easy way to write them in the first place…

Maybe not all of them, but remember that tab and line feed are included in that set. And it's also not too hard to get text containing ANSI escape sequences (which start with 0x1B).

To be clear, I’m explicitly referring to exanding the set of characters from %x20-26 to %x00-26. This is what I’m mostly uncomfortable with.

I can’t see a good use case for making an exception here in raw string and not in, say quoting-string expressions.

gibson042 commented 1 year ago

I can’t see a good use case for making an exception here in raw string and not in, say quoting-string expressions.

I can make that case: quoted-string is aligned with JSON and in particular already supports the representation of arbitrary code points by means of \u escapes, whereas raw-string supports escapes only for its delimiter and escape prefix and would otherwise be unable to represent code points from U+0000 through U+001F, limiting usability in a way that was rejected in JEP-12 Alternative approaches and (in the case of U+000A LINE FEED) explicitly intended per JEP-12 Examples.

springcomp commented 1 year ago

(in the case of U+000A LINE FEED) explicitly intended per JEP-12 Examples.

I rest my case your honor 😏

springcomp commented 1 year ago

For the record, to fix this issue, the specification needs to be updated to:

raw-string        = "'" *raw-string-char "'" 
raw-string-char   = (%x00-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape 
preserved-escape  = escape (%x00-26 / %x28-5B / %x5D-10FFFF) 
raw-string-escape = escape ("'" / escape)

I must admit that I do not understand the purpose of the preserved-escape rule🤔

gibson042 commented 1 year ago

I must admit that I do not understand the purpose of the preserved-escape rule

I think I can explain it... the raw-string-escape rule exists as a place to attach the behavior by which a backslash escapes a following backslash or single quote, and backslash must therefore not be allowed in the parent raw-string-char. And given the (however misguided) desire to treat literally any backslash that is not followed by one of those two characters, the preserved-escape rule is introduced to match and implement precisely that lack of escaping. Basically, the contents of a raw string are an arbitrarily long sequence in which each element is one of the following, collectively allowing representation of any code point sequence expressible in the outer encoding:

the grouped terminal alternatives of raw-string-char: any code point other than ' or \, representing itself
raw-string-escape: \' or \\, respectively representing ' or \
preserved-escape: \ followed by a code point that is neither ' or \, representing the sequence consisting of both code points in order

springcomp commented 1 year ago

I think this issue has been solved.

jmespath-community / jmespath.test

raw string literal testing disagrees with documentation #14