Closed gibson042 closed 1 year ago
While the example from the specification you are referring to is indeed incorrect, the spirit of the raw-string
literal is basically to accept any character as is. It only processes the following two escape sequences:
I think you have addressed this point in the following pull-request.
However I do not see any conflicting examples in the tests/literal.json you linked to.
May I suggest that you are incorrectly interpreting the example you are referring to which in fact, does not contain any backslash character at all, but indeed contains a linefeed? 🤔
Expression | Expressed as JSON | Result | Comment |
---|---|---|---|
'␊' | '\n' | "\n" |
U+000A LINE FEED | ||
'\u03a6' | '\\u03a6' | "\\u03a6" |
Do not interpret escaped Unicode | ||
'foo\'bar' | 'foo\\'bar' | "foo'bar" |
Can escape the single quote | ||
'\z' | '\\z' | "\\z" |
Backslash not followed by single quote is treated as any other character", | ||
'\\' | '\\\\' | "\\\\" |
Backslash not followed by single quote is treated as any other character", |
While the example from the specification you are referring to is indeed incorrect, the spirit of the
raw-string
literal is basically to accept any character as is. It only processes the following two escape sequences:
- backslash followed by simple quote -> simple quote.
- backslash followed by itself -> backslash.
That makes sense to me, but the second bullet point is contradicted by the last test case at https://github.com/jmespath-community/jmespath.test/blob/aa6fb5f942e852af1781d1f208c4069a15794a80/tests/literal.json#L183-L197 . If backslash followed by backslash were an escape sequence representing a single backslash, then that case would instead look more like "Can escape the single quote":
{
- "comment": "Backslash not followed by single quote is treated as any other character",
+ "comment": "Can escape backslash",
"expression": "'\\\\'",
- "result": "\\\\"
+ "result": "\\"
}
May I suggest that you are incorrectly interpreting the example you are referring to which in fact, does not contain any backslash character at all, but indeed contains a linefeed?
I don't think so, and I'll demonstrate using your table:
Expression | Expressed as JSON | Result | Comment | Response |
---|---|---|---|---|
'␊' |
'\n' |
"\n" |
U+000A LINE FEED | not currently allowed by raw-string-char = (%x20-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape |
'\u03a6' |
'\\u03a6' |
"\\u03a6" |
Do not interpret escaped Unicode | ✔ |
'foo\'bar' |
'foo\\'bar' |
"foo'bar" |
Can escape the single quote | ✔ |
'\z' |
'\\z' |
"\\z" |
Backslash not followed by single quote is treated as any other character | ✔ |
'\\' |
'\\\\' |
"\\\\" |
Backslash not followed by single quote is treated as any other character | JSON text "\\\\" expresses a string consisting of two consecutive backslashes rather than a string consisting of just a single backslash, so this result is analogous to the preceding '\z' ↔︀︁︎︎︎ `"\\z"` rather than to the '\'' ↔️︎ `"'"` that I think we both consider preferable |
I realize that control characters are not currently specified as valid for raw-string
literals. Thanks for pointing this out. Indeed it seems we need to update the grammar rule accordingly. (#16)
Now I understand the last example as well. The JMESPath expression is a raw-string
that contains two consecutive backslashes, which represents how a single backslash is escaped. Thus the resulting string should be a single backslash, which in JSON is represented by a sequence of two backslashes 🤔😏 (#15)
I agree with your assessment.
Should we really allow control characters in raw-string
literals 🤔 ?
Like:
raw-string = "'" *raw-string-char "'"
raw-string-char = (%x00-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape
preserved-escape = escape (%x00-26 / %x28-5B / %x5D-10FFFF)
raw-string-escape = escape ("'" / escape)
That feels somehow... Uncanny 😲
Or maybe just a subset of those supported as valid escape sequences in JSON:
raw-string = "'" *raw-string-char "'"
raw-string-char = (%x08-09 / %x0A / %x0C-0D / %x20-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape
preserved-escape = escape (%x08-09 / %x0A / %x0C-0D / %x20-26 / %x28-5B / %x5D-10FFFF)
raw-string-escape = escape ("'" / escape)
I would prefer to prohibit C0 control characters in raw strings, but if any are allowed then they all should be (and I don't know your comfort level with backwards-incompatible changes to the test suite, even if they are already supported by the formal grammar).
Alternatively, if it is a goal for all possible strings to be representable as raw literals (which also implies interpretation of \\
as an escape sequence for U+005C REVERSE SOLIDUS \
, making '\\'
equivalent to `"\\"`
and `"\u005C"`
), then C0 control characters should be allowed.
I agree to forbid usage of control character in JMESPath expressions. Seems that was never the intended spirit when introduced in JEP-12
Seems that was never the intended spirit when introduced in JEP-12
Oh, thanks for digging that up! Its examples actually contain an explicit rebuttal:
- A raw string literal that contains new lines:
'foo bar baz!'
The above expression would be parsed as a string that contains new lines:
foo baz bar!
And even more usefully, the "Disallow single quotes in a raw string" alternative documents its intent to fully subsume JSON string literals:
An alternative approach could be to not allow single quotes in a raw string literal. While this would simplify the
raw-string
grammar rule, it would severely limit the usability of theraw-string
rule, forcing users to use theliteral
rule.
The JEP-12 proposed grammar was missing an escape for \
, but it also lacked support for \
appearing inside a raw string literal at all other than as part of a \'
escape sequence, which was not fixed until over a year after adoption when https://github.com/jmespath/jmespath.site/commit/52b91d9e53904c37f059547799130252fde0ec3f introduced preserved-escape
to align it with a '\u03a6'
example in the then-current jmespath.site[^1] introduced by the JEP-12 spec commit and test in the then-current jmespath.test introduced by the corresponding JEP-12 test commit. And the https://github.com/jmespath/jmespath.test/commit/c0f7923dd1a091b209bd9b4247a89de7442b0a60 commit associated with that introduction of preserved-escape
is the source of the incorrect "expression": "'\\\\'"
test which doesn't even agree with the search('\\', "") -> "\\"
example introduced in the main repository by the same author at the same time.
[^1]: the example I just corrected in https://github.com/jmespath-community/jmespath.spec/pull/132
All things considered, I think we have a clear picture that raw string literals were intended to fully subsume the expressiveness of JSON string backtick literals and treat backslashes literally except for the two specific cases of \'
and \\
, which are respectively escape sequences for '
and \
. And since supporting control characters such as line feed and even NULL is technically harmless, I think this community fork might as well honor that original intent. Concretely, it would mean
%x20-26
to %x00-26
in both raw-string-char
and preserved-escape
productions, and"expression": "'\\\\'"
test to "result": "\\"
(i.e., #15) and add new tests for raw U+0000 through U+001F.This makes me realize that JEP-12 while having been accepted is actually incomplete. Thanks for pointing out all the subequent commits that collectively allude to the thought process at the time.
I think an addendum is required for JEP-12
However, allowing \\
as a valid escape sequence does lend itself to the leaning toothpick syndrome that JEP-12 was explicitely trying to guard against… 🤔
So I’m starting to be convinced that the "expression": "\\\\"
test is not incorrect after all.
I think the intended grammar should have been:
raw-string = "'" *raw-string-char "'"
; The first grouping matches any character other than "'" and "\"
raw-string-char = (%x00-26 / %x28-5B / %x5D-10FFFF) / preserved-escape /raw-string-escape
; The second grouping matches any character other than "'"
preserved-escape = escape (%x20-26 / %x28-10FFFF)
raw-string-escape = escape "'"
Making those expressions valid raw-string
literals:
'␊Hello␊World␊'
-> "\nHello\nWorld\n"
'Null Terminated String␀'
-> "Null Terminated String\u0000"
'Awesome, isn\'t it?'
-> "Awesome, isn't it?"
'\p\r\e\s\e\r\v\e\d'
-> "\\p\\r\\e\\s\\e\\r\\v\\e\\d"
(emphasis on '\r'
-> "\\r"
instead of "\r"
)'preserving reverse \\ solidus'
-> "preserving reverse \\ solidus"
That said, I’m not comfortable allowing C0 control characters in a raw-string
as there would not be any easy way to write them in the first place…
What would be the point of a \'
escape to support representation of literal '
without a corresponding \\
escape to support literal representation of \
? Your last example even presumes it:
'preserving reverse \\ solidus'
->"preserving reverse \\ solidus"
(which in the absence of special treatment for \\
like raw-string-escape = escape ("'" / escape)
would instead have a result like "preserving reverse \\\\ solidus"
)
That said, I’m not comfortable allowing C0 control characters in a
raw-string
as there would not be any easy way to write them in the first place…
Maybe not all of them, but remember that tab and line feed are included in that set. And it's also not too hard to get text containing ANSI escape sequences (which start with 0x1B).
OK, I agree that the current grammar is consistent with the last example from that page:
search( '\\', "" ) -> "\\"
That example clearly illustrates usage of two consecutive backslash characters to escape a single resulting backslash character in the output string.
That said, I’m not comfortable allowing C0 control characters in a
raw-string
as there would not be any easy way to write them in the first place…Maybe not all of them, but remember that tab and line feed are included in that set. And it's also not too hard to get text containing ANSI escape sequences (which start with 0x1B).
To be clear, I’m explicitly referring to exanding the set of characters from %x20-26
to %x00-26
.
This is what I’m mostly uncomfortable with.
I can’t see a good use case for making an exception here in raw string and not in, say quoting-string
expressions.
I can’t see a good use case for making an exception here in raw string and not in, say
quoting-string
expressions.
I can make that case: quoted-string
is aligned with JSON and in particular already supports the representation of arbitrary code points by means of \u
escapes, whereas raw-string
supports escapes only for its delimiter and escape prefix and would otherwise be unable to represent code points from U+0000 through U+001F, limiting usability in a way that was rejected in JEP-12 Alternative approaches and (in the case of U+000A LINE FEED) explicitly intended per JEP-12 Examples.
(in the case of U+000A LINE FEED) explicitly intended per JEP-12 Examples.
I rest my case your honor 😏
For the record, to fix this issue, the specification needs to be updated to:
raw-string = "'" *raw-string-char "'"
raw-string-char = (%x00-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape
preserved-escape = escape (%x00-26 / %x28-5B / %x5D-10FFFF)
raw-string-escape = escape ("'" / escape)
I must admit that I do not understand the purpose of the preserved-escape
rule🤔
I must admit that I do not understand the purpose of the
preserved-escape
rule
I think I can explain it... the raw-string-escape
rule exists as a place to attach the behavior by which a backslash escapes a following backslash or single quote, and backslash must therefore not be allowed in the parent raw-string-char
. And given the (however misguided) desire to treat literally any backslash that is not followed by one of those two characters, the preserved-escape
rule is introduced to match and implement precisely that lack of escaping. Basically, the contents of a raw string are an arbitrarily long sequence in which each element is one of the following, collectively allowing representation of any code point sequence expressible in the outer encoding:
raw-string-char
: any code point other than '
or \
, representing itselfraw-string-escape
: \'
or \\
, respectively representing '
or \
preserved-escape
: \
followed by a code point that is neither '
or \
, representing the sequence consisting of both code points in orderI think this issue has been solved.
jmespath.site Grammar defines the grammar for raw string literals like
and Raw String Literals includes a
search('\\', "") -> "\\"
example implying that\\
is an escape sequence representing a single U+005C REVERSE SOLIDUS just like\'
is an escape sequence representing a single U+0027 APOSTROPHE.However, tests/literal.json in this repository includes test cases contradicting the above, such as
{ "expression": "'\n'", "result": "\n" }
(a raw string literal including a U+000A LINE FEED, which is not covered byraw-string-char
) and{ "comment": "Backslash not followed by single quote is treated as any other character", "expression": "'\\\\'", "result": "\\\\" }
(added in 2016 by c0f7923dd1a091b209bd9b4247a89de7442b0a60).Both disagreements ultimately affect whether or not raw string literals are capable of representing all possible sequences of code points (respectively those including C0 control characters and those including U+005C REVERSE SOLIDUS), and the latter is especially odd—I've never heard of any other escaping approach in which the escape prefix itself is unrepresentable.