Closed timbray closed 1 year ago
I note that the syntax includes
hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
non-surrogate = ((DIGIT / "A"/"B"/"C" / "E"/"F") 3HEXDIG) /
("D" %x30-37 2HEXDIG )
high-surrogate = "D" ("8"/"9"/"A"/"B") 2HEXDIG
low-surrogate = "D" ("C"/"D"/"E"/"F") 2HEXDIG
But I'm not sure these constructs are actually used?
hexchar
is referenced towards the end of the previous production:
escapable = ( %x62 / %x66 / %x6E / %x72 / %x74 / ; \b \f \n \r \t
; b / ; BS backspace U+0008
; t / ; HT horizontal tab U+0009
; n / ; LF line feed U+000A
; f / ; FF form feed U+000C
; r / ; CR carriage return U+000D
"/" / ; / slash (solidus) U+002F
"\" / ; \ backslash (reverse solidus) U+005C
(%x75 hexchar) ; uXXXX U+XXXX
)
From this, the only \uXXXX
escapes which are currently supported have precisely four hex digits, thus ruling out \u1F600
(which has five hex digits). This is the same as JSON.
So the spec is specific and it forces you to use unescaped characters for emoji etc.
I think it should be stated explicitly that the rules are exactly the same as JSON.
I also view JSON as being stupid and basically wrong on this issue and wonder why we have to stick with exactly 4 digits - JSONPath isn't JSON. Consistency with existing implementations?
3.5.1 has this:
Note:
double-quoted
strings follow the JSON string syntax ({{Section 7 of RFC8259}});single-quoted
strings follow an analogous pattern ({{syntax-index}}).
This is only a note, and is not explicitly repeated in other places making use of strings, but it seems to indicate the hexchar syntax is in actual use. BTW, hexchar is non-surrogate / (high-surrogate "\" %x75 low-surrogate)
, so the comment on the place where it is used is misleading.
JSON just inherited the atrocious JavaScript syntax for this; we can't really blame JSON for following its main design rule (don't invent where JavaScript already has something defined). I sure wouldn't mind a more 20xx syntax for beyond-BMP characters as an alternative; where would we steal that from?
JavaScript (ES2020) has \u{ CodePoint }
, where CodePoint :: HexDigits but only if MV of HexDigits ≤ 0x10FFFF
.
This aligns with other languages I have seen.
Is there any JSONPath implementation that already does this?
ES2011 (5.1) didn't have \u{nnn}
, ES2015 (6) seems to have introduced it.
They allow leading zeroes (duh).
2023-01-10 Interim: That would be innovation, biased to JavaScript. Rough consensus: no innovation.
Let's add a note saying that the syntax is the same as JSON.
2023-01-10 Interim: Leave a note that this is exactly the same as JSON.
In 3.5.1, Name Selector, it discusses the \uXXXX escape. Is this to be interpreted exactly as in JSON, i.e. if I want to refer to a character outside the Basic Multilingual Plane, I have to compose two UTF-16 codepoints? Or might I be able to say \u1F600 for an emoji smiley?
The spec should be specific about this.