Unicode escapes in Name Selector

ietf-wg-jsonpath / draft-ietf-jsonpath-base

Development of a JSONPath internet draft

https://ietf-wg-jsonpath.github.io/draft-ietf-jsonpath-base/

Unicode escapes in Name Selector #272

Closed timbray closed 1 year ago

timbray commented 1 year ago

Section 3.5.1, Name Selector, discusses the \uXXXX escape. Is this to be interpreted exactly as in JSON, i.e. if I want to refer to a character outside the Basic Multilingual Plane, do I have to combine two UTF-16 surrogate escapes? Or might I be able to say \u1F600 for an emoji smiley?

The spec should be specific about this.

timbray commented 1 year ago

I note that the syntax includes

hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
non-surrogate = ((DIGIT / "A"/"B"/"C" / "E"/"F") 3HEXDIG) /
                 ("D" %x30-37 2HEXDIG )
high-surrogate = "D" ("8"/"9"/"A"/"B") 2HEXDIG
low-surrogate = "D" ("C"/"D"/"E"/"F") 2HEXDIG

But I'm not sure these constructs are actually used?
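
To make the shape of these productions concrete, here is a minimal sketch that transcribes them into a Python regular expression. The names `HEXCHAR` and `is_hexchar` are hypothetical and appear nowhere in the draft. The sketch assumes that quoted ABNF literals such as "D" are case-insensitive, while %x75 matches only a lowercase u.

```python
import re

# Hypothetical transcription of the ABNF above; not taken from the draft.
# (?i:...) scopes case-insensitivity to the hex digits only, so the
# literal "u" between the two surrogates must stay lowercase (%x75).
NON_SURROGATE = r"(?i:[0-9A-CEF][0-9A-F]{3}|D[0-7][0-9A-F]{2})"
HIGH_SURROGATE = r"(?i:D[89AB][0-9A-F]{2})"
LOW_SURROGATE = r"(?i:D[C-F][0-9A-F]{2})"

# hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
HEXCHAR = re.compile(
    NON_SURROGATE + "|" + HIGH_SURROGATE + r"\\u" + LOW_SURROGATE
)


def is_hexchar(text: str) -> bool:
    """Return True if text (the part after the leading backslash-u) matches hexchar."""
    return HEXCHAR.fullmatch(text) is not None
```

Under this reading, a lone surrogate such as D83D is rejected, and a high surrogate is only accepted when a low surrogate escape follows it immediately.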

glyn commented 1 year ago

hexchar is referenced towards the end of the previous production:

escapable           = ( %x62 / %x66 / %x6E / %x72 / %x74 / ; \b \f \n \r \t
                          ; b /         ;  BS backspace U+0008
                          ; t /         ;  HT horizontal tab U+0009
                          ; n /         ;  LF line feed U+000A
                          ; f /         ;  FF form feed U+000C
                          ; r /         ;  CR carriage return U+000D
                          "/" /          ;  /  slash (solidus) U+002F
                          "\" /          ;  \  backslash (reverse solidus) U+005C
                          (%x75 hexchar) ;  uXXXX      U+XXXX
                      )

From this, the only \uXXXX escapes which are currently supported have precisely four hex digits, thus ruling out \u1F600 (which has five hex digits). This is the same as JSON.

So the spec is specific: for emoji etc. you must use either the unescaped character or a surrogate pair of four-digit escapes.
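
As an illustration of what the four-digit rule means for a character beyond the BMP, the sketch below computes the JSON-style escape for a code point using the standard UTF-16 surrogate arithmetic. The function name `to_json_escape` is hypothetical; the round trip through Python's `json` module is only a sanity check against JSON string syntax, not against any JSONPath implementation.

```python
import json


def to_json_escape(code_point: int) -> str:
    """Return the \\uXXXX form (one or two escapes) for a Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not scalar values")
    if code_point <= 0xFFFF:
        # BMP character: a single four-digit escape is enough.
        return f"\\u{code_point:04X}"
    # Beyond the BMP: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return f"\\u{high:04X}\\u{low:04X}"


escaped = to_json_escape(0x1F600)
# Decode the escape as a JSON string to confirm it names the same character.
decoded = json.loads('"' + escaped + '"')
```

For U+1F600 this yields the pair \uD83D\uDE00, which a JSON parser decodes back to the single emoji character.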

timbray commented 1 year ago

I think it should be stated explicitly that the rules are exactly the same as JSON.

I also view JSON as being stupid and basically wrong on this issue and wonder why we have to stick with exactly 4 digits - JSONPath isn't JSON. Consistency with existing implementations?

cabo commented 1 year ago

3.5.1 has this:

Note: double-quoted strings follow the JSON string syntax ({{Section 7 of RFC8259}}); single-quoted strings follow an analogous pattern ({{syntax-index}}).

This is only a note, and is not explicitly repeated in other places making use of strings, but it seems to indicate the hexchar syntax is in actual use. BTW, hexchar is non-surrogate / (high-surrogate "\" %x75 low-surrogate), so the comment on the place where it is used is misleading.

cabo commented 1 year ago

JSON just inherited the atrocious JavaScript syntax for this; we can't really blame JSON for following its main design rule (don't invent where JavaScript already has something defined). I sure wouldn't mind a more 20xx syntax for beyond-BMP characters as an alternative; where would we steal that from?

cabo commented 1 year ago

JavaScript (ES2020) has \u{ CodePoint }, where CodePoint :: HexDigits but only if MV of HexDigits ≤ 0x10FFFF. This aligns with other languages I have seen. Is there any JSONPath implementation that already does this?
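
For comparison only, here is a rough sketch of how the braced form could be decoded. It is not part of JSONPath or JSON; the names `BRACED_ESCAPE` and `decode_braced_escapes` are hypothetical, and the sketch handles only the \u{...} form, ignoring every other escape.

```python
import re

# Hypothetical pattern for the ECMAScript-style \u{HexDigits} escape.
BRACED_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}")


def decode_braced_escapes(text: str) -> str:
    """Replace each \\u{...} escape in text with the character it names."""

    def replace(match: re.Match) -> str:
        value = int(match.group(1), 16)  # leading zeroes are accepted
        if value > 0x10FFFF:
            raise ValueError("code point exceeds 0x10FFFF")
        return chr(value)

    return BRACED_ESCAPE.sub(replace, text)
```

With this form, \u{1F600} names the emoji directly, with no surrogate pair involved.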

cabo commented 1 year ago

ES2011 (5.1) didn't have \u{nnn}, ES2015 (6) seems to have introduced it. They allow leading zeroes (duh).

cabo commented 1 year ago

2023-01-10 Interim: That would be innovation, biased to JavaScript. Rough consensus: no innovation.

glyn commented 1 year ago

Let's add a note saying that the syntax is the same as JSON.

cabo commented 1 year ago

2023-01-10 Interim: Leave a note that this is exactly the same as JSON.