google / starlark-go

Starlark in Go: the Starlark configuration language, implemented in Go
BSD 3-Clause "New" or "Revised" License

scanner: accept \uXXXX and \UXXXXXXXX escapes (and specify them!) #222

alandonovan closed this issue 3 years ago

alandonovan commented 5 years ago

Starlark quotes strings as if by Go's fmt %q verb, which escapes non-printable Unicode code points as \uXXXX or \UXXXXXXXX. However, the Starlark scanner does not currently recognize this syntax, and the spec says nothing about the form of string literals.

Welcome to Starlark (go.starlark.net)
>>> chr(0x00A0) # NO-BREAK SPACE (non printable)
"\u00a0"
>>> "\u00a0"
"\\u00a0"
>>> chr(0x400) # CYRILLIC CAPITAL LETTER IE WITH GRAVE
"Ѐ"
>>> '\u0x400'
"\\u0x400"
>>> chr(0x0001f63f) # CRYING CAT FACE
"😿"
>>> '\U0001f63f'
"\\U0001f63f"

Contrast with Python 3:

Python 3.6.5 (default, Mar 31 2018, 05:34:57) 
>>> chr(0x00A0) # NO-BREAK SPACE (non printable)
'\xa0'
>>> '\xa0'
'\xa0'
>>> chr(0x400) # CYRILLIC CAPITAL LETTER IE WITH GRAVE
'Ѐ'
>>> '\u0400'
'Ѐ'
>>> '\U0001f63f'
'😿'
>>> chr(0x0001f63f) # CRYING CAT FACE
'😿'

The Starlark spec and implementations should allow \uXXXX and \UXXXXXXXX escapes within strings, with exactly 4 or 8 hex digits respectively.
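
The "exactly 4 or 8 hex digits" rule could be sketched in Go roughly as follows. This is not starlark-go's scanner: decodeUnicodeEscapes is a hypothetical helper that expands only the two proposed escapes and copies every other backslash sequence through unchanged. Rejecting surrogates and values above U+10FFFF is an assumption borrowed from Go's own literal rules, not something the proposal states.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// decodeUnicodeEscapes (hypothetical) expands \uXXXX and \UXXXXXXXX in s.
// All other backslash sequences are copied through untouched.
func decodeUnicodeEscapes(s string) (string, error) {
	var b strings.Builder
	for i := 0; i < len(s); {
		if s[i] != '\\' || i+1 >= len(s) {
			b.WriteByte(s[i])
			i++
			continue
		}
		var n int // number of hex digits the escape requires
		switch s[i+1] {
		case 'u':
			n = 4
		case 'U':
			n = 8
		default:
			// Not a Unicode escape: copy the backslash and the next byte.
			b.WriteString(s[i : i+2])
			i += 2
			continue
		}
		if i+2+n > len(s) {
			return "", fmt.Errorf("escape %q needs exactly %d hex digits", s[i:], n)
		}
		digits := s[i+2 : i+2+n]
		// Base 16 is explicit, so a "0x" prefix or a sign is rejected.
		v, err := strconv.ParseUint(digits, 16, 32)
		if err != nil {
			return "", fmt.Errorf("invalid hex digits %q in escape", digits)
		}
		// Assumption: reject out-of-range values and surrogates, as Go does.
		if v > utf8.MaxRune || !utf8.ValidRune(rune(v)) {
			return "", fmt.Errorf("escape \\%c%s is not a valid code point", s[i+1], digits)
		}
		b.WriteRune(rune(v))
		i += 2 + n
	}
	return b.String(), nil
}

func main() {
	for _, in := range []string{`\u00a0`, `\u0400`, `\U0001f63f`, `\u0x400`} {
		out, err := decodeUnicodeEscapes(in)
		fmt.Printf("%s -> %q, err=%v\n", in, out, err)
	}
}
```

Under this rule an input such as \u0x400 is an error, because "0x40" is not four hex digits.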

Python 2 and 3 also accept \xXX escapes, with two hex digits. Should Starlark? (FWIW: C++ and Go do too; Java does not, and furthermore its \uXXXX notation denotes a UTF-16 code unit, not a Unicode code point.)
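
If \xXX is adopted, its meaning needs pinning down, because the languages cited disagree. In Go, \xXX denotes a single byte, whereas the Python 3 session above shows '\xa0' denoting the code point U+00A0. A small sketch of the Go interpretation follows; escapeForms is a hypothetical helper used only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// escapeForms (hypothetical) returns the same hex value written with
// a byte escape and with a code point escape in Go source.
func escapeForms() (byteForm, runeForm string) {
	return "\xa0", "\u00a0"
}

func main() {
	b, r := escapeForms()
	fmt.Println(len(b), utf8.ValidString(b)) // 1 false: one raw byte, not valid UTF-8
	fmt.Println(len(r), utf8.ValidString(r)) // 2 true: the UTF-8 encoding of U+00A0
}
```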