Escape sequences for "invalid" characters

shirok commented 8 years ago

Since we added R7RS support, the \x escape sequence in string and character literals is interpreted as a Unicode codepoint in R7RS-compatible mode. As a result, we lose the ability to write a raw octet value in the native encoding when the native encoding is sjis or eucjp.

We can still use legacy mode to interpret \xNN as a raw octet, but for better compatibility and read/write invariance, it might be better to introduce another escape sequence that specifically indicates raw octets. (This is especially important for incomplete string literals: such a raw octet representation would allow writing portable incomplete string literals.)
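
To make the distinction concrete, here is a minimal sketch in portable R7RS Scheme. It assumes an implementation with full Unicode support, and the variable names are made up for illustration. It contrasts the codepoint reading of \x with the octet sequences the same character has in three encodings:

```scheme
(import (scheme base))

;; R7RS reading: \x3042; names the codepoint U+3042 (HIRAGANA LETTER A),
;; independent of the encoding the implementation uses internally.
(define a-as-codepoint "\x3042;")

;; Encoded forms of that same character, spelled out as explicit octets.
(define a-in-utf-8  (bytevector #xe3 #x81 #x82))
(define a-in-sjis   (bytevector #x82 #xa0))
(define a-in-euc-jp (bytevector #xa4 #xa2))

;; string->utf8 always produces the UTF-8 encoding, so on a
;; Unicode-capable implementation this comparison should be true.
(equal? (string->utf8 a-as-codepoint) a-in-utf-8)
```

The sjis and eucjp octets are the ones a legacy-mode \xNN escape could spell directly inside a string literal; in R7RS mode they can only be written outside of string syntax, for example as the bytevectors above.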

qykth-git commented 8 years ago

Here is my idea of those raw embedded octets.

\x and ; come from R7RS, and * comes from the incomplete string notation. The final ; is required to separate the embedded octets from the normal characters that follow. Use , to embed two or more octets at once.

;; one octet
"\x*1;" "\x*2;" "\x*a;" "\x*b;"
"\x*01;" "\x*02;" "\x*0a;" "\x*0b;"
"\x*01;\x*02;\x*03;"

;; multiple octets
"\x*1,2,3,4,a,b,c,d;"
"\x*01,23,34,ab,cd,ef;"

;; embed raw octets among normal characters
"foo\x*ab,cd,1,2,3,04,05,06;bar"
"\x0041;\x61;\x*41,61;" ;; => "A", "a", embedded 0x41, embedded 0x61

;; long line concatenation
   "foo\x*ab,cd,\
    01,02,03,\
    ab,cd;bar"

;; those are all embedded 0x01
"\x*1;"
"\x*001;"
"\x*000001;"

;; error
"\x*1ff;" ;; error: octet exceeds 0xff

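The rules illustrated above (comma-separated hex fields, leading zeros allowed, any value above 0xff rejected) can be sketched as a small decoder in portable R7RS Scheme. This is only an illustration of the proposed notation, not Gauche's reader: the procedure names are hypothetical, and the payload is assumed to be the text between \x* and the closing ; after any line continuations have already been removed.

```scheme
(import (scheme base))

;; Split STR at every occurrence of the character DELIM.
;; (Hypothetical helper; empty fields are kept so they can be rejected.)
(define (split-at-char str delim)
  (let loop ((chars (string->list str)) (field '()) (fields '()))
    (cond ((null? chars)
           (reverse (cons (list->string (reverse field)) fields)))
          ((char=? (car chars) delim)
           (loop (cdr chars) '()
                 (cons (list->string (reverse field)) fields)))
          (else
           (loop (cdr chars) (cons (car chars) field) fields)))))

;; Convert one hex field such as "1", "0a" or "000001" to an octet.
;; Note: string->number is more lenient than the proposal (it accepts a
;; leading "+", for instance); a real reader would check the digits itself.
(define (hex-field->octet field)
  (let ((n (string->number field 16)))
    (cond ((not (and n (exact-integer? n) (<= 0 n)))
           (error "malformed octet field:" field))
          ((> n #xff)
           (error "octet exceeds 0xff:" field))
          (else n))))

;; PAYLOAD is the text between "\x*" and ";", e.g. "ab,cd,1,2".
(define (octet-escape-payload->bytevector payload)
  (apply bytevector
         (map hex-field->octet (split-at-char payload #\,))))
```

For example, `(octet-escape-payload->bytevector "ab,cd,1,2,3,04,05,06")` returns `#u8(171 205 1 2 3 4 5 6)`, `"000001"` yields the single octet 1, and `"1ff"` signals an error.
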
shirok commented 8 years ago

I like it.

shirok / Gauche

Escape sequences for "invalid" characters #187