tobyink closed 10 years ago
FYI, many languages that have \uXXXX escapes also allow an eight-hex-digit escape like \UXXXXXXXX (note capital U). Python does; Turtle does.
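For illustration, here is a minimal Python sketch of the two escape forms mentioned above. The variable names are only for this example; the escapes themselves are standard Python string-literal syntax.

```python
# \uXXXX takes exactly four hex digits, so it only reaches U+0000..U+FFFF.
bmp_char = "\u00e9"          # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# \UXXXXXXXX takes exactly eight hex digits and can reach any code point.
astral_char = "\U0001F600"   # U+1F600, outside the Basic Multilingual Plane

# Each literal is a single code point in Python 3.
print(len(bmp_char), hex(ord(bmp_char)))
print(len(astral_char), hex(ord(astral_char)))
```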
:+1: I need correct spec for this.
:+1:
+1 for "\U01234567", because in TOML you cannot use a surrogate pair as the JSON spec suggests for this case: surrogates are invalid byte sequences in UTF-8.
Alternatively, you can store any Unicode character directly, since TOML is UTF-8.
JavaScript (or rather ECMAScript) is defined in such a way that it effectively forces any implementation to be UTF-16 based. JSON inherited some of that, making life difficult for everyone else.
I don't think TOML should repeat that mistake, if only because most UTF-8-based implementations will invariably get it wrong.
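To make the surrogate problem concrete, here is a rough Python sketch. The helper names `is_encodable_as_utf8` and `combine_surrogates` are made up for this example. It shows that a lone surrogate code point cannot be encoded as UTF-8, and what a UTF-8-based parser would have to do to turn a JSON-style surrogate pair back into one code point.

```python
def is_encodable_as_utf8(code_point):
    """Return True if the code point can be encoded as strict UTF-8."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Python's strict UTF-8 codec rejects surrogates (U+D800..U+DFFF).
        return False


def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# JSON would spell U+1F600 as "\ud83d\ude00"; each half alone is not valid UTF-8.
print(is_encodable_as_utf8(0xD83D))
print(hex(combine_surrogates(0xD83D, 0xDE00)))
```

With an eight-digit escape the parser can skip the pairing step and take the code point as written.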
Lua 5.3 (work) has \u{xxxxx}.
ES6 also uses \u{xxxxx}, and I think it's much better than \UXXXXXXXX.
But maybe it's too late...
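For comparison, here is a small Python sketch of a decoder that accepts all three spellings discussed in this thread. It is not taken from any TOML implementation: the function name `decode_escapes` and the choice of one to six hex digits inside the braces are assumptions for this example, and it ignores escaped backslashes for brevity.

```python
import re

# Hypothetical pattern: braced form first, then \U (8 digits), then \u (4 digits).
_ESCAPE = re.compile(
    r"\\u\{([0-9A-Fa-f]{1,6})\}"
    r"|\\U([0-9A-Fa-f]{8})"
    r"|\\u([0-9A-Fa-f]{4})"
)


def decode_escapes(text):
    """Replace \\u{...}, \\UXXXXXXXX and \\uXXXX escapes with their characters."""
    def replace(match):
        digits = match.group(1) or match.group(2) or match.group(3)
        code_point = int(digits, 16)
        # Reject values outside Unicode and surrogate code points.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"invalid code point in escape: {match.group(0)}")
        return chr(code_point)

    return _ESCAPE.sub(replace, text)


# All three spellings name the same character, U+1F600.
print(decode_escapes(r"\u{1F600}") == decode_escapes(r"\U0001F600"))
```

The braced form lets the writer omit leading zeros, while the \U form has a fixed width, so the parser always knows how many digits to read.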
U+0000 to U+FFFF is only a small portion of the Unicode space. It's the so-called "basic multilingual plane" (Plane 0), but Unicode defines 17 different planes, supporting characters all the way up to U+10FFFF.
Plane 1 (U+10000 to U+1FFFF) contains a bunch of mostly historic scripts (e.g. Egyptian hieroglyphs) but also a lot of mathematical and musical notation symbols.
Many Chinese, Japanese and Korean characters are in Plane 2 (U+20000 to U+2FFFF).
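As a quick illustration of the plane layout described above, here is a minimal Python sketch. The function name `plane_of` is made up for this example; it relies only on the fact that each plane spans 0x10000 code points.

```python
def plane_of(code_point):
    """Return the Unicode plane number (0..16) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    # Each plane holds 0x10000 code points, so the plane is the value above 16 bits.
    return code_point >> 16


print(plane_of(0x00E9))    # Plane 0, the Basic Multilingual Plane
print(plane_of(0x1D11E))   # Plane 1, MUSICAL SYMBOL G CLEF
print(plane_of(0x20000))   # Plane 2, first code point of the plane
print(plane_of(0x10FFFF))  # Plane 16, the last code point
```

Anything with a plane number above 0 needs more than four hex digits, which is why \uXXXX alone cannot express it.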