toml-lang / toml

Tom's Obvious, Minimal Language
https://toml.io
MIT License

\uXXXX escaping only covers Basic Multilingual Plane. #179

Closed tobyink closed 10 years ago

tobyink commented 11 years ago

U+0000 to U+FFFF is only a small portion of the Unicode space. It's the so-called "basic multilingual plane" (Plane 0), but Unicode defines 17 different planes, supporting characters all the way up to U+10FFFF.

Plane 1 (U+10000 to U+1FFFF) contains a bunch of mostly historic scripts (e.g. Egyptian hieroglyphs) but also a lot of mathematical and musical notation symbols.

Many Chinese, Japanese and Korean characters are in Plane 2 (U+20000 to U+2FFFF).
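
As a rough illustration of the plane arithmetic described above, here is a minimal Python sketch. The helper names `plane_of` and `fits_in_four_hex_digits` are hypothetical and chosen only for this example; they are not part of TOML or of any library.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point of Plane 16


def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0 to 16) containing the code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 0x10000 code points, so the plane is the value
    # of the bits above the low 16.
    return code_point >> 16


def fits_in_four_hex_digits(code_point: int) -> bool:
    """True if a \\uXXXX escape (four hex digits) can express the code point."""
    return code_point <= 0xFFFF


for cp in (0x0041, 0x13000, 0x1D11E, 0x20000):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, "
          f"fits in \\uXXXX: {fits_in_four_hex_digits(cp)}")
```

Only the first of the four sample code points (U+0041) lies in Plane 0; U+13000 (an Egyptian hieroglyph) and U+1D11E (a musical symbol) are in Plane 1, and U+20000 is in Plane 2, so none of those three can be written with four hex digits.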

tobyink commented 11 years ago

FYI, many languages that have \uXXXX escapes also allow an eight-hex-digit escape like \UXXXXXXXX (note capital U).

Python does; Turtle does.
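
For reference, a small sketch of how the two escape forms behave in Python 3 string literals. The variable names are only illustrative.

```python
# \uXXXX takes exactly four hex digits, so it tops out at U+FFFF.
e_acute = "\u00e9"

# \UXXXXXXXX takes exactly eight hex digits and reaches every plane.
g_clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, in Plane 1

# Writing five digits after a lowercase \u does not do what one might
# hope: the escape consumes four digits (U+1D11) and leaves a literal "E".
truncated = "\u1D11E"

print(len(e_acute), hex(ord(e_acute)))
print(len(g_clef), hex(ord(g_clef)))
print(len(truncated), [hex(ord(ch)) for ch in truncated])
```

The last case is the practical hazard of a four-digit-only escape: a character outside Plane 0 silently turns into two unrelated characters.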

tokuhirom commented 11 years ago

:+1: I need a correct spec for this.

pnathan commented 11 years ago

:+1:

ambv commented 11 years ago

+1 for "\U01234567" because in TOML you cannot use a surrogate pair like the JSON spec suggests for this case, because those are invalid byte sequences in UTF-8.

Alternatively, you can store all Unicode characters directly, since TOML is UTF-8.
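
A minimal Python sketch of the surrogate-pair point, assuming Python 3.9 or later. `surrogate_pair` and `is_utf8_encodable` are hypothetical helpers written for this example; `json.loads` and `str.encode` are the standard library calls.

```python
import json


def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def is_utf8_encodable(text: str) -> bool:
    """True if Python's strict UTF-8 encoder accepts the string."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# JSON spells U+1D11E as two \uXXXX escapes, one per surrogate.
high, low = surrogate_pair(0x1D11E)
json_escape = f'"\\u{high:04x}\\u{low:04x}"'
decoded = json.loads(json_escape)  # the JSON parser recombines the pair

print(json_escape, decoded == "\U0001D11E")
# A surrogate taken on its own is not a character and has no UTF-8 form.
print(is_utf8_encodable(chr(high)))
# The character itself encodes directly as four UTF-8 bytes.
print("\U0001D11E".encode("utf-8"))
```

The pair only makes sense if the parser recombines it, which is a UTF-16 notion; a UTF-8 document can carry the character directly.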

Leont commented 11 years ago

> +1 for "\U01234567" because in TOML you cannot use a surrogate pair like the JSON spec suggests for this case, because those are invalid byte sequences in UTF-8.

JavaScript (or rather ECMAScript) is defined in such a way that it effectively forces any implementation to be UTF-16-based. JSON inherited some of that, making life difficult for everyone else.

I don't think TOML should repeat that mistake, if only because most UTF-8-based implementations will invariably get it wrong.

anders commented 10 years ago

Lua 5.3 (work) has \u{xxxxx}.

hax commented 8 years ago

ES6 also uses \u{xxxxx} and I think it's much better than \UXXXXXXXX. But maybe it's too late...
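
To compare the escape forms mentioned in this thread side by side, here is a rough Python sketch of a decoder that accepts all three. It is only an illustration: `unescape` is a hypothetical function, the one-to-six-digit limit on the brace form is an assumption, and a real parser would also have to handle escaped backslashes and the other string escapes.

```python
import re

# Alternatives in order: \u{X...} (1 to 6 hex digits, assumed limit),
# \UXXXXXXXX (exactly 8), \uXXXX (exactly 4).
_ESCAPE = re.compile(
    r"\\u\{([0-9A-Fa-f]{1,6})\}"
    r"|\\U([0-9A-Fa-f]{8})"
    r"|\\u([0-9A-Fa-f]{4})"
)


def unescape(text: str) -> str:
    """Replace Unicode escapes with the characters they name."""

    def replace(match: re.Match) -> str:
        digits = match.group(1) or match.group(2) or match.group(3)
        code_point = int(digits, 16)
        if code_point > 0x10FFFF:
            raise ValueError(f"beyond U+10FFFF: {digits}")
        if 0xD800 <= code_point <= 0xDFFF:
            # Surrogates are rejected rather than paired up.
            raise ValueError(f"surrogate code point: {digits}")
        return chr(code_point)

    return _ESCAPE.sub(replace, text)


# Raw strings keep the backslashes so that unescape() sees them.
for source in (r"\u00e9", r"\U0001D11E", r"\u{1D11E}"):
    print(source, "->", hex(ord(unescape(source))))
```

The brace form needs no zero padding, while the eight-digit form always carries at least two leading zeros, because no code point exceeds U+10FFFF.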