Allow unicode characters outside the Basic Multilingual Plane

dylanahsmith commented 7 years ago

Currently a GraphQL document is only allows a SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/ and EscapedUnicode :: /[0-9A-Fa-f]{4}/ also prevents unicode characters above U+FFFF from being included into a GraphQL string.

Unicode code points are actually in the range 0 to 0x10FFFF. For example, unicode emoji characters like 😀 (U+1F600) have code points above U+FFFF.

Is there any reason why the source document doesn't allow unicode characters above U+FFFF? Or can we remove that restriction? Without that restriction the limitation of the unicode escape doesn't seem problematic.

If supporting a unicode escape for all unicode characters is desired, then one way of handling that is the way swift supports unicode escapes:

An arbitrary Unicode scalar, written as \u{n}, where n is a 1–8 digit hexadecimal number with a value equal to a valid Unicode code point

chris-morgan commented 7 years ago

I also was reading the spec and realised this. Given the paragraph around it:

GraphQL documents are expressed as a sequence of Unicode characters. However, with few exceptions, most of GraphQL is expressed only in the original non‐control ASCII range so as to be as widely compatible with as many existing tools, languages, and serialization formats as possible and avoid display issues in text editors and source control.

It sounds to me an error of ignorance rather than intent.

Rust formerly had \uXXXX and \UXXXXXXXX, but changed to \u{x} (the same as Swift does) some time before Rust 1.0.

JSON (which is probably the main inspiration for the GraphQL syntax) does \uXXXX and uses the abomination that is UTF-16 surrogate pairs as a way of representing higher-order characters, e.g. U+1F600 (😀) is escaped as "\ud83d\ude00" in JSON.

Fortunately you can avoid that insanity by simply expressing values literally. There’s no real need for the escapes anyway once you get past U+001F (\u00XX) and U+0022 (\"). (Unless you deal with combining characters that will attach to a string’s quotation marks, which is fearfully ugly and points out the grammatical problem of parsing by codepoint rather than grapheme cluster, but this is all more advanced stuff that we wish wouldn’t happen in real life, anyway.)

Also currently the escapes listed (EscapedCharacter) match those of JSON. (I think. As to the interpretation, what the GraphQL spec actually says is that \f would be U+0066, “f”, rather than U+000C which is what we all know it’s supposed to be. It’s really badly written.) Given that general tie, supporting \uXXXX might not be a terrible idea, with or without \u{X}.

The definition of the handling of EscapedUnicode is also extremely tacky, with spelling errors, poorly defined terms, &c.:

Return the character value represented by the UTF16 hexidecimal identifier EscapedUnicode.

What does that even mean? Seriously, that doesn’t make sense.

This stuff all suggests to me that it was written by someone with a poor understanding of Unicode. This spec gravely needs both editorial and technical review.

I want to see how different implementations parse:

"\ud83d\ude00": nonsensical in the current specification. If GraphQL wants to be like JSON, handling it as UTF-16 surrogate pairs is probably a good idea. If not (please don’t go for surrogate pairs!), the grammar needs to be changed to allow for the supplemental planes (such as via \u{1F600}).
"😀": illegal in the current specification, shouldn’t tokenise. However, I hope that implementations accept it and treat it as a string containing the code point U+1F600.

leebyron commented 7 years ago

Thanks for bringing this up! Great thought process already happening.

I agree that surrogate pairs is an obtuse API. I'd like to avoid it if possible, though there is one serious upside to consider: it mirrors JSON. That might not be enough to motivate it as the solution, but it certainly shouldn't be discredited.

Here are some action items:

[x] Editing of the language section describing Unicode to correct and clarify.
[x] Propose expanding the parsable character set to all represented in latest Unicode including supplemental planes.
[ ] Propose a new escape sequence for string literals (or prescribe to always use Unicode characters directly) for supplemental planes.

leebyron commented 7 years ago

@dylanahsmith and @chris-morgan I'd love your feedback on #231

Nabellaleen commented 5 years ago

:+1: for this spec' !

To be able to build an enum like

enum MOOD {
  😩
  😞
  😕
  😐
  🙂
  😃
}

chris-morgan commented 5 years ago

@Nabellaleen Allowing emoji in an identifier is a completely different thing from allowing it in a source document, which is mostly for the sake of strings. And allowing emoji in identifiers is generally a poor idea; most languages stick with UAX #31’s definition for identifiers.

andimarek commented 5 years ago

I would be interested to revive this discussion: I don't see a reason for restricting it and we already see implementations and APIs having descriptions with emojis in it (e.g. github API).

andimarek commented 5 years ago

fyi: graphql-ruby supports all unicode chars (cc @rmosolgo) and we decided to do the same for GraphQL Java.

lumberman commented 4 years ago

Build fail if you have emoji in the path.

andimarek commented 4 years ago

I created a new issue which outlines proposed changes to the spec to allow for full unicode support: #687

graphql / graphql-spec

Allow unicode characters outside the Basic Multilingual Plane #214