peblair opened this issue 3 years ago (status: Open)
I'm currently working on emitting JSON in Grain, and for escaping I need to generate a UTF-16 surrogate pair from a Unicode codepoint. And of course vice versa, but parsing is still a long way off.
@cician My question is a tad unrelated to this issue, but I'm not sure I understand—for what you're trying to accomplish, why do you need to make surrogate pairs? Grain strings are UTF-8.
Actually I don't strictly need it, because for conforming JSON output in UTF-8 only the control characters (ASCII codes 0-31), plus the quotation mark and backslash, need to be escaped; but I've tentatively added an option to escape all non-ASCII characters.
The ECMA-404 spec (https://www.ecma-international.org/publications-and-standards/standards/ecma-404/) says that escapes for codepoints outside the Basic Multilingual Plane are written as UTF-16 surrogate pairs, unless I misunderstand something. I'm learning about both Unicode and Grain in the process. I think it's a consequence of JSON inheriting some properties from JavaScript, which doesn't use UTF-8 internally; that spills over into how escaping is done in JavaScript strings, and thus in JSON.
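For reference, the arithmetic behind the pairing is small. Here is a minimal sketch in Java (Java only because the JDK comes up below; the class name is just for illustration):

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        int cp = 0x1D11E; // U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

        // Codepoints above U+FFFF are split into two UTF-16 code units:
        int v = cp - 0x10000;            // remaining 20-bit value
        int high = 0xD800 + (v >>> 10);  // top 10 bits -> high surrogate
        int low  = 0xDC00 + (v & 0x3FF); // bottom 10 bits -> low surrogate

        System.out.printf("\\u%04X\\u%04X%n", high, low); // prints \uD834\uDD1E
    }
}
```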
PS: I'm working on it here.
Ah I see, it's the specification for unicode character escapes that appear within JSON object strings. Got it. That's interesting! So you'd want a utility like Char.escapeSurrogatePair : Char -> String that would take a char and return its unicode escape as a surrogate pair, e.g. assert Char.escapeSurrogatePair('𝄞') == "\\uD834\\uDD1E"? That'd differ from Char.escape, which would just produce "\\u{1D11E}" for regular Grain strings, yeah? Or I guess it could just be called escapeUtf16.
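To make the difference between the two concrete, here is a hedged Java sketch. The escape and escapeSurrogatePair shapes mirror the Grain proposal above and are hypothetical; Character.highSurrogate and Character.lowSurrogate are real JDK methods:

```java
public class CharEscapes {
    // Hypothetical Grain-style escape: a single \u{...} codepoint escape.
    static String escape(int cp) {
        return String.format("\\u{%X}", cp);
    }

    // Hypothetical JSON-style escape: two \uXXXX escapes forming the
    // UTF-16 surrogate pair for a non-BMP codepoint.
    static String escapeSurrogatePair(int cp) {
        return String.format("\\u%04X\\u%04X",
            (int) Character.highSurrogate(cp),
            (int) Character.lowSurrogate(cp));
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // '𝄞'
        System.out.println(escape(clef));              // \u{1D11E}
        System.out.println(escapeSurrogatePair(clef)); // \uD834\uDD1E
    }
}
```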
For now I've just copied a few lines from OpenJDK's source to do the job, but I should probably remove them to avoid copyright/licensing issues.
I don't think escapeUtf16 makes much sense as a standalone function, as opposed to being part of the JSON-specific code, unless we want to build a library like this: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.
In Java's standard library there are simply two functions like this:

```java
char highSurrogate(int codePoint);
char lowSurrogate(int codePoint);
```

In Grain it wouldn't make sense to return Char, though. These would rather be plain numbers with their own specific meaning in Unicode terminology.
@peblair I am currently trying to implement the Unicode-aware functions by generating code based on the Unicode data files (for example https://unicode.org/Public/UNIDATA/UnicodeData.txt). This results in several thousand lines of Map.set code, and I read in the Contributing instructions that it should all be contained in a single file. Can I extract the code to another file for readability purposes, or should I just put it all in the char file?
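For reference, each row of UnicodeData.txt is a semicolon-separated record where field 0 is the codepoint in hex and field 2 is the general category. A minimal Java sketch of the kind of parsing a generator would do (the file path and the category filter are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class UnicodeDataParse {
    public static void main(String[] args) throws IOException {
        // Assumes a local copy of https://unicode.org/Public/UNIDATA/UnicodeData.txt.
        // Note: ranged entries (name field ending in ", First>"/", Last>")
        // need special handling not shown here.
        for (String line : Files.readAllLines(Paths.get("UnicodeData.txt"))) {
            String[] fields = line.split(";", -1);
            int cp = Integer.parseInt(fields[0], 16); // field 0: codepoint (hex)
            String category = fields[2];              // field 2: general category, e.g. "Lu"
            // A real generator would bucket cp by category here instead of printing.
            if (category.equals("Zs")) {
                System.out.printf("U+%04X is a space separator%n", cp);
            }
        }
    }
}
```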
@FinnRG Thanks for doing some work on this! I think it would make sense to have the data in a separate file, but we may want to hold off on the effort briefly. Once #1330 lands, we will have a more coherent way of working with WASM data sections in Grain, which I think can give us a much more efficient way of storing the data in UnicodeData.txt (that way we avoid having thousands of Map.set calls on startup).
Rust has this little tool for generating efficient bitsets and functions from the spec: https://github.com/rust-lang/rust/tree/master/src/tools/unicode-table-generator. I think with a minimal amount of work we could have it generate Grain code instead.
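To sketch the general technique (not that tool's exact output format, which uses more compact bitset/skiplist encodings): a per-codepoint property can be compiled to sorted, non-overlapping ranges plus a binary search, rather than one map entry per codepoint. In Java for consistency with the snippets above; the table here is a tiny real excerpt of the Nd (decimal digit) ranges:

```java
public class RangeTable {
    // Tiny excerpt of Unicode Nd ranges (ASCII, Arabic-Indic, Extended
    // Arabic-Indic digits); a generator would emit the full table.
    static final int[] STARTS = { 0x0030, 0x0660, 0x06F0 };
    static final int[] ENDS   = { 0x0039, 0x0669, 0x06F9 };

    static boolean isInTable(int cp) {
        int lo = 0, hi = STARTS.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (cp < STARTS[mid]) hi = mid - 1;
            else if (cp > ENDS[mid]) lo = mid + 1;
            else return true; // STARTS[mid] <= cp <= ENDS[mid]
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isInTable('7'));    // true  (U+0037)
        System.out.println(isInTable(0x0661)); // true  (ARABIC-INDIC DIGIT ONE)
        System.out.println(isInTable('A'));    // false
    }
}
```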
It would be useful for Char to support a variety of Unicode-aware query functions and conversion functions (toUpper, isPunctuation, etc.). For example, these are the ones supported by Racket here and here. We should try to add as many of these as possible, as one never knows what might be useful for libraries.
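As a point of comparison only (not a proposal for the Grain API shape), the two examples named above map onto the JDK's own tables like this; Character.toUpperCase and Character.getType are real JDK methods:

```java
public class CharQueries {
    // Conversion: simple uppercase mapping for a full codepoint.
    static int toUpper(int cp) {
        return Character.toUpperCase(cp);
    }

    // Query: true for any of the seven Unicode punctuation categories
    // (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    static boolean isPunctuation(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }
}
```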