Char Unicode Data and Conversions

grain-lang / grain

The Grain compiler toolchain and CLI. Home of the modern web staple. 🌾

https://grain-lang.org/

GNU Lesser General Public License v3.0

Char Unicode Data and Conversions #661

Open peblair opened 3 years ago

peblair commented 3 years ago

It would be useful for Char to support a variety of Unicode-aware query functions and conversion functions (toUpper, isPunctuation, etc). For example, these are the ones supported by Racket here and here.

We should try to add as many of these as possible, as one never knows what might be useful for libraries.
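
To make the request concrete, here is a minimal sketch of what two such functions compute, written in Python with its standard `unicodedata` module rather than Grain. The names `is_punctuation` and `to_upper` are hypothetical stand-ins for the proposed `Char` functions, not an existing Grain API.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    """Hypothetical analogue of Char.isPunctuation.

    Every Unicode general category that starts with "P"
    (Pc, Pd, Ps, Pe, Pi, Pf, Po) is a punctuation category.
    """
    return unicodedata.category(ch).startswith("P")


def to_upper(ch: str) -> str:
    """Hypothetical analogue of Char.toUpper.

    Returns a string because full case mapping can expand one
    character into several (for example, "ß" uppercases to "SS").
    """
    return ch.upper()
```

One design point this surfaces: a `toUpper` that returns a single `Char` can only implement simple case mapping, since full case mapping may produce more than one character.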

cician commented 3 years ago

I'm currently working on emitting JSON in Grain, and for escaping I need to generate a UTF-16 surrogate pair from a Unicode code point. And of course vice versa, but parsing is still a long way off.

ospencer commented 3 years ago

@cician My question is a tad unrelated to this issue, but I'm not sure I understand—for what you're trying to accomplish, why do you need to make surrogate pairs? Grain strings are UTF-8.

cician commented 3 years ago

Actually I don't strictly need it, because only the quotation mark, the backslash, and the control characters 0-31 need to be escaped for conforming JSON output in UTF-8, but I've tentatively added an option to escape all non-ASCII characters.

The ECMA-404 spec (https://www.ecma-international.org/publications-and-standards/standards/ecma-404/) says the escaping should be done in UTF-16 pairs, unless I misunderstand something. I'm learning about both Unicode and Grain in the process. I think it's a consequence of the fact that JSON inherits some properties from JavaScript, which doesn't use UTF-8 internally. That spills over into how escaping is done in JavaScript strings and thus in JSON.

PS: I'm working on it here.
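
As an illustration of the behavior described above, Python's standard `json` module offers the same optional "escape all non-ASCII" mode and emits a UTF-16 surrogate pair for characters outside the Basic Multilingual Plane. This only demonstrates the JSON convention; it says nothing about how the Grain implementation is written.

```python
import json

# MUSICAL SYMBOL G CLEF (U+1D11E), a code point above U+FFFF.
clef = "\U0001D11E"

# ensure_ascii=True is the default: non-ASCII characters are written
# as \uXXXX escapes, using a surrogate pair for code points > U+FFFF.
escaped = json.dumps(clef)

# With ensure_ascii=False the character is emitted as-is, which is
# also valid JSON when the output is encoded as UTF-8.
raw = json.dumps(clef, ensure_ascii=False)

# Both forms decode back to the same single character.
round_trip = json.loads(escaped)
```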

ospencer commented 3 years ago

Ah I see, it's the specification for Unicode character escapes that appear within JSON object strings. Got it. That's interesting! So you'd want a utility like Char.escapeSurrogatePair : Char -> String that would take a char and return its Unicode escape as a surrogate pair, e.g. assert Char.escapeSurrogatePair('𝄞') == "\\uD834\\uDD1E"? That'd differ from Char.escape which would just produce "\\u{1D11E}" for regular Grain strings, yeah?

ospencer commented 3 years ago

Or I guess it could just be called escapeUtf16.

cician commented 3 years ago

For now I've just copied a few lines from OpenJDK's source to do the job, but I should probably remove it to avoid copyright/licensing issues.

I don't think escapeUtf16 makes much sense as a standalone function, as opposed to being part of the JSON-specific code, unless we want to build a library like this: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.

In Java's standard library (java.lang.Character) there are simply two static methods like this:

char highSurrogate(int codePoint);
char lowSurrogate(int codePoint);

In Grain it wouldn't make sense to return Char though. These would rather be just numbers with their own specific meaning in Unicode terminology.
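
For reference, the arithmetic behind those two functions is short enough to write from the UTF-16 definition alone. The sketch below is in Python, returns plain integers as suggested above, and uses hypothetical names (`high_surrogate`, `low_surrogate`, `escape_utf16`); it is not taken from OpenJDK or from Grain.

```python
def _check_supplementary(code_point: int) -> None:
    # Only code points U+10000..U+10FFFF are encoded as surrogate pairs.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a supplementary code point: {code_point:#x}")


def high_surrogate(code_point: int) -> int:
    """Leading UTF-16 code unit (0xD800..0xDBFF) as a plain number."""
    _check_supplementary(code_point)
    return 0xD800 + ((code_point - 0x10000) >> 10)


def low_surrogate(code_point: int) -> int:
    """Trailing UTF-16 code unit (0xDC00..0xDFFF) as a plain number."""
    _check_supplementary(code_point)
    return 0xDC00 + ((code_point - 0x10000) & 0x3FF)


def escape_utf16(ch: str) -> str:
    """JSON-style \\uXXXX escape, using a surrogate pair when needed."""
    cp = ord(ch)
    if cp < 0x10000:
        return f"\\u{cp:04X}"
    return f"\\u{high_surrogate(cp):04X}\\u{low_surrogate(cp):04X}"
```

Subtracting 0x10000 leaves a 20-bit value; its top 10 bits go into the high surrogate and its bottom 10 bits into the low surrogate.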

FinnRG commented 2 years ago

@peblair I am currently trying to implement the Unicode-aware functions by generating code based on the Unicode data files (for example https://unicode.org/Public/UNIDATA/UnicodeData.txt). This results in several thousand lines of Map.set code, and I read in the Contributing instructions that it should all be contained in a single file. Can I extract the code to another file for readability purposes, or should I just put it all in the char file?

peblair commented 2 years ago

@FinnRG Thanks for doing some work on this! I think it would make sense to have the data in a separate file, but we may want to hold off on the effort briefly. Once #1330 lands, we will have a more coherent way of working with WASM data sections in Grain, which I think can give us a much more efficient way of storing the data in UnicodeData.txt (that way we avoid having thousands of Map.set calls on startup).
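
To illustrate why a static table can replace thousands of Map.set calls, here is a minimal Python sketch of one common alternative: a sorted list of code point ranges searched with binary search. The three-entry table is a hand-picked excerpt for illustration only, and the layout is an assumption; it is not the format that #1330 or the Grain standard library uses.

```python
from bisect import bisect_right
from typing import Optional

# (first, last, general category), sorted by `first`.
# Illustrative excerpt only; a real table would be generated
# from UnicodeData.txt.
RANGES = [
    (0x30, 0x39, "Nd"),  # ASCII digits
    (0x41, 0x5A, "Lu"),  # ASCII uppercase letters
    (0x61, 0x7A, "Ll"),  # ASCII lowercase letters
]
STARTS = [first for first, _, _ in RANGES]


def category(code_point: int) -> Optional[str]:
    """Look up the category via binary search over the range starts."""
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0:
        _first, last, cat = RANGES[i]
        if code_point <= last:
            return cat
    return None  # not covered by this excerpt
```

Because consecutive code points usually share a category, the table stores one entry per run instead of one per character, and it is constant data that needs no initialization at startup.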

spotandjake commented 11 months ago

Rust has this little tool for generating efficient bitsets and functions from the spec: https://github.com/rust-lang/rust/tree/master/src/tools/unicode-table-generator. I think with a minimal amount of work we could have it generate Grain code instead.