waxeye-org / waxeye

Waxeye is a parser generator based on parsing expression grammars (PEGs). It supports C, Java, JavaScript, Python, Racket, and Ruby.
https://waxeye-org.github.io/waxeye/index.html

Unicode support RFC #31

Closed glebm closed 7 years ago

glebm commented 7 years ago

This RFC proposes new Waxeye grammar syntax to support Unicode character escapes, ranges, and classes.

It also discusses possible implementations for C and JavaScript.

Context

Currently (prior to this RFC), Unicode support in the grammar is limited and language-dependent. For example, specifying a Unicode character and then compiling to C results in invalid C code, and compiling to JavaScript results in valid code only if the Unicode character is less than 3 bytes wide. No grammar syntax exists for specifying Unicode character escapes, ranges, and classes. Sequences of single-byte characters can be used to emulate Unicode character escapes, but they only work in C (see #30).

This RFC's goal is to make specifying encoding-independent Unicode in the grammar possible and easy.

Unicode in the grammar

Unicode characters in the grammar are supported, e.g. GreekVar <- [α-ο] is now valid. The notion of a "character" / "unit" in the grammar, such as what is matched by ., is now defined as a single Unicode codepoint. For example, . matches 🚀 (U+1F680) as a single unit, even though its encoded form spans multiple bytes (UTF-8) or code units (UTF-16).

Unicode character classes

Update: We decided not to support \p Unicode property testing. Instead, these will be expressed as regular non-terminals in the grammar, and the parser generator will be optimized to support large character classes efficiently.

The proposed syntax is a subset of the ICU RegExp syntax.

Backwards-incompatible grammar changes:

Supported character class features

The grammar is extended to support some Unicode character escapes and character classes.

Character Description
\x{hhhh} Match the character with hex value hhhh. From one to six hex digits may be supplied.
\t Match a HORIZONTAL TABULATION, \x{0009}.
\n Match a LINE FEED, \x{000A}.
\r Match a CARRIAGE RETURN, \x{000D}.
\p{UNICODE PROPERTY NAME} Match any character with the specified Unicode Property.
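For example, the escapes above could be used inside character classes like ordinary characters (a sketch; the rule names are illustrative, not part of the RFC):

Tab       <- [\t]
Rocket    <- [\x{1F680}]
AnyAstral <- [\x{10000}-\x{10FFFF}]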

Set expressions

Set expressions are supported, except:

  1. Negation is not supported (^, \P). Use Waxeye's regular negation operator (!) outside of the character class instead.
  2. POSIX-like syntax for properties ([:script=Greek:]) is not supported.

The following set expressions are supported:

Example Description
[abc] Match any of the characters a, b or c
[A-M] Range - match any character from A to M. The characters to include are determined by Unicode code point ordering.
[\x{0000}-\x{10ffff}] Also a range.
[\p{Letter}] [\p{General_Category=Letter}] [\p{L}] Characters with Unicode Category = Letter. All forms shown are equivalent.
[\p{numeric_value=9}] Match all numbers with a numeric value of 9. Any Unicode Property may be used in set expressions.
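As an illustration of how such classes would compose with ordinary rules (using the originally proposed \p syntax; the Identifier rule is hypothetical):

Identifier <- [\p{Letter}_] *[\p{Letter}\p{Decimal_Number}_]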

Stretch goal:

Example Description
[\p{Letter}&&\p{script=cyrillic}] Logical AND or intersection. Match the set of all Cyrillic letters.
[\p{Letter}--\p{script=latin}] Subtraction. Match all non-Latin letters.

Unicode word characters

\w is not supported as it is easily confused with [A-Za-z0-9_]. To match a Unicode word character, use:

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]

Unicode digits

\d is not supported as it is easily confused with [0-9]. To match a Unicode digit, use:

\p{Decimal_Number}

Unicode whitespace

\s is not supported. To match a Unicode whitespace character, use:

[\t\r\n\x{000C}\p{Z}]

Unicode newline

\R is not supported because it matches a sequence of codepoints. To match a Unicode newline, use:

\r\n | [\x{000a}-\x{000d}\x{0085}\x{2028}\x{2029}]

Case-insensitive matching

Case-insensitive Unicode matching (the " literal syntax) uses simple matching. In Unicode lingo, "simple matching" means matching a single character at a time, with no length-changing case folding, so e.g. "fußball" does not match "FUSSBALL".
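For illustration only (JavaScript, not part of the RFC), full case mapping is what would be needed for that pair to match:

'fußball'.toUpperCase(); // 'FUSSBALL' (full case mapping expands ß to SS)
// Simple matching compares one codepoint at a time, so 'ß' never matches 'SS'.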

Language support

Overall:

  1. We don't have to implement all the generators at once. Generators that do not yet support the new syntax should fail at build time on encountering it.
  2. Native strings are used in the public API where possible.
  3. The parser now needs to iterate over Unicode codepoints, not characters (e.g. in JavaScript a single codepoint can be up to two String characters long).
  4. The AST nodes currently have a field that indicates the token position in the input. The value of that field should still be the native string index (see the sketch after this list).
  5. Not all languages / libraries support all types of properties for Unicode property classes. Generally, all support \p{General_Category=} and \p{script=}. Some support \p{block=}. Few support anything else natively. Support for this per-language should be best-effort.
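A sketch of points 3 and 4 in JavaScript (hypothetical, not part of the RFC): the parser advances one codepoint at a time while recording native UTF-16 string indices for AST positions.

// Iterate over codepoints while tracking the native string index.
function* codePointsWithIndex(str) {
    let pos = 0;
    for (const c of str) {
        yield { codePoint: c.codePointAt(0), index: pos };
        pos += c.length; // 1 for BMP codepoints, 2 for astral ones
    }
}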

C

The UTF-8 encoding is rather easy to work with without any dependencies, so the existing runtime can be modified to assume it.

Character classes would require some optimization. We have an idea about this described in a comment below.

JavaScript

Unicode regular expressions are supported since ES6.

However, \p character classes are not yet supported (proposal status: https://github.com/tc39/proposal-regexp-unicode-property-escapes).

With this proposal, \p escapes can be produced at grammar generation time using the unicode-10.0.0 Unicode data package for JavaScript together with the regenerate JavaScript Unicode character class generator.

Example:

const regenerate = require('regenerate');
const codePoints = require('unicode-10.0.0/General_Category/Letter/code-points.js');
const set = regenerate(codePoints);
set.toString({ hasUnicodeFlag: true });
// 7 KiB string → '[A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377...-\u{2B734}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}]'
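The resulting class can then be embedded in a generated matcher, for example (illustrative only):

const letterRegex = new RegExp('^' + set.toString({ hasUnicodeFlag: true }) + '$', 'u');
letterRegex.test('π'); // true
letterRegex.test('1'); // false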

At build time, regular expressions are embedded into the parser using the process above.

The runtime will also need to change to iterate over codepoints instead of characters, because a single Unicode codepoint in JavaScript may be represented using up to two UTF-16 code units.

Luckily, ES6 supports both codepoint iteration and addressing, for example:

let str = 'a\uD83D\uDE80b'; // a🚀b

// Iterate over codepoints:
for (let c of str) {
    console.log(c.length); // 1 2 1
}

// Address codepoints:
console.log(str.codePointAt(0).toString(16)); // 61
console.log(str.codePointAt(1).toString(16)); // 1f680
console.log(str.codePointAt(3).toString(16)); // 62

// Split codepoints:
[...str] // ["a", "🚀", "b"]

The runtime will change to use these slower but correct methods. It may be possible to implement these so that the performance impact is minimal if the parsed string is mostly in the Basic Multilingual Plane.
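One possible fast path (a sketch, not part of the RFC): advance by one UTF-16 code unit for BMP characters and only pay the surrogate-pair cost when a high surrogate is seen.

// Hypothetical helper: returns the string index of the next codepoint.
function nextCodePointIndex(str, i) {
    const c = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one astral codepoint.
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < str.length) {
        const d = str.charCodeAt(i + 1);
        if (d >= 0xDC00 && d <= 0xDFFF) return i + 2;
    }
    return i + 1;
}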

Other languages

/cc @orlandohill @ddrone @jishi9 @adabru

orlandohill commented 7 years ago

Excellent!

I agree with most of this.

Unicode character classes

I'm fine with dropping the old hex syntax entirely. The project is still in the early stages of user adoption. Better to force a single syntax.

C

It's probably too early to have two C runtimes. The current version still needs to be moved to the new parsing machine design. Let's stick to one, and move it to Unicode. An ASCII-only version could always be developed later, if there's a need.

JavaScript

If calling codePointAt on long strings is slow, perhaps all the code points can be collected once using the iterator.

Other languages

SML - No need to worry about SML. That code is for documenting the derivation of the new parsing machine.

adabru commented 7 years ago

Unicode properties and their combinations can be expressed with grammar symbols, I think. Given the disadvantage of varying support between languages, maybe they are not worth implementing?

orlandohill commented 7 years ago

Good point, Adam.

Since Waxeye has modular grammars, Unicode properties could be defined in one or more sub-grammars, and reused by other grammars when needed.

Best to stick to the goal of "write once, parse anywhere".

orlandohill commented 7 years ago

With Unicode properties defined as non-terminals, could the intersection and subtraction examples be rewritten as follows?

&Letter &ScriptCyrillic .
&Letter !ScriptLatin .

The generator could have an optimization to inline non-recursive void type non-terminals, so overhead shouldn't be an issue.

glebm commented 7 years ago

I'm fine with dropping the old hex syntax entirely.

Updated.

It's probably too early to have two C runtimes.

Updated the C runtime section.

If calling codePointAt on long strings is slow, perhaps all the code points can be got once using the iterator.

I was mistaken about how codePointAt works; it's actually fast, as it accepts a string index as an argument (not the codepoint index).


Unicode properties as non-terminals

Unicode properties and their combinations can be expressed with grammar symbols, I think. -- @adabru

Since Waxeye has modular grammars, Unicode properties could be defined in one or more sub-grammars, and reused by other grammars when needed. -- @orlandohill

The reason I originally suggested built-in expressions for this is that these non-terminal definitions would be huge: there is well over 1MiB of codepoints for just the Letter definition.

After giving some more thought to this, I think there is a way to make this work.

JavaScript / Ruby / etc

For languages not "close to the metal", testing this many characters directly would be very slow, and would bloat the output massively. For these languages, we can convert all character classes to regular expressions.

These regular expressions will still be a bit large. The optimized Letter Regex in JavaScript is 7KiB in ES5, and still almost as large in ES6 (which fully supports Unicode escapes for characters outside of the Basic Multilingual Plane).

Eventually, perhaps the generator for a given language could even generate a \p{Letter} when optimizing its output.

C

For C, none of the above is a problem. Actually, this allows us to support C without requiring any dependencies like ICU.

Basic idea: Only support UTF-8 and parse it manually (it's easy).
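A minimal sketch of the decoding step (shown in JavaScript for brevity; the actual runtime would do the same in C, and validation of malformed sequences is omitted):

// Decode the UTF-8 sequence starting at bytes[i]; returns the codepoint
// and the number of bytes consumed.
function decodeUtf8(bytes, i) {
    const b0 = bytes[i];
    if (b0 < 0x80) return { cp: b0, len: 1 };  // 0xxxxxxx: ASCII
    if (b0 < 0xE0) return {                    // 110xxxxx 10xxxxxx
        cp: ((b0 & 0x1F) << 6) | (bytes[i + 1] & 0x3F), len: 2 };
    if (b0 < 0xF0) return {                    // 1110xxxx 10xxxxxx 10xxxxxx
        cp: ((b0 & 0x0F) << 12) | ((bytes[i + 1] & 0x3F) << 6) | (bytes[i + 2] & 0x3F), len: 3 };
    return {                                   // 11110xxx followed by three continuation bytes
        cp: ((b0 & 0x07) << 18) | ((bytes[i + 1] & 0x3F) << 12) |
            ((bytes[i + 2] & 0x3F) << 6) | (bytes[i + 3] & 0x3F), len: 4 };
}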

We do need to optimize character class matching in C as well, or we'll get 10MiB parser.c files.

Luckily, UTF-8 is a length-limited prefix-free code, so it is possible to generate the optimal "matching strategy" with a bit of effort: since no encoded codepoint is a prefix of another, a character class can be matched byte-by-byte with a decision tree over byte ranges, without decoding first.

A proof of concept of this (done by @jishi9) takes the 10MiB parser.c file to < 100KiB, and that's not even applying all the possible optimizations, so I'm optimistic about going with this.

adabru commented 7 years ago

For languages not "close to the metal", testing this many characters directly would be very slow, and would bloat the output massively. For these languages, we can convert all character classes to regular expressions.

Regular expression substitution need not stop at character classes: sequences, alternations, and/not predicates, etc. can also be converted to regexes when (part of) their children are regex-substitutable. When optimizing a waxeye-like parser I wrote in JavaScript, substitution with regexes (https://github.com/adabru/adabru-parser/blob/7b800a89390054c148ef9ed3f3023ed04ae19da6/abpv1.ls#L337-L380) yielded the best optimization effect (~30% faster) among my efforts. I agree that converting character classes to regexes will yield a performance benefit.

jishi9 commented 7 years ago

With Unicode properties defined as non-terminals, could the intersection and subtraction examples be rewritten as follows?

&Letter &ScriptCyrillic .
&Letter !ScriptLatin .

Not quite, since a Unicode character can be between 1 and 4 bytes - see the table at: https://en.wikipedia.org/wiki/UTF-8#Description

It's easy to write a rule matching a code point though, and then it would look something like this:

&Letter &ScriptCyrillic CodePoint
&Letter !ScriptLatin CodePoint

Or I suppose it could even be written as:

&ScriptCyrillic Letter
!ScriptLatin Letter

glebm commented 7 years ago

@jishi9 The new grammar describes Unicode codepoints, not UTF-8 bytes, so it is possible.

glebm commented 7 years ago

Closing this and opening separate issues for the individual action items as discussed.