Excellent!
I agree with most of this.
I'm fine with dropping the old hex syntax entirely. The project is still in the early stages of user adoption. Better to force a single syntax.
It's probably too early to have two C runtimes. The current version still needs to be moved to the new parsing machine design. Let's stick to one, and move it to Unicode. An ASCII-only version could always be developed later, if there's a need.
If calling `codePointAt` on long strings is slow, perhaps all the code points could be obtained once using the iterator.
SML - No need to worry about SML. That code is for documenting the derivation of the new parsing machine.
I think Unicode properties and their combinations can be expressed with grammar symbols. Given the disadvantage of varying support between languages, maybe they're not worth implementing?
Good point, Adam.
Since Waxeye has modular grammars, Unicode properties could be defined in one or more sub-grammars, and reused by other grammars when needed.
Best to stick to the goal of "write once, parse anywhere".
With Unicode properties defined as non-terminals, could the intersection and subtraction examples be rewritten as follows?
&Letter &ScriptCyrillic .
&Letter !ScriptLatin .
The generator could have an optimization to inline non-recursive void type non-terminals, so overhead shouldn't be an issue.
> I'm fine with dropping the old hex syntax entirely.
Updated.
> It's probably too early to have two C runtimes.
Updated the C runtime section.
> If calling `codePointAt` on long strings is slow, perhaps all the code points could be obtained once using the iterator.

I was mistaken about how `codePointAt` works; it's actually fast, as it accepts a string index as an argument (not the codepoint index).
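For reference, a small demonstration of that behavior (plain ES6, nothing Waxeye-specific):

```js
const s = 'a😀b'; // '😀' (U+1F600) occupies two UTF-16 code units

// The argument is a UTF-16 code unit index, so the lookup is O(1):
s.codePointAt(0).toString(16); // '61'    — 'a'
s.codePointAt(1).toString(16); // '1f600' — reads the whole surrogate pair
s.codePointAt(3).toString(16); // '62'    — 'b' is at code unit index 3, not 2
```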
> I think Unicode properties and their combinations can be expressed with grammar symbols. -- @adabru

> Since Waxeye has modular grammars, Unicode properties could be defined in one or more sub-grammars, and reused by other grammars when needed. -- @orlandohill
The reason I originally suggested built-in expressions for this is that these non-terminal definitions would be huge. There is well over 1MiB of codepoints for just the `Letter` definition.
After giving some more thought to this, I think there is a way to make this work.
For languages not "close to the metal", testing this many characters directly would be very slow, and would bloat the output massively. For these languages, we can convert all character classes to regular expressions.
These regular expressions will still be a bit large. The optimized `Letter` regex in JavaScript is 7KiB in ES5, and still almost as large in ES6 (which fully supports Unicode escapes for characters outside of the Basic Multilingual Plane).
Eventually, perhaps the generator for a given language could even generate a `\p{Letter}` when optimizing its output.
For C, none of the above is a problem. Actually, this allows us to support C without requiring any dependencies like ICU.
Basic idea: Only support UTF-8 and parse it manually (it's easy).
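To illustrate the "it's easy" claim, here's a sketch of the decoding step (written in JavaScript for readability; the C version is the same shifts and masks, and a real decoder must also validate continuation bytes and reject overlong forms):

```js
// Decode one code point from UTF-8 `bytes` starting at index `i`.
// Returns [codePoint, nextIndex]. Validation omitted for brevity.
function decodeUtf8(bytes, i) {
  const b = bytes[i];
  if (b < 0x80) return [b, i + 1];                  // 0xxxxxxx
  if (b < 0xe0) return [(b & 0x1f) << 6 |           // 110xxxxx 10xxxxxx
                        (bytes[i + 1] & 0x3f), i + 2];
  if (b < 0xf0) return [(b & 0x0f) << 12 |          // 1110xxxx 10xxxxxx 10xxxxxx
                        (bytes[i + 1] & 0x3f) << 6 |
                        (bytes[i + 2] & 0x3f), i + 3];
  return [(b & 0x07) << 18 |                        // 11110xxx + three continuation bytes
          (bytes[i + 1] & 0x3f) << 12 |
          (bytes[i + 2] & 0x3f) << 6 |
          (bytes[i + 3] & 0x3f), i + 4];
}

decodeUtf8([0xe2, 0x82, 0xac], 0); // => [0x20ac, 3] — '€'
```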
We do need to optimize character class matching in C as well, or we'll get 10MiB parser.c files.
Luckily, UTF-8 is a length-limited prefix-free code, so it is possible to generate the optimal "matching strategy" with a bit of effort.
A proof of concept of this (done by @jishi9) takes the 10MiB parser.c file to < 100KiB, and that's not even applying all the possible optimizations, so I'm optimistic about going with this.
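For a rough idea of one possible matching strategy (an illustration only, not necessarily what the proof of concept does): store each large class as sorted, disjoint codepoint ranges and binary-search them, instead of emitting one comparison per codepoint.

```js
// `ranges` is a flat sorted array of inclusive pairs: [lo0, hi0, lo1, hi1, ...].
function inClass(ranges, cp) {
  let lo = 0;
  let hi = ranges.length / 2 - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (cp < ranges[2 * mid]) hi = mid - 1;
    else if (cp > ranges[2 * mid + 1]) lo = mid + 1;
    else return true;
  }
  return false;
}

inClass([0x41, 0x5a, 0x61, 0x7a], 0x63); // => true ([A-Za-z] contains 'c')
```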
> For languages not "close to the metal", testing this many characters directly would be very slow, and would bloat the output massively. For these languages, we can convert all character classes to regular expressions.
Regular expression substitution for character classes: sequences, alternations, and/not predicates, etc. can also be converted to regexes where (part of) the children are regex-substitutable. When optimizing a Waxeye-like parser I wrote in JavaScript, substitution with regexes (https://github.com/adabru/adabru-parser/blob/7b800a89390054c148ef9ed3f3023ed04ae19da6/abpv1.ls#L337-L380) yielded the best optimization effect (~30% faster) among my other efforts. I agree that converting character classes to regexes will yield a performance benefit.
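For the JavaScript backend, the generated matcher for a character class could look something like this (a sketch assuming ES6 sticky regexes; the names are illustrative, and this is not the adabru-parser code):

```js
// Generated once per character class. The `y` (sticky) flag anchors the
// match at lastIndex; `u` makes the regex operate on code points.
const letterClass = /[A-Za-zα-ω]/uy; // stand-in for a full, generated Letter class

// Try to match `re` at `pos`; return the position after the match, or -1.
function matchClass(re, input, pos) {
  re.lastIndex = pos;
  return re.exec(input) ? re.lastIndex : -1;
}

matchClass(letterClass, 'αβγ', 0); // => 1
matchClass(letterClass, '123', 0); // => -1
```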
> With Unicode properties defined as non-terminals, could the intersection and subtraction examples be rewritten as follows?
>
> &Letter &ScriptCyrillic .
> &Letter !ScriptLatin .
Not quite, since a Unicode character can be between 1 and 4 bytes in UTF-8 - see the table at: https://en.wikipedia.org/wiki/UTF-8#Description
It's easy to write a rule matching a code point though, and then it would look something like this:
&Letter &ScriptCyrillic CodePoint
&Letter !ScriptLatin CodePoint
Or I suppose it could even be written as:
&ScriptCyrillic Letter
!ScriptLatin Letter
@jishi9 The new grammar describes Unicode codepoints, not UTF-8 bytes, so it is possible.
Closing this and opening separate issues for the individual action items as discussed.
This RFC proposes new Waxeye grammar to support Unicode character escapes, ranges, and classes.
It also discusses possible implementations for C and JavaScript.
Context
Currently (prior to this RFC), Unicode support in the grammar is limited and language-dependent. For example, specifying a Unicode character and then compiling to C will result in invalid C code, and compiling to JavaScript will only result in valid code if the Unicode character is < 3 bytes wide. No grammar exists for specifying Unicode character escapes, ranges, and classes. Sequences of single-byte characters can be used to emulate Unicode character escapes, but they only work in C (see #30).
This RFC's goal is to make specifying encoding-independent Unicode in the grammar possible and easy.
Unicode in the grammar
Unicode characters in the grammar are supported, e.g.

GreekVar <- [α-ο]

is now valid. The notion of a "character" / "unit" in the grammar, such as what is matched by `.`, is now defined as a single Unicode codepoint.

Unicode character classes
Update: We decided not to support `\p` Unicode property testing. Instead, these will be expressed like regular non-terminals in the grammar, and the parser generator will be optimized to support large character classes efficiently.

The proposed syntax is a subset of the ICU RegExp syntax.
Backwards-incompatible grammar changes:

- The backslash character (`\`) must be escaped within the grammar.
- `\<7F>`-style escapes are now `\x{7F}` (case-insensitive, `\x{7f}` is also valid). The `\<7F>`-style escapes are no longer supported.

Supported character class features
The grammar is extended to support some Unicode character escapes and character classes.
- `\x{hhhh}` matches the code point with hex value `hhhh`. From one to six hex digits may be supplied.
- `\t` is equivalent to `\x{0009}`.
- `\n` is equivalent to `\x{000A}`.
- `\r` is equivalent to `\x{000D}`.
- `\p{UNICODE PROPERTY NAME}` matches any code point with the given Unicode property (dropped per the update above, in favor of regular non-terminals).
Set expressions
Set expressions are supported, except:

- Negation (`^`, `\P`) is not supported. Use regular waxeye negation outside of the character class instead.
- POSIX-style syntax (`[:script=Greek:]`) is not supported.

The following set expressions are supported:
- `[abc]`
- `[A-M]`
- `[\x{0000}-\x{10ffff}]`
- `[\p{Letter}]`
- `[\p{General_Category=Letter}]`
- `[\p{L}]`
- `[\p{numeric_value=9}]`
Stretch goal:

- `[\p{Letter}&&\p{script=cyrillic}]` (intersection)
- `[\p{Letter}--\p{script=latin}]` (subtraction)
Unicode word characters

`\w` is not supported as it is easily confused with `[A-Za-z_]`. To match a Unicode word character, use:

Unicode digits

`\d` is not supported as it is easily confused with `[0-9]`. To match a Unicode digit, use:

Unicode whitespace

`\s` is not supported. To match Unicode whitespace, use:

Unicode newline

`\R` is not supported because it matches a sequence of codepoints. To match a Unicode newline, use:

Case-insensitive matching
Case-insensitive Unicode matching (`"`) uses simple matching. In Unicode lingo, "simple matching" means matching a single character at a time, so e.g. "fußball" does not match "FUSSBALL".
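A quick JavaScript illustration of the difference (full case mapping, which simple matching deliberately avoids, is one-to-many):

```js
'ß'.toUpperCase();       // 'SS' — the full mapping expands one character to two
'fußball'.toUpperCase(); // 'FUSSBALL'
// Simple matching folds one character to one character, so 'ß' never
// compares equal to 'SS', and "fußball" does not match "FUSSBALL".
```

Language support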
Overall: most languages support `\p{General_Category=}` and `\p{script=}`. Some support `\p{block=}`. Few support anything else natively. Support for this per-language should be best-effort.

C
The UTF-8 encoding is rather easy to work with without any dependencies, so the existing runtime can be modified to assume it.
Character classes would require some optimization. We have an idea about this described in a comment below.
JavaScript
Unicode regular expressions are supported since ES6.
However, `\p` character classes are not yet supported (proposal status: https://github.com/tc39/proposal-regexp-unicode-property-escapes).

With this proposal, `\p` escapes can be produced at grammar generation-time using the unicode-10.0.0 repository of Unicode data in JavaScript, together with the regenerate JavaScript Unicode character class generator.

Example:
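Presumably something along these lines (a sketch using the `regenerate` and `unicode-10.0.0` npm packages; the exact build-time code may differ):

```js
const regenerate = require('regenerate');
// Code points with General_Category=Letter, from the Unicode 10.0.0 data.
const letters = require('unicode-10.0.0/General_Category/Letter/code-points.js');

const set = regenerate().add(letters);

// ES5-compatible source (astral code points expanded to surrogate pairs):
const es5Source = set.toString();
// ES6 source using \u{...} escapes; requires the /u flag:
const letterRe = new RegExp(set.toString({ hasUnicodeFlag: true }), 'u');

letterRe.test('𝒜'); // => true (U+1D49C, outside the BMP)
```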
At build time, regular expressions are embedded into the parser using the process above.
The runtime will also need to change to iterate over codepoints instead of characters, because a single Unicode codepoint in JavaScript may be represented using up to two UTF-16 code units.
Luckily, ES6 supports both codepoint iteration and addressing, for example:
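A minimal illustration using only standard ES6 string features:

```js
const s = 'a😀b';

s.length;      // 4 — UTF-16 code units
[...s].length; // 3 — the iterator protocol walks code points

for (const ch of s) {
  // Three iterations: 'a', '😀', 'b' — '😀' arrives as one two-code-unit string.
}

s.codePointAt(1);              // 0x1F600 — reads through the surrogate pair
String.fromCodePoint(0x1f600); // '😀'
```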
The runtime will change to use these slower but correct methods. It may be possible to implement these so that the performance impact is minimal if the parsed string is mostly in the Basic Multilingual Plane.
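One way that might be achieved (an assumption about the implementation, not a committed design): detect up front whether the input contains any surrogate code units, and keep plain indexing when it doesn't.

```js
// If the input has no surrogate code units, every code point is exactly one
// UTF-16 code unit, so ordinary string indexing is already codepoint indexing.
function toCodePointIndexable(s) {
  return /[\uD800-\uDFFF]/.test(s) ? [...s] : s;
}

toCodePointIndexable('ascii αβγ')[6]; // 'α' — fast path, no copy
toCodePointIndexable('a😀b')[1];      // '😀' — slow path via a codepoint array
```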
Other languages
- Java: Supports `script`, `block`, and `General_Category`. Supports `&&`, and `--` can be emulated (using `[a]&&[^b]`). POSIX-like character classes, such as `\p{Space}`, are ASCII-only.
- Ruby: `String` is Unicode-aware by default, `String#each_char` iterates over codepoints, and `String#[]` index is a codepoint index. Supports the following property classes: `script` (only `\p{Arabic}` syntax), `General_Category` (only `\p{L}` shorthand syntax). `--` and `&&` are not supported. POSIX-like character classes, such as `\p{Space}`, are Unicode.
- Languages with no support for `\p` Unicode character classes: not sure what to do here. Fail on new grammar for now.

/cc @orlandohill @ddrone @jishi9 @adabru