Closed modernserf closed 5 years ago
Hi! I definitely appreciate why the Unicode flag is useful 😊
Moo builds a single RegExp which combines all of the tokens, so the flags effectively have to be the same for all of your tokens.
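To make the constraint concrete, here is a minimal sketch of the "one big RegExp" approach. It is not Moo's actual code: `combine` and `nextToken` are hypothetical helpers, and the sketch assumes the token patterns contain no capture groups of their own.

```javascript
// Hypothetical helper: join token patterns into one alternation,
// with one capture group per token type.
function combine(rules, flags) {
  const names = Object.keys(rules);
  const source = names.map((name) => `(${rules[name].source})`).join('|');
  // Flags are a property of the combined RegExp, so every token shares them.
  return { names, regexp: new RegExp(source, flags) };
}

const { names, regexp } = combine({ word: /[a-z]+/, number: /[0-9]+/ }, 'y');

function nextToken(input, offset) {
  regexp.lastIndex = offset; // the sticky flag makes exec() match at this offset only
  const match = regexp.exec(input);
  if (match === null) return null;
  // The first capture group that participated identifies the token type.
  const index = match.slice(1).findIndex((group) => group !== undefined);
  return { type: names[index], value: match[0] };
}
```

Because only the `source` of each pattern survives into the combined expression, any flag set on an individual token is lost unless it is applied to the whole thing.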
Out of interest, since Babel already supports compiling RegExps with the unicode flag, if you use Babel with Moo I imagine it would "just work"... care to try? :)
> since Babel already supports compiling RegExps with the unicode flag, if you use Babel with Moo I imagine it would "just work"
It depends on what environment you're targeting with Babel. If you're targeting ES5, it works, since it generates a RegExp without the /u flag.
However, if you're targeting environments that support the /u flag (and may or may not support \p property escapes), it doesn't work.
And, of course, it doesn't work without Babel.
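The two levels of support mentioned above can be told apart at runtime. This is a rough sketch of such a check; `compiles` is a hypothetical helper, and the probe patterns are just examples.

```javascript
// Returns true if this engine accepts `new RegExp(source, flags)`.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    // Unsupported flags or syntax raise a SyntaxError at construction time.
    return false;
  }
}

// Engines can support the /u flag without supporting \p{...} escapes.
const supportsUnicodeFlag = compiles('.', 'u');
const supportsPropertyEscapes = compiles('\\p{ID_Start}', 'u');
```

An environment where the first check passes and the second fails is the awkward case: Babel may leave the /u flag alone there, yet the property escapes still need to be rewritten.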
In my branch, I apply the /u flag to the big RegExp if any of the constituent RegExps also use the /u flag. It might be safer to have a more explicit opt-in for Unicode, since it affects how every pattern is interpreted, but I'm not sure what the actual implications of that are.
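Roughly, the "any" rule amounts to something like the following. This is a sketch of the idea rather than the code from the branch; `flagsForAny` is a hypothetical name.

```javascript
// Hypothetical helper: decide the flags of the combined RegExp.
// String tokens are ignored because they carry no flags of their own.
function flagsForAny(patterns) {
  const regexps = patterns.filter((pattern) => pattern instanceof RegExp);
  // A single /u pattern switches the whole combined RegExp to Unicode mode.
  return regexps.some((re) => re.unicode) ? 'u' : '';
}

const mixedFlags = flagsForAny([/\p{ID_Start}/u, '+', /./]);
const plainFlags = flagsForAny([/[a-z]+/, '+', /\s+/]);
```

Here `mixedFlags` is `'u'`, so the `/./` pattern would be reinterpreted in Unicode mode even though it never asked for it.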
If you think this is worth discussing further, I can submit my branch as a PR, and we can continue the discussion there.
I think it makes more sense to enable the u flag only if every constituent regex has the u flag, and to forbid mixing regexes with different flags. E.g., this would work:
moo.compile({
  id: /[$_\p{ID_Start}][$\p{ID_Continue}]*/u,
  plus: '+',
  ws: /\p{White_Space}+/u,
})
and this would work:
moo.compile({
  id: /[$_a-zA-Z][$\w]*/,
  plus: '+',
  ws: /\s+/,
})
But this would not:
moo.compile({
  id: /[$_\p{ID_Start}][$\p{ID_Continue}]*/u,
  plus: '+',
  mostOfBmp: /./,
})
That seems like a good thing, because adding the u flag to the mostOfBmp regex changes its meaning.
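A quick way to see the change in meaning, using a character outside the BMP (the variable names are only for illustration):

```javascript
const astral = '\u{1F60A}'; // 😊: one code point, two UTF-16 code units

// Without /u, `.` consumes a single code unit, i.e. half of the surrogate pair.
const matchesWithoutU = /^.$/.test(astral);
const lengthWithoutU = astral.match(/./)[0].length;

// With /u, `.` consumes the whole code point.
const matchesWithU = /^.$/u.test(astral);
const lengthWithU = astral.match(/./u)[0].length;

console.log(matchesWithoutU, lengthWithoutU); // false 1
console.log(matchesWithU, lengthWithU);       // true 2
```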
Importantly, a string converted to a regular expression does not change its meaning when the u flag is added, so this is a less objectionable feature than adding i if every regular expression has i, since that would also have an effect on tokens expressed as strings.
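The all-or-nothing rule could be sketched like this. `unicodeFlagFor` is a hypothetical helper, not part of Moo's API, and it deliberately skips string tokens for the reason given above.

```javascript
// Hypothetical helper: returns true if the combined RegExp should use /u,
// and throws if the RegExp rules disagree about the flag.
function unicodeFlagFor(rules) {
  const regexps = Object.values(rules).filter((rule) => rule instanceof RegExp);
  const unicodeCount = regexps.filter((re) => re.unicode).length;
  if (unicodeCount !== 0 && unicodeCount !== regexps.length) {
    throw new Error('Cannot mix RegExps with and without the /u flag');
  }
  return unicodeCount > 0;
}

const allUnicode = unicodeFlagFor({
  id: /[$_\p{ID_Start}][$\p{ID_Continue}]*/u,
  plus: '+',
  ws: /\p{White_Space}+/u,
});

const noneUnicode = unicodeFlagFor({
  id: /[$_a-zA-Z][$\w]*/,
  plus: '+',
  ws: /\s+/,
});
```

Passing the third example from above (a /u `id` next to a plain `/./`) would throw instead of silently changing what `/./` matches.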
ES2018 added support for Unicode property escapes. These let you match complex Unicode ranges (e.g. characters valid in identifiers) much more compactly than explicit code point ranges. For example, this regex matches all valid JS identifiers:
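A sketch of what such a pattern can look like, along the lines of the `id` rule quoted elsewhere in this thread. The anchors and the ZWNJ/ZWJ entries are assumptions added here for illustration; `\uXXXX` escapes inside identifiers and reserved words are not handled.

```javascript
// Start: `$`, `_`, or any ID_Start code point.
// Continue: `$`, ZWNJ (U+200C), ZWJ (U+200D), or any ID_Continue code point.
const identifier = /^[$_\p{ID_Start}][$\u200c\u200d\p{ID_Continue}]*$/u;

console.log(identifier.test('naïve')); // true
console.log(identifier.test('$el'));   // true
console.log(identifier.test('1abc'));  // false: digits cannot start an identifier
console.log(identifier.test('a-b'));   // false: `-` is not an identifier character
```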
Compare with the regex used by acorn:
https://github.com/acornjs/acorn/blob/2ffed00236071aece0a79813b98c36f302ff1f9d/acorn/src/identifier.js#L22-L31
However, this requires the /u flag, which is currently forbidden: https://github.com/no-context/moo/blob/13e115756d667f7d41204a04c396ea8a29a4342d/moo.js#L44-L49
I presume the /u flag was disabled because it added complexity to the implementation but (previously) had no significant advantages; however, I believe that these new property escapes would make proper Unicode support in grammars built with moo dramatically simpler.
It has pretty good support in current browsers and with Babel. I have no idea what the performance implications of the /u flag are, but I would expect that support could be implemented as purely opt-in.