Support for Unicode property escapes (and `/u` flag)

no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

BSD 3-Clause "New" or "Revised" License


Support for Unicode property escapes (and `/u` flag) #116

Closed modernserf closed 5 years ago

modernserf commented 5 years ago

ES2018 added support for Unicode property escapes. These let you match complex Unicode ranges (e.g. the characters valid in identifiers) much more compactly than explicit code point ranges. For example, this regex matches valid JS identifiers (Unicode escape sequences aside):

let re = /[$_\p{ID_Start}][$\p{ID_Continue}]*/u
let foo = re.test("foo")
let $π123 = re.test("$π123")
let ভরা = re.test("ভরা")

Compare with the regex used by acorn:

https://github.com/acornjs/acorn/blob/2ffed00236071aece0a79813b98c36f302ff1f9d/acorn/src/identifier.js#L22-L31

However, this requires the /u flag, which is currently forbidden: https://github.com/no-context/moo/blob/13e115756d667f7d41204a04c396ea8a29a4342d/moo.js#L44-L49

I presume the /u flag was disabled because it added complexity to the implementation but (previously) had no significant advantages. However, I believe these new property escapes would make proper Unicode support in grammars built with moo dramatically simpler.

Property escapes are well supported in current browsers and through Babel. I have no idea what the performance implications of the /u flag are, but I would expect that support could be implemented as purely opt-in.
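As a rough sketch of what "opt-in" could look like on the user's side, the snippet below checks at runtime whether the engine accepts the `u` flag together with `\p{...}`. The helper name `supportsUnicodePropertyEscapes` and the ASCII fallback pattern are assumptions made for illustration; neither is part of moo.

```javascript
// Hypothetical helper (not part of moo): detect whether this engine
// accepts the `u` flag together with \p{...} property escapes.
function supportsUnicodePropertyEscapes() {
  try {
    // Built from a string so that older parsers do not reject the whole file.
    new RegExp("\\p{ID_Start}", "u");
    return true;
  } catch (e) {
    return false;
  }
}

// Pick the identifier pattern at runtime. The ASCII-only fallback is an
// assumption for illustration and matches far fewer identifiers.
const identifier = supportsUnicodePropertyEscapes()
  ? new RegExp("[$_\\p{ID_Start}][$\\p{ID_Continue}]*", "u")
  : /[$_a-zA-Z][$\w]*/;

console.log(identifier.test("foo")); // true with either pattern
```

Constructing the pattern with `new RegExp` rather than a literal keeps the source file parseable on engines that do not know the syntax.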

tjvr commented 5 years ago

Hi! I definitely appreciate why the Unicode flag is useful 😊

Moo builds a single RegExp which combines all of the tokens, so the flags effectively have to be the same for all of your tokens.
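To illustrate why the flags end up shared, here is a deliberately simplified sketch of combining several rules into one sticky RegExp. It is not moo's actual implementation: `combine` and `nextToken` are hypothetical names, and the sketch assumes the individual rules contain no capture groups of their own.

```javascript
// Simplified illustration only; moo's real compiler handles much more.
// Assumes the individual rules contain no capture groups of their own.
function combine(rules, flags) {
  const names = Object.keys(rules);
  // One capture group per rule; whichever group matched identifies the token type.
  const source = names.map((name) => "(" + rules[name].source + ")").join("|");
  // `y` (sticky) anchors each match at lastIndex. `flags` necessarily applies
  // to every rule at once, because there is only one RegExp object.
  return { names, re: new RegExp(source, "y" + flags) };
}

function nextToken(lexer, input, offset) {
  lexer.re.lastIndex = offset;
  const m = lexer.re.exec(input);
  if (m === null) return null;
  // Find the index of the top-level group that participated in the match.
  const i = m.slice(1).findIndex((g) => g !== undefined);
  return { type: lexer.names[i], value: m[0] };
}

const demo = combine({ id: /[a-z]+/, ws: /\s+/, plus: /\+/ }, "");
console.log(nextToken(demo, "a + b", 0)); // { type: 'id', value: 'a' }
console.log(nextToken(demo, "a + b", 2)); // { type: 'plus', value: '+' }
```

Only the `source` of each rule survives the merge, so any flag on an individual rule is either dropped or has to be promoted to the combined RegExp.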

Out of interest, since Babel already supports compiling RegExps with the unicode flag, if you use Babel with Moo I imagine it would "just work"... care to try? :)

modernserf commented 5 years ago

> since Babel already supports compiling RegExps with the unicode flag, if you use Babel with Moo I imagine it would "just work"

It depends on what environment you're targeting with Babel. If you're targeting ES5, it works, since Babel generates a RegExp without the /u flag.

However, if you're targeting environments that support the /u flag (and may or may not support \p property escapes), it doesn't work: the flag is left in place and Moo rejects it.

And, of course, it doesn't work without Babel.

In my branch, I apply the /u flag to the big RegExp if any of the constituent RegExps use the /u flag. It might be safer to have a more explicit opt-in for Unicode mode, since the flag affects how every pattern is interpreted, but I'm not sure what the actual implications of that are.
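The "any rule opts in" strategy described here might look roughly like the following. This is a sketch written for illustration, not the code from the branch; `flagsIfAnyUnicode` is a hypothetical name.

```javascript
// Hypothetical sketch of the "any rule opts in" strategy; not code from the branch.
function flagsIfAnyUnicode(patterns) {
  // RegExp.prototype.unicode is true when a RegExp was created with the `u` flag.
  const anyUnicode = patterns.some((p) => p instanceof RegExp && p.unicode);
  return anyUnicode ? "yu" : "y";
}

const combinedFlags = flagsIfAnyUnicode([/\p{ID_Start}+/u, /./, "+"]);
console.log(combinedFlags); // "yu": the plain /./ rule would now run in Unicode mode too
```

The last line shows the concern raised above: a rule written without `u` is silently reinterpreted once any other rule opts in.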

If you think this is worth discussing further, I can submit my branch as a PR, and we can continue the discussion there.

nathan commented 5 years ago

I think it makes more sense to enable the u flag if every constituent regex has the u flag and to forbid mixing regexes with different flags. E.g., this would work:

moo.compile({
  id: /[$_\p{ID_Start}][$\p{ID_Continue}]*/u,
  plus: '+',
  ws: /\p{White_Space}+/u, // JS accepts White_Space (alias: space), not WSpace
})

and this would work:

moo.compile({
  id: /[$_a-zA-Z][$\w]*/,
  plus: '+',
  ws: /\s+/,
})

But this would not:

moo.compile({
  id: /[$_\p{ID_Start}][$\p{ID_Continue}]*/u,
  plus: '+',
  mostOfBmp: /./,
})

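That seems like a good thing, because adding the u flag to the mostOfBmp regex changes its meaning: without u, `.` matches a single UTF-16 code unit, whereas with u it matches a whole code point, including characters outside the BMP.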
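A small demonstration of that change in meaning, using a character outside the BMP (U+1F404, the cow emoji from the project description):

```javascript
const astral = "\u{1F404}"; // one code point, stored as two UTF-16 code units

// Without `u`, `.` consumes a single code unit (here, a lone surrogate).
console.log(astral.match(/./)[0].length); // 1
// With `u`, `.` consumes the whole code point.
console.log(astral.match(/./u)[0].length); // 2

console.log(/^.$/.test(astral)); // false
console.log(/^.$/u.test(astral)); // true
```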

Importantly, a string converted to a regular expression does not change its meaning when the u flag is added. That makes this less objectionable than, say, enabling i whenever every regular expression has i, since i would also affect tokens expressed as strings.
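The all-or-nothing rule could be checked along these lines. This is only a sketch of the proposal, assuming rules are given as a plain object of strings and RegExps; `unicodeFlagFor` is a hypothetical name and not moo's API.

```javascript
// Hypothetical validation sketching the proposed rule; not moo's actual code.
function unicodeFlagFor(rules) {
  // String literals are skipped: their meaning does not depend on `u`.
  const regexes = Object.values(rules).filter((r) => r instanceof RegExp);
  const withU = regexes.filter((r) => r.unicode).length;
  if (withU !== 0 && withU !== regexes.length) {
    throw new Error("If one rule uses the /u flag, every RegExp rule must use it");
  }
  return withU > 0;
}

// All RegExp rules use `u`: the combined RegExp gets the flag.
console.log(unicodeFlagFor({
  id: /[$_\p{ID_Start}][$\p{ID_Continue}]*/u,
  plus: "+",
  ws: /\p{White_Space}+/u,
})); // true

// No rule uses `u`: the combined RegExp stays as it is today.
console.log(unicodeFlagFor({ id: /[$_a-zA-Z][$\w]*/, plus: "+", ws: /\s+/ })); // false
```

Mixing flagged and unflagged RegExps, as in the third example above, would throw instead of silently reinterpreting `/./`.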

tjvr commented 5 years ago

I agree with Nathan; I was going to suggest the same thing. If you'd like to open a PR for this, that would be great :)
