kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

Enable the RegExp `u` flag #631

Open markandrus opened 1 year ago

markandrus commented 1 year ago

I ran into a problem very similar to the one described in https://github.com/kach/nearley/issues/543. I'm trying to adapt some of the grammars from the ECMAScript standard. For example, here is (part of) the grammar for IdentifierName:

# https://tc39.es/ecma262/#sec-identifier-names

IdentifierName  -> IdentifierStart               {% id %}
                |  IdentifierName IdentifierPart {% xs => xs.join('') %}
IdentifierStart -> IdentifierStartChar           {% id %}
IdentifierPart  -> IdentifierPartChar            {% id %}

IdentifierStartChar -> UnicodeIDStart    {% id %}
                    |  "$"               {% id %}
                    |  "_"               {% id %}
IdentifierPartChar  -> UnicodeIDContinue {% id %}
                    |  "$"               {% id %}
                    |  ZWNJ              {% id %}
                    |  ZWJ               {% id %}

ZWNJ -> "\u200C" {% id %}
ZWJ  -> "\u200D" {% id %}

UnicodeIDStart    -> [\p{ID_Start}]    {% id %}
UnicodeIDContinue -> [\p{ID_Continue}] {% id %}

Crucially, UnicodeIDStart and UnicodeIDContinue are defined in terms of Unicode properties. We need the \p{ID_Start} and \p{ID_Continue} syntax to work in the RegExp-based character classes, and JavaScript only recognizes that syntax when the u flag is enabled.
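To illustrate why the flag matters, here is a minimal sketch in plain JavaScript, independent of nearley. The variable names are made up for illustration, and it assumes a runtime that supports Unicode property escapes. Without the u flag, \p is treated as an escaped literal "p", so the character class matches the individual characters p, {, I, D, and so on instead of the Unicode property:

```javascript
// Illustrative only: plain JavaScript RegExp behavior, not nearley internals.
// The names below are hypothetical.

// Without `u`, `\p` is just an escaped "p", so this class matches any of the
// literal characters p { I D _ S t a r }.
const withoutU = /[\p{ID_Start}]/;

// With `u`, `\p{ID_Start}` is a Unicode property escape.
const withU = /[\p{ID_Start}]/u;
const continueWithU = /[\p{ID_Continue}]/u;

console.log(withoutU.test('é')); // false: "é" is not one of the literal characters
console.log(withoutU.test('{')); // true: "{" is matched literally
console.log(withU.test('é'));    // true: "é" has the ID_Start property
console.log(withU.test('{'));    // false: "{" does not have the ID_Start property

// Digits may continue an identifier but may not start one.
console.log(withU.test('1'));         // false
console.log(continueWithU.test('1')); // true
```

So the grammar above compiles without the flag, but the two character classes silently match the wrong characters.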

I'm a very new user of Nearley, so I don't know whether it's safe to turn this on for everyone, whether it should be opt-in, or whether it could cause other problems. What do you think? Is this useful?