Allow use of Unicode property escapes to match a character

mojavelinux commented 1 year ago

In a character expression (e.g., [a-z]), I would like to be able to use a Unicode property escape (i.e., Unicode Character Category) to express the group of characters to match. The reason for this request is to parse input that contains reserved syntax that's not limited to the ASCII character set.

For example, I could define a rule to match any alpha character as defined by Unicode using the following parsing expression:

alpha = [\p{Alpha}]

I would only expect the property escape to be passed through to the underlying regular expression. Peggy would just need to allow the \p{...} and \P{...} sequences to be used inside the square brackets of a character expression in the grammar file. Additionally, the "u" flag must be added to the regular expression.
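To illustrate why the "u" flag matters, here is a small sketch in plain JavaScript (no Peggy involved). It assumes an engine that supports Unicode property escapes. Without the flag, \p is treated as an ordinary escaped "p", so the class silently matches the literal characters instead.

```javascript
// With the "u" flag, \p{Alpha} is a Unicode property escape.
const withFlag = /^[\p{Alpha}]$/u
// Without it, \p is just an escaped "p", so the class matches
// the literal characters p, {, A, l, h, a and }.
const withoutFlag = /^[\p{Alpha}]$/

console.log(withFlag.test('\u00e9'))    // true: U+00E9 is alphabetic
console.log(withFlag.test('{'))         // false
console.log(withoutFlag.test('\u00e9')) // false
console.log(withoutFlag.test('{'))      // true: matched as a literal
```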

In fact, we see that even peggy's own grammar language has such a need: https://github.com/peggyjs/peggy/blob/main/src/parser.pegjs#L476C2-L530 While I'm not suggesting that Peggy itself use these escape sequences, it would be beneficial for users of Peggy to be able to make use of them, certainly more reasonable than having to maintain all those categories.

mojavelinux commented 1 year ago

I have been able to patch in support for Unicode property escapes using a Peggy plugin. Here's the quick and dirty code to do that:

'use strict'

function rewriteRegExps (node) {
  const children = node.children
  for (const [idx, child] of children.entries()) {
    if (typeof child === 'string') {
      if (child.includes('var peg$r') && child.includes('p{')) {
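        // look for a bare "p{...}" inside a generated regexp literal
        children[idx] = child.replace(/^( *var peg\$r\d+ .*? )(\/.*p\{.+?\}.*\/)(;.*)/gm, (match, before, rx, after) => {
          // add the backslash (unless one is already there, hence the negative lookbehind) and the "u" flag
          return before + rx.replace(/(?<!\\)p\{.+?\}/g, '\\$&') + 'u' + after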
        })
        break
      }
    } else {
      rewriteRegExps(child)
    }
  }
}

module.exports = {
  use (config, options) {
    config.passes.generate.push((ast) => {
      rewriteRegExps(ast.code)
    })
  }
}
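To make the effect of the substitution concrete, here is a sketch that applies the same kind of rewrite to a single line of source text, using a negative lookbehind to skip sequences that already have a backslash. The sample line is an assumption about what the generated parser code looks like; the exact shape Peggy emits may differ between versions.

```javascript
// Hypothetical line of generated parser source (assumed shape, for illustration only).
const generated = '  var peg$r0 = /^[p{Alpha}]/;'

const rewritten = generated.replace(
  /^( *var peg\$r\d+ .*? )(\/.*p\{.+?\}.*\/)(;.*)/gm,
  (match, before, rx, after) =>
    // Put a backslash in front of "p{...}" unless one is already there, then append the "u" flag.
    before + rx.replace(/(?<!\\)p\{.+?\}/g, '\\$&') + 'u' + after
)

console.log(rewritten) // prints:   var peg$r0 = /^[\p{Alpha}]/u;
```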

hildjj commented 1 year ago

As a quick workaround, you can use:

alpha = char:. &{ return char.match(/^\p{Alpha}$/u) }

mojavelinux commented 1 year ago

I actually prefer the workaround using a plugin; plugins are quite a nice feature to tap into for workarounds like this. Since these character classes show up all over the grammar, using semantic predicates simply makes the grammar too difficult to read.

hildjj commented 1 year ago

See #378. If we can generate good modern code for people that want it, I'm much more interested in taking this functionality into the core of Peggy.

mojavelinux commented 1 year ago

I'm left scratching my head trying to figure out what your last comment is referring to. In case I caused confusion, I wasn't suggesting that my plugin be accepted into the core of Peggy. I was just saying I think it's a cleaner approach as a workaround in the interim.

What I'm requesting is for the Peggy grammar parser to permit Unicode property escapes in a character expression. We know that the grammar already accepts escapes for certain literals such as \n, Unicode escapes like \u00a0, and ranges like a-z. What I'm proposing is to extend that to Unicode property escapes, which are far more powerful and more concise (the Peggy grammar being the case in point).

hildjj commented 1 year ago

I understand what you want, and I want it too. In order to use Unicode escapes, you have to have a late enough JS implementation that supports them. That's going to cause us some backward-compatibility work.

mojavelinux commented 1 year ago

Cool. Sounds like we're on the same page.

Regarding backward compatibility, what I'm thinking is that if you use them, that's an indication that you want them. I don't think there's any expectation that the parser will work on a version of JS/Node.js that doesn't support them. Trying to put in a shim would be an overreach.

hildjj commented 1 year ago

What if it's a warning, unless you're doing output type "es"?

mojavelinux commented 1 year ago

That wouldn't be ideal for me since I use Node.js 16/18 with commonjs. The transition to es has been too bumpy in my view and so I stick with the commonjs format.

A possible compromise would be a compliance setting, something akin to what eslint does. That way, there's a mechanism to communicate to the compiler that it can use/permit certain ECMAScript features. Something like "--compliance-level=es5" or whatever.

Having said that, every modern browser and active Node version supports Unicode property escapes in regular expressions. So I caution against overthinking this.
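For anyone who does need to check at runtime, a minimal sketch of feature detection follows. This is not something Peggy does; the function name is made up for illustration. It relies on the RegExp constructor throwing a SyntaxError when the engine rejects \p{...} under the "u" flag (or rejects the flag itself).

```javascript
// Hypothetical helper: reports whether this engine accepts Unicode property escapes.
function supportsUnicodePropertyEscapes () {
  try {
    // Built from a string so that older engines fail here at runtime
    // rather than when the file is parsed.
    return new RegExp('^\\p{Alpha}$', 'u').test('a')
  } catch (err) {
    return false
  }
}

console.log(supportsUnicodePropertyEscapes())
```

Building the pattern from a string matters: a regex literal containing an unsupported escape would stop the whole file from loading before the check could run.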

hildjj commented 1 year ago

Nod, solid argument. Thinking some more.

reverofevil commented 1 year ago

> I actually prefer the workaround using a plugin

I just wanted to point out that the regular expression language is not regular, and thus cannot be parsed with regular expressions. \p{, \\p{ and \\\p{ mean different things depending on the number of backslashes, and the only correct way to do that transform is to actually add Unicode property escapes to Peggy's grammar.
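A small sketch of the backslash problem, using a made-up helper name. Whether "p{" is the start of a property escape depends on whether the number of backslashes in front of it is odd or even, which a single-character lookbehind cannot tell apart. This only illustrates the counting issue; it does not attempt to handle the rest of the regular expression syntax.

```javascript
// Hypothetical helper: true when the "p{" at index idx is preceded by an odd
// number of backslashes, i.e. it is already written as an escape (\p{).
function isEscapedAt (source, idx) {
  let count = 0
  for (let i = idx - 1; i >= 0 && source[i] === '\\'; i--) count++
  return count % 2 === 1
}

// One, two and three backslashes before "p{" (doubled here by JS string escaping).
const samples = ['\\p{L}', '\\\\p{L}', '\\\\\\p{L}']
const byCounting = samples.map((s) => isEscapedAt(s, s.indexOf('p{')))
// A one-character lookbehind only sees "some backslash" and treats all three alike.
const byLookbehind = samples.map((s) => !/(?<!\\)p\{/.test(s))

console.log(byCounting)   // [ true, false, true ]
console.log(byLookbehind) // [ true, true, true ]
```

The middle sample is an escaped backslash followed by a plain "p{", which the lookbehind wrongly classifies as already escaped.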

> In order to use Unicode escapes, you have to have a late enough JS implementation that supports them.

This is another case of "it's on the codegen side", and for the reasons already mentioned I'd rather not think too hard about checking this stuff right now, even for Peggy's own JS codegen.

peggyjs / peggy

Allow use of Unicode property escapes to match a character #375