Chunk unicode similar to spread syntax and less like regex

frostburn commented 6 months ago

The way Unicode code points are chunked in Peggy patterns doesn't follow visual chunking: a character outside the BMP, such as 𝄪, is split into its two UTF-16 surrogates.

Example grammar:

sharp.peggy

SharpAccidental = [#x♯𝄪]

Unexpected result:

$ npx peggy sharp.peggy -t 'x'
'x'
$ npx peggy sharp.peggy -t '𝄪'
Error running test
Error: Expected end of input but "�" found.
 --> command line:1:2
  |
1 | 𝄪
  |  ^

Reasonable behavior in node using spread syntax:

> [...'#x♯𝄪']
[ '#', 'x', '♯', '𝄪' ]
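
For context, the difference comes from JavaScript strings being sequences of UTF-16 code units: 𝄪 (U+1D12A) is stored as the surrogate pair \uD834\uDD2A, while spread syntax iterates by code point. A minimal sketch of the two views of the same string (the variable names are illustrative only):

```javascript
const sharps = '#x♯𝄪';

// split('') and .length work on UTF-16 code units,
// so the double sharp is counted as two units.
const codeUnits = sharps.split('');

// Spread uses the string iterator, which walks code points.
const codePoints = [...sharps];

console.log(codeUnits.length);  // 5
console.log(codePoints.length); // 4
console.log(codePoints[3].codePointAt(0).toString(16)); // 1d12a
```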

hildjj commented 6 months ago

I think the actionable part here is allowing non-BMP characters in a character class to work as expected. The current behavior is useless, so backward-compatibility with older Peggy/peg.js versions is not needed. The rule given above gets generated with:

var peg$r0 = /^[#x\u266F\uD834\uDD2A]/;
var peg$e0 = peg$classExpectation(["#", "x", "\u266F", "\uD834", "\uDD2A"], false, false);

Both of which are wrong. This should generate:

var peg$r0 = /^[#x\u266F\uD834\uDD2A]/u; // JS backward-compat issue!
var peg$e0 = peg$classExpectation(["#", "x", "\u266F", "\uD834\uDD2A"], false, false);

or:

var peg$r0 = /^[#x\u266F\u{1d12a}]/u; // JS backward-compat issue!
var peg$e0 = peg$classExpectation(["#", "x", "\u266F", "\u{1d12a}"], false, false);

(see: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode)
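
A quick way to see the difference in plain JavaScript, independent of Peggy's generated code (the variable names here are just for illustration):

```javascript
const input = '\u{1D12A}'; // 𝄪, stored as the surrogate pair \uD834\uDD2A

// Without the u flag the class contains two separate surrogates,
// so only the first half of the character is consumed.
const withoutU = /^[#x\u266F\uD834\uDD2A]/.exec(input);

// With the u flag the surrogate pair is treated as one code point.
const withU = /^[#x\u266F\uD834\uDD2A]/u.exec(input);
const withEscape = /^[#x\u266F\u{1d12a}]/u.exec(input);

console.log(withoutU[0].length);      // 1 (a lone high surrogate)
console.log(withU[0] === input);      // true
console.log(withEscape[0] === input); // true
```

The lone surrogate left over in the first case is what shows up as "�" in the error message above.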

Alternately, this could be automatically turned into the following rule:

SharpAccidental = [#x♯] / "𝄪"
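
A rough sketch of what that rewrite could look like, assuming the class members are available as a plain string. splitClass and rewriteClass are hypothetical helpers, not Peggy APIs, and the sketch ignores ranges, inverted classes and escaping of characters like "]":

```javascript
// Hypothetical helper, not part of Peggy: split the members of a character
// class into BMP characters and characters outside the BMP.
function splitClass(members) {
  const bmp = [];
  const astral = [];
  for (const ch of members) { // for...of iterates by code point
    (ch.codePointAt(0) > 0xFFFF ? astral : bmp).push(ch);
  }
  return { bmp, astral };
}

// Render the equivalent rule body as grammar text:
// one class for the BMP members, one string literal per astral member.
function rewriteClass(members) {
  const { bmp, astral } = splitClass(members);
  const parts = [];
  if (bmp.length > 0) parts.push(`[${bmp.join('')}]`);
  for (const ch of astral) parts.push(JSON.stringify(ch));
  return parts.join(' / ');
}

console.log(rewriteClass('#x♯𝄪')); // [#x♯] / "𝄪"
```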

If we are dropping IE11 support in 4.0, the first solution is probably better, but we should only put a /u at the end of regular expressions that need it; there may be performance changes at the least.
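
As a sketch of the "only when needed" idea, assuming the generator has the class members as a string: escapeMember and classRegExp below are hypothetical names, not the actual Peggy code generator, and ranges and inverted classes are left out.

```javascript
// Hypothetical helper: escape one class member as \uXXXX (BMP)
// or \u{X...} (outside the BMP, only valid with the u flag).
function escapeMember(ch) {
  const cp = ch.codePointAt(0);
  const hex = cp.toString(16).toUpperCase();
  return cp > 0xFFFF ? `\\u{${hex}}` : `\\u${hex.padStart(4, '0')}`;
}

// Build the anchored class regex, adding the u flag only if some member
// lies outside the BMP.
function classRegExp(members) {
  const chars = [...members]; // split by code point
  const flags = chars.some(ch => ch.codePointAt(0) > 0xFFFF) ? 'u' : '';
  return new RegExp(`^[${chars.map(escapeMember).join('')}]`, flags);
}

const plain = classRegExp('#x♯');
const astral = classRegExp('#x♯𝄪');

console.log(plain.flags);                 // ""
console.log(astral.flags);                // "u"
console.log(astral.exec('𝄪')[0] === '𝄪'); // true
```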

reverofevil commented 6 months ago

I'm not sure that a conditional /u won't lead to unexpected behavior. IMO it should be a generator option, and as such it can already be supported. I'd rather change to /u by default in 4.0.

hildjj commented 6 months ago

Let's talk about browser support in #463. Let's also do some benchmarking before making a final decision on the implementation approach.

peggyjs / peggy

Chunk unicode similar to spread syntax and less like regex #462