sebastienros / esprima-dotnet

Esprima .NET (BSD license) is a .NET port of the esprima.org project. It is a standard-compliant ECMAScript parser (also popularly known as JavaScript).
BSD 3-Clause "New" or "Revised" License
425 stars 75 forks source link

Fix capturing group numbering and a few more issues in regex parsing #392

Closed adams85 closed 1 year ago

adams85 commented 1 year ago

The JS regex engine assigns numbers to capturing groups sequentially (regardless of the group being named or not named) but .NET uses a different, weird approach:

[...] Captures that use parentheses are numbered automatically from left to right based on the order of the opening parentheses in the regular expression, starting from 1. However, named capture groups are always ordered last, after non-named capture groups. [...]

This could totally mess up numbered backreferences and replace pattern references. (See also https://github.com/sebastienros/jint/pull/1603#issuecomment-1658140789). So, as a workaround, we wrap all named capturing groups in a non-named capturing group to force .NET to include all capturing groups in the resulting match in the expected order. (For example /.(?<a>a)(.)\1/ will be converted to @".((?<a>a))(.)\1", which make it work as expected).

Of course, this won't prevent named groups from being listed after the numbered ones but we can't really do anything about that other than returning the actual count of groups (RegExpParseResult.ActualRegexGroupCount). Using this information, consumers can discard the irrelevant groups. We also provide a method (RegExpParseResult.GetRegexGroupName) for querying the group name by number, which can be used as a replacement for Regex.GroupNameFromNumber.

This means that we can no longer get away with storing a plain Regex object in the AST or returning it from Scanner.AdaptRegExp. We need to put a wrapper (RegExpParseResult) around the Regex object to be able to store the related information. (RegExpParseResult also allows us to improve error reporting a bit.) How do you feel about this API change? (BTW, I tried to adjust Jint to this change and could do that without major problems. I'll open a PR over there soon so you can evaluate that side of the "equation" as well.)

Also fixed a minor bug related to lone surrogate matching. (Test262 test language/literals/regexp/u-surrogate-pairs-atom-escape-decimal.js will now pass.)

adams85 commented 1 year ago

Thanks for the review!