tc39 / proposal-regexp-match-indices

ECMAScript RegExp Match Indices
https://arai-a.github.io/ecma262-compare/?pr=1713
BSD 3-Clause "New" or "Revised" License
64 stars 13 forks source link

RegExp Match Indices Stage-3 performance mitigation #47

Closed rbuckton closed 3 years ago

rbuckton commented 3 years ago

Options:

  1. Introduce a flag to opt-in to adding .indices to the result:
    • I to indicate indices (upper-case I does not conflict with lower-case i).
    • d since both i is already used and n means something else in other RegExp engines.
      • NOTE: I intend to propose n at a future meeting.
    • o to indicate offsets, and rename the .indices property to .offsets.
  2. Introduce a new method to RegExp, such as .execIndices.
    • Indicies would not be available to String.prototype.match or String.prototype.matchAll unless we extend String as well.
    • If we add matchIndices and matchIndicesAll to String, we further complicate RegExp subclassing by either needing a @@matchIndices and @@matchIndicesAll or needing to extend the existing @@match and @@matchAll.
  3. Go back to passing a callback or options bag.
    • As with (2), this is a subclassing hazard due to the need to pass the callback or options bag through @@match and @@matchAll.
  4. Continue with current implementations that are executing the RegExp twice.
    • This would be unfortunate for users who want to leverage indices, as they are forced to pay the cost of double execution. However, this is still far cheaper than the complex logic needed to calculate capture group offsets without this feature.
  5. Wait to see if implementation can devise an alternative solution that does not incur the performance penalty.
    • I will leave it to implementors to discuss whether they believe there are other avenues to explore.
  6. Something we haven't discussed?
rbuckton commented 3 years ago

Given that RegExp match indices is at Stage 3, (1) is the simplest and cheapest solution that doesn't require an overhaul of the entire specification. The flag would be easily accessible to RegExpBuiltinExec and only requires several small changes:

To me (2) and (3) seem like a non-starters due to the complexity of the change as well as the related subclassing hazards. (4) also seems acceptable to me, if unfortunate, as I would gladly pay the penalty for re-execution of the RegExp to leverage a capability that doesn't currently exist.

I'm option to other options (6) if there are any that make sense given the current stage of the proposal.

If we choose to go with (1), we still need to bikeshed the flag. Some I've discussed this with are unhappy with the idea of an upper-case RegExp flag, so I may not be the top choice. I indicates that o would likely require changing the property name in the spec from indices to offsets. While that's not necessarily the case, it would be confusing to introduce a flag like o but have it create an indices property (unless o means that .exec returns an array consisting of [start, end] tuples instead of strings, which would again be a major change to the proposal).

If I were to use ranked-choice voting, my top 3 would be:

  1. Option (1) with the d flag.
  2. Option (2) with the I flag.
  3. Option (4), with the chance that option (5) becomes viable and implementations can switch to a more optimized solution in the future.
ljharb commented 3 years ago

Using I will be very confusing and hard to google; i would vastly prefer using a distinct lowercase letter if a flag is the chosen path.

gibson042 commented 3 years ago

"d" and "o" flags both already exist for Perl regular expressions, although even their own documentation denigrates them severely: https://perldoc.perl.org/perlre#Modifiers . None of the flag options seem great to me, though. I would personally be fine with the stay-the-course (4), and unhappy with the symmetry-breaking (2) and (3).

msaboff commented 3 years ago

"d" and "o" flags both already exist for Perl regular expressions, although even their own documentation denigrates them severely: https://perldoc.perl.org/perlre#Modifiers . None of the flag options seem great to me, though. I would personally be fine with the stay-the-course (4), and unhappy with the symmetry-breaking (2) and (3).

Given that Perl perl documentation denigrates both the /d and /o flags, that eliminates any barriers for using them as JavaScript RegExp flags to me. We certainly wouldn't add to JavaScript either /d, the the old, problematic, pre-5.14 Default character set behavior or /o, the compile once flag, as that is an implementation detail.

Option (2) has ergonomic issues for some fonts and (3) would eliminate the use in RegExp literals.

The (4) stay-the-course will likely hinder the uptake of this feature.

rbuckton commented 3 years ago

@ljharb: i is already hard to google without using + or "", as Google by default treats it as a noise word and drops it from the search. However, I'd venture to guess that the cross section of individuals who (a) search for regex i flag and (b) know the ins and outs of search engine syntax is probably fairly large so I'll concur with your point.

@msaboff: A native implementation of (4) is likely still faster than a polyfill version of (4) like in https://github.com/rbuckton/regexp-match-indices, and better than the alternative of not having the feature when you need it. Option (4) at least provides a path for the possibility of a future where (5) might be possible without deprecating a flag.

Overall though, I'm leaning towards (1) using the d flag as that requires the least changes to the specification.

msaboff commented 3 years ago

Overall though, I'm leaning towards d as that requires the least changes to the specification.

I assume you mean Option (1) using d as the flag based on your prior ranking of alternatives.

rbuckton commented 3 years ago

@msaboff: Yes, I am talking about (1) using the d flag. I'll update my comment.

rbuckton commented 3 years ago

I've created #49 as a PR against this proposal to get a better idea of what kind of changes we would need to make in the specification to use Option (1) with a d flag, should we choose that mitigation.