tc39 / proposal-regexp-match-indices

ECMAScript RegExp Match Indices
https://arai-a.github.io/ecma262-compare/?pr=1713
BSD 3-Clause "New" or "Revised" License
64 stars 13 forks source link

Do RegExp Match Indices have a negative impact on performance? #46

Closed PetaSoft closed 3 years ago

PetaSoft commented 4 years ago

Of cource, RegExp match indices make sense in some use cases as explained in the README.md file, but they are not needed in all use cases of the exec method. If someone iterates over a large text to find all matches for a regular expression I am worried about any performance impacts that can result from the creation and computation of the indices property of the match object.

If there are any serious performance impacts I would prefer to add a parameter to the exec method indicating whether someone wants to have created the indices property. Another solution would be to create a new method like execWithIndices which provides the indices property and to keep the current functionality of the exec method as it is.

ljharb commented 4 years ago

Aren't the indices already needed by the engine in order to return the matches?

PetaSoft commented 4 years ago

@ljharb Even if the match indices are already computed by the engine, an indices property with the corresponding arrays has to be constructed and added to the match object result. I don't know how much time that takes but it obviously takes some time. Furthermore, I don't know the engine internals but I am worried about performace impacts.

ljharb commented 4 years ago

Representatives of every major JS engine are in the room at TC39, and endorsed this proposal for stage 3, meaning they all believe it's implementable in a performant way. Do you have concerns about a specific engine that would be useful to consider?

PetaSoft commented 4 years ago

@ljharb Okay, I see that the decisions have already been made but I think that it is better to know than to believe and it seems that no performace tests have been made. Presumably, there is no problem and the impact of match indices on the performance is very small but I hoped that there would exist conrete performance data. It is the reason for my inital question.

Nevertheless, I would have preferred an overloaded function (i.e. an optional parameter) or a separate method for match objects with match indices how I mentioned in my initial question. In this case it would have no effect at all on performance if someone did not need match indices. It is not my decision. Therefore, I will take what I will get.

ljharb commented 4 years ago

tbh I'm not sure if your assumption holds or not - since this information is gathered by engines during the initial match, it might actually be more expensive to retrieve it later. I trust the browser/engine implementers' judgement here though :-)

msaboff commented 3 years ago

The V8 experience as reported to the committee at the Dec 2019 meeting that they incurred both a memory and performance cost when they implemented this feature. In the end, they decided to re-execute the RegExp on first access to the indices property.

We, the JSC team, have tried two different implementations that confirm with the current stage 3 spec. The first was to materialize the indices and underlying properties. This incurred a 17% performance loss on the Octane regexp benchmark test, the same one that the V8 team used to report their experience.

The second implementation was to save the RegExp and the resulting internally computed indices and lazily fill in the indices property and the underlying property on first access. This actually had worse performance as the property access intercepting code changed the match object so that it couldn't use the tuned array paths throughout the engine.

At this point, I don't want all users of RegExp.match() & RegExp.exec() to pay a performance penalty for a feature they don't use. The V8 team decided to save a copy of the RegExp and rerun the RegExp the first time the indices property is accessed, populating the indices property tree using the results of the second run. I don't think this has zero cost as the RegExp needs to be copied and saved, increasing memory use. It also requires the RegExp to be processed twice for users of this new feature. I don't want to implement a similar copy and rerun approach in JSC.

I coded up a possible alternative of adding a new RegExp flag. I chose n, but any available flag could be used. When a RegExp with this flag is used for a match, the indices property and its children are populate. With this proposed change I saw minimal is any performance regression.

I plan on making this proposed change to the proposal at the next TC-39 meeting.

rbuckton commented 3 years ago

@msaboff I'd be interested in discussing this flag. I don't have an issue with adding a flag to be explicit about returning indices if that is what's needed to avoid overhead for existing cases, but I disagree with using n. The n flag is used by both Perl and .NET to specify explicit captures, and I have a laundry list of additional RegExp features I'd like to propose in the near future that includes the n flag: https://gist.github.com/rbuckton/2f262b5298d4b2031cb7e0d5a1a62e19

We could keep things simple and use I for indices, as that is not a legal flag today. Regardless as to what we end up using, I would recommend we avoid using a flag character that already has a different meaning in another language to avoid confusion and give ourselves room to adopt RegExp functionality common in other languages.

msaboff commented 3 years ago

I posted a slide set to discuss the perf issues the JSC found with varying implementations at the Nov 2020 meeting. I think we have an implementation that doesn't punish existing code, but it punishes match.indices usage.

@msaboff I'd be interested in discussing this flag. I don't have an issue with adding a flag to be explicit about returning indices if that is what's needed to avoid overhead for existing cases, but I disagree with using n. The n flag is used by both Perl and .NET to specify explicit captures, and I have a laundry list of additional RegExp features I'd like to propose in the near future that includes the n flag: https://gist.github.com/rbuckton/2f262b5298d4b2031cb7e0d5a1a62e19

We could keep things simple and use I for indices, as that is not a legal flag today. Regardless as to what we end up using, I would recommend we avoid using a flag character that already has a different meaning in another language to avoid confusion and give ourselves room to adopt RegExp functionality common in other languages.

Using a flag mitigates the perf issues. I'm not beholden to n, but would prefer a lower case letter.

rbuckton commented 3 years ago

I don't mind o, as it doesn't seem to be used in the RegExp engine's I've looked at, but that would better align with renaming the indices array to offsets (which I am also not opposed to, if consistency is preferred).

msaboff commented 3 years ago

I don't mind o, as it doesn't seem to be used in the RegExp engine's I've looked at, but that would better align with renaming the indices array to offsets (which I am also not opposed to, if consistency is preferred).

I chose o because of offsets.

It is a more involved change to the proposal to use offsets. I don't care if the property is indices or offsets. It makes sense to align with other languages if possible. I haven't researched what other languages do for this feature.

rbuckton commented 3 years ago

It depends on the language, most offer indices as part of the match result in some way, as referenced in the Prior Art of the explainer.

My original hope was that the internal representation of a RegExp exec result would be a wrapper over the original input string and a list of start/end offsets and would emulate a native Array (and as such be unobservable to the end user that there was a difference) such that that when accessed via match[0] it would produce the string and via match.indices[0] would produce the start/end tuple, but seems like that would result in a lot of work to work around v8 and JSC performance for fast array access.

msaboff commented 3 years ago

My original hope was that the internal representation of a RegExp exec result would be a wrapper over the original input string and a list of start/end offsets and would emulate a native Array (and as such be unobservable to the end user that there was a difference) such that that when accessed via match[0] it would produce the string and via match.indices[0] would produce the start/end tuple, but seems like that would result in a lot of work to work around v8 and JSC performance for fast array access.

Turns out fast array access wasn't an issue. For JSC, and probably other engines as well, the matching code returns the start/end offsets with the numeric values the proposal specifies as native C++ ints. We currently allocate the space to store those offsets on the C++ stack, before invoking the matching code. C++ stack allocation is free.

I believe that our RegExp exec result is a wrapper over the original input string that when accessed via match[0] produces a string. The perf rub comes from all the allocations and population of match.indices and the object tree under it.

rbuckton commented 3 years ago

I've created #47 to discuss specific mitigations.

rbuckton commented 3 years ago

I believe that our RegExp exec result is a wrapper over the original input string that when accessed via match[0] produces a string. The perf rub comes from all the allocations and population of match.indices and the object tree under it.

If I understand what you're saying, the "array" returned by exec in JSC isn't a plain old JS array, but rather something that "acts" like an array and synthesizes the match elements lazily? Is there a reason this approach wouldn't also work for indices?

msaboff commented 3 years ago

If I understand what you're saying, the "array" returned by exec in JSC isn't a plain old JS array, but rather something that "acts" like an array and synthesizes the match elements lazily? Is there a reason this approach wouldn't also work for indices?

We currently create the equivalent of a C array of int's on the stack or heap and then call the matcher. Removing some details we basically do:

    int numberCaptures = ...; // From the RegExp object we are matching.
    int offsetVectorSize = (numberCaptures + 1) * 2;   // Add 1 for the main match and multiply by 2 for start & end

    // We actually use an internal Vector object to allocate a default size inline on the stack.
    // If we need more than the default, we malloc a larger array and use it.
    int offsetVector[offsetVectorSize];

    int matchResult = executeMatch(string.characters(), startOffset, string.length(), offsetVector, regExpContext);

We need to convert that C array of ints to a JS Array of Numbers, either when the match succeeds or lazily on first access. Both approaches have perf issues. JSC uses a boxed integer format for JS Numbers. Before we box them, we need to use them as string offsets to create the string results. Then we need to box them.