tc39 / proposal-regexp-atomic-operators

https://rbuckton.github.io/proposal-regexp-atomic-operators
BSD 3-Clause "New" or "Revised" License
16 stars 1 forks source link

Alternative: use a flag instead #1

Open pygy opened 2 years ago

pygy commented 2 years ago

Hello, possessive matching is something that I would love to have and make accessible so that people can rely on RegExps without second thoughts about ReDOS, but I wonder if operators and groups are the correct solution for this, as I can't imagine a scenario where I would like to combine backtracking and possessive matching in a single expression.

The traditional operators make sense in the context of small searchers (so that you can e.g search for /\w+Suffix/ without ceremony), but for lexers, sticky, possessive matching is what you want usually, and having to resort to even more unreadable operators would be sub-optimal.

Having this as a global flag would also keep the greedy non-greedy distinction in possessive context, which I find useful.

slevithan commented 4 months ago

I can't imagine a scenario where I would like to combine backtracking and possessive matching in a single expression.

Over the years it's been relatively common for me to want to mix atomic groups or possessive quantifiers and non-atomic versions of them. It's hard to give an example off the bat because it's generally most needed when trying to improve the performance of longer, complex regexes. But you can't just make everything in them atomic/possessive because the changes to backtracking behavior can easily change what a regex matches.

A simple example (from here): a(bc|b)c matches abcc and abc, whereas a(?>bc|b)c matches only abcc and not abc. Changing all greedy quantifiers to possessive would have the same sort of often-not-intended/desirable behavior.

A simple example with some actual utility: \b(?>integer|insert|in)\b quickly fails when given the target string integers, without backtracking. This is a nice optimization that never changes what the non-capturing group matches thanks to the \b that follows, but it also doesn't need a flag. Implementations can detect cases like this and automatically apply atomic behavior. I believe that newer versions of .NET, at least, include just such an optimization.

having to resort to even more unreadable operators would be sub-optimal.

IMO (?>...) is at least equally readable compared to (?:...).

Having this as a global flag would also keep the greedy non-greedy distinction in possessive context

Wouldn't possessive + non-greedy just be equivalent to strictly using the lower bound of a quantifier? I.e., a?? in possessive form would be equivalent to a{0}, and possessive a{2,5}? would be the same as a{2}.

IMO a global flag would change the meaning of too many regexes (and in sometimes non-obvious ways since it would only change results with certain target strings), and as a result it wouldn't be recommendable as a best practice anyway. Such a flag also has no prior art, AFAIK, and I think for good reason. Killing much-of-but-not-all backtracking via such a flag would be harder to understand than adding something like Chromium's non-standard linear mode (flag l, hidden behind a V8 flag), which IMO would be a better approach to removing any potential for superlinear backtracking. But standardizing something like that (and splitting ECMAScript regexes into two different implementations with different behavior and supported syntax) is a different ballgame than the low-hanging (and much more flexible) fruit of this proposal as it stands.