google / re2

RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.
BSD 3-Clause "New" or "Revised" License
8.94k stars 1.13k forks source link

Support negation of entire pattern (not look-ahead/-behind) #441

Closed Roy-Orbison closed 1 year ago

Roy-Orbison commented 1 year ago

Many services, including Google's own, allow input of an RE2 pattern to perform some task, but only as a positive match. They frequently have no application-level provision to negate the pattern, and aren't interested in feature change requests. A pattern sequence could be provided to achieve this, without the danger assertions pose, by specifying one that would otherwise cause a syntax error. Beginning a pattern with (?!) could work as it is similar to PCRE's negative look-ahead assertion grouping but contains nothing. The matching process could consume the sequence as a flag, perform normal, linear matching with the remainder of the pattern, then apply the flag to negate the final result.

Apache does something similar with the ! pattern prefix, but this prefix would not be backwards-compatible with RE2.

As it is now, the only way to do a negative match, when the user has no control over entire pattern negation, is to write monstrosities like this:

^(?:[^w]+|w[^o]|wo[^r]|wor[^d])*(?:|w(?:|o(?:|r)))$

This example prevents matching word but still suffers from the Scunthorpe problem.

junyer commented 1 year ago

If anything, I think this should really be done with an option like (*NEGATE) that must appear at the beginning of the pattern string. (See https://www.pcre.org/current/doc/html/pcre2pattern.html#SEC2 for various options in a remarkably similar vein.) However, this would never be done for RE2 without prior agreement/arrangement to do likewise for the Go regexp package and RE2/J and preferably the Rust regex crate as well. Please feel free to steer this feature request through this two-phase commit protocol and loop me in as you see fit. I will reopen this issue iff the aforementioned projects reach some consensus.

rsc commented 1 year ago

I think (?!) is a PCRE expression that never matches anything at all. Repurposing it with a different meaning seems like a mistake.

Making up an RE2-specific syntax also seems like a mistake.

Our answer from the beginning has been that if you want to incorporate boolean formulas on regexps such as NOT and AND, the place to do that is in a higher-level API above the regexp API.

Also the pattern given for matching any text that does not contain "word" is incorrect, further driving home that the best way to write this is !RE2::PartialMatch(text, "word").

BurntSushi commented 1 year ago

Additionally, there would probably be API oddities to figure out. For example, what happens when you ask for submatches for one of these negated patterns? Or what happens when you ask for the match offsets? There are perhaps answers to those questions, but they're likely to be a bit ham-fisted I think.

I overall agree that this should be handled at the application level.

Roy-Orbison commented 1 year ago

(*NEGATE) would be great, I only said the example sequence "could work", I don't care what it is. I also agree that these things should be handled at the application level, but it's rather naive to expect that can be a reality. Every single software author, and all future authors, would have agree to modify all their existing and future code to ensure their users have a negation option where it might be relevant, i.e. impossible.

I'm perfectly aware that the example pattern is not the way negation should be done, but I can't put !RE2::PartialMatch(text, "word") into some random organisations' codebase. When one uses a web-based service that allows input of an RE2, but not a negative match option, I am forced to use an ugly expression. This is a common case because RE2 behaves well on untrusted expressions.

Matching groups or offsets are not relevant. If one wants negation, the result would be the same as for a positive pattern that failed to match, hence the result would contain neither.

BurntSushi commented 1 year ago

Matching groups or offsets are not relevant. If one wants negation, the result would be the same as for a positive pattern that failed to match, hence the result would contain neither.

And in the case when the pattern matches...? You're only considering the non-match case.

Roy-Orbison commented 1 year ago

You're only considering the non-match case.

Because negation implies a boolean operation, e.g. in a filter. If the result was positive, the match result would be the entire subject string with no subgroups, because the subgroups in the pattern would not have been found.

BurntSushi commented 1 year ago

the match result would be the entire subject string with no subgroups

That's exactly the problem... Today, you can assume that if abc(\w+) matches, then there must exist a valid group at indices 0 and 1 with match offets. But the presence of your proposed negation operator means that assumption is no longer true.

You can't wave this away because you have to define the behavior of a negated regex on these other APIs. You can't just say "this only applies to a boolean match" because regex objects have more operations defined on them than boolean matches.

There are more issues lurking here I think. That is to say, I don't think this feature request is fully specified.

junyer commented 1 year ago

FWIW, Sulzmann and van Steenhoven addressed this issue of capturing groups within negation in A Flexible and Efficient ML Lexer Tool based on Extended Regular Expression Submatching section 4.1:

Operation ¬ effectively cancels any submatchings which arise below negation by simply recording the identity matching function id. The reason is that we can not give any well-defined meaning to these submatchings. For example, consider x : ¬(y : a∗). Suppose the pattern matches some word. Then, pattern variable x will bind any word not containing any letter a. Clearly, the binding of y is nonsensical here because (due to the outer negation) there cannot be any match for a∗.

I had wondered whether RE2 could optimise away any capturing groups of a negated regular expression, but this code would prevent that. Moreover, like @BurntSushi said, the caller would need to be prepared for the regular expression to match, but with null capturing groups. Note that "change the caller to handle this" is strictly worse than "change the caller to handle the negation at the application level" because not doing the former breaks an entire class of existing usage. At this point, I'm not sure that there's any satisfactory path forwards.

P.S. Hyperscan has supported logical combinations for the last five years. It's worth studying the design and implementation of that feature if you wish to pursue this matter further. Especially because I wouldn't be keen to plumb negation through the internals of RE2::Set and FilteredRE2.

Roy-Orbison commented 1 year ago

You can't wave this away

I didn't. I said the match would be the input string, i.e. only the 0th group and no subgroups, not null groups.

you can assume that if abc(\w+) matches, then there must exist a valid group at indices 0 and 1 with match offets.

Indeed, but if the pattern were (*NEGATE)abc(\w+), then that same inference would be incorrect as one would be expecting the entire string to not match abc(\w+), so the positive result would be a match on the entire input string.

If it's as difficult to achieve as @junyer outlines, then that's a pity. Having [^…] as the only negation inside patterns is very limiting.

BurntSushi commented 1 year ago

Indeed, but if the pattern were (*NEGATE)abc(\w+), then that same inference would be incorrect as one would be expecting the entire string to not match abc(\w+), so the positive result would be a match on the entire input string.

This doesn't address the assumption that callers can make today about all possible regex patterns. It fails to hold with this new negation operator.

Bottom line is that this feature request needs far more detail to be fully specified. And I agree that this is not easy to implement and likely has more as yet seen difficulties beyond what has been pointed out so far.

I didn't. I said the match would be the input string, i.e. only the 0th group and no subgroups, not null groups.

You are though, because in some of the regex engines (like the pikevm) the 0th group is treated no differently than the other groups.

Having [^…] as the only negation inside patterns is very limiting.

The limits are exactly the point of using an engine like RE2 in the first place.

Roy-Orbison commented 11 months ago

You are though, because in some of the regex engines (like the pikevm) the 0th group is treated no differently than the other groups.

Let go of the assumption that matching groups should appear in the result. The result would be that of a positive pattern, e.g. .*, i.e. no subgroups. Internally using subgroups to consider a negation does not necessitate a their (illogical) existence in a result.

The limits are exactly the point of using an engine like RE2 in the first place.

Which I acknowledged already. There is such a thing as too limiting. I'me seeing RE2 in all sorts of places now, crippling simple tasks.

BurntSushi commented 11 months ago

You quoted me out of context unfortunately. It's not just my assumption.

Roy-Orbison commented 11 months ago

Why should any caller be able to look for parentheses in a pattern, without actually parsing it, then assume all possible results will contain subgroups? None of what you've said attests that these assumptions are correct. Patterns like a|b(c) currently may or may not contain a subgroup in the result depending on the input. A valid negation pattern could also be restricted to using only non-capturing groups, the result would still be the entire string with no subgroups.

It's like a person asking "Can you look in this barrel to see if there are no apples?" and having the searcher's answer be "I cannot tell you because I don't know exactly where to tell you they aren't."

k-takata commented 11 months ago

(This may be out of context.) I introduced the absence operator (?~word) in Onigmo a few years ago and it has been available in Ruby since 2.4.1. Document: https://github.com/k-takata/Onigmo/blob/dd8a18af5c2f2871104b1bdbf3bbb597ec9e4665/doc/RE#L311-L341 See also: https://github.com/k-takata/Onigmo/issues/87

E.g.: (?~word) matches: "", "w", "wo", "wor", "foobar", etc. (?~word) doesn't match: "word", "wwword", "wwwordddd", etc.

/\*(?~\*/)\*/ matches C style comments: /**/, /* foobar */, etc. \A/\*(?~\*/)\*/\z doesn't match /**/ */. This is different from \A/\*.*?\*/\z which uses a reluctant quantifier (.*?).

It was originally proposed in Tanaka Akira's paper [PDF] (which was written in Japanese). He said that this operator is regular (in the meaning of formal language theory). So, I think that this can be also implemented in DFA.

junyer commented 11 months ago

Like I said back in August, RE2 won't unilaterally add syntax. Navigating the Go proposal review process may not be a fruitful exercise, but the discussion here has run its course and, as such, I assure you that continuing will not be a fruitful exercise. :)