Quantifiers on parenthetical groups

jcgoble3 commented 8 years ago

A "nice to have" feature would be the ability to apply quantifiers to groups and not just individual characters/classes/sets. Probably not possible without a new MatchState (MatchStateE for Enhanced?) that has space to track backtracking points.

The current code uses recursive calls over the match function to handle backtracking (match returns NULL on no match, which either unwinds the stack all the way to signal no match at all or is caught somewhere down the stack by a caller that performs the backtracking and makes another recursive call). However, the quantifiers handle all their matches with only one recursive call, adding or subtracting another character each time NULL is returned. Here, each iteration could match different characters and a different number of bytes, so doing it that way would require a separate recursive call for each match, easily triggering a stack overflow.

Need to study the code for Python's re module here to understand how the matching works there; that would probably help a lot here.

jcgoble3 commented 7 years ago

Blind comment, without having studied the code in forever: maybe the current method of matching quantifiers doesn't require a recursive match call per iteration for this. The current single-character quantifiers call singlematch, which returns a boolean for success or failure, and match returns a pointer to the next char in the string following the match or NULL on failure. So maybe call match over the subpattern, and use the return value to determine whether to backtrack? An array of backtracking points is still needed for greedy quantifiers (not for the lazy -, as it will never have to give up a match), so that will have to be taken into consideration. Might need to deal with malloc/realloc/free here (unless it's possible to resize a Lua userdata, but the former might be a better idea anyway).

jcgoble3 commented 7 years ago

Hmm. The problem with that is that match currently receives the pattern end solely through the MatchState. So match would need to be modified to take an additional argument, that being a pointer to the end of the subpattern (and thus the point at which to return success/failure). This is possible, as all uses of the pattern end outside of match deal with exceptions from patterns unexpectedly ending, and thus need the "real" pattern end in any case.

Actually, ( and ) currently result in match calling start_capture and end_capture, respectively, and those handle the parenthesis and then recursively call match to complete the match or fail and backtrack. So it seems that some extra handling of these characters is needed in match itself to recognize these before beginning the match, yet captures still need to be handled by the *_capture functions.

Possible solution: change up the syntax for quantifiers on groups. Make all quantified groups automatically non-capturing (see issue #8), and make the quantifier the first character after the (, followed by a :. So a greedy repeating group would start with (+: or (*:, a lazy repeating group would start with (-:, and so on. This has the benefit of being able to be generalized to {m,n} and other multi-character quantifiers (see issue #19). With this proposal, optional groups would start with (?:, however, which conflicts with the general syntax for non-capturing groups.

Perhaps retain the ? as the opening character of any special group, followed by the quantifier, followed by :? This would have the advantage of not conflicting with any regex flavor at all. No notable regex flavor uses (?* or (?? for any purpose, and (?- and (?+ are only used as relative subroutine calls (which isn't something this module will ever get into), but in that case are followed by a number and then a ), never by a :.

Captures inside the group of course must still be handled, and there is the possibility that they could match multiple times. Most regex flavors return the last match, but that is likely to be complicated. Really, there's no reason to overwrite captures at all. I think we should ban captures inside any quantified group, unless the quantifier is a ? (including lazy and possessive forms), which can only match once at most.

Then, should a (??: group be capturing or non-capturing? I think it should non-capturing; if someone wants to capture it, it's easy enough to put the subpattern inside a separate set of parentheses.

jcgoble3 commented 7 years ago

Actually, this still has the issue with finding the end of the group that can succeed with zero matches, and with properly counting capture groups that don't participate in the match in that case. The latter could potentially be worked around by not allowing captures even in the case of a (??: group. The former is tricker, but potentially reusing the matchbalance machinery (turn matchbalance into a front for an internal function that works without a MatchState and can handle both cases) would work.

Really, though, I generally don't like the idea of inventing syntax like this, or banning captures like this. Perhaps it would be best to investigate issue #10 regarding compiling patterns to bytecode instead.

jcgoble3 commented 5 years ago

Idea: Modify matchbalance as described in the previous comment. Instead of start_capture and end_capture, call a handlegroup function when encountering a (. handlegroup finds the end of the group with the internal version of matchbalance, then checks the next character(s) for a quantifier and records the start of the capture (unless non-capturing). Taking the quantifier into account, handlegroup creates a new MatchState for the "subpattern", then calls match with the new MatchState. When match returns, it saves the capture (or a placeholder if the group succeeds with zero matches) and upon satisfying the quantifier calls match with another new MatchState representing the rest of the pattern (if any). (Or can the original MatchState be reused? Haven't looked at the code in well over a year.)

Lazy and possessive quantifiers should be easy on groups; possessive is non-backtracking and lazy just means trying the rest of the pattern first and between each subgroup match. Greedy is the tricky part, but a Lua table of backtracking data should solve that. Also, if we can determine that the subgroup is constant length (always matches the same number of characters, so no quantifiers at minimum) and what that length is (call it x), we can avoid a list of backtracking data and just retreat x characters on each failure (similar to the existing single-character * and + quantifiers that retreat by 1 for each failure). Perhaps an enhanced version of the matchbalance internal for this?

jcgoble3 commented 5 years ago

Given #7, handlegroup could also check for | (in enhanced patterns) and handle the alternation. Then, since | could appear outside of a group, handlegroup becomes the entry point for the entire matching process (it would check whether its first character is ( to see if it is the main call).

jcgoble3 / lua-matchext

Quantifiers on parenthetical groups #6