Open jcgoble3 opened 8 years ago
Blind comment, without having studied the code in forever: maybe the current method of matching quantifiers doesn't require a recursive match
call per iteration for this. The current single-character quantifiers call singlematch
, which returns a boolean for success or failure, and match
returns a pointer to the next char
in the string following the match or NULL
on failure. So maybe call match
over the subpattern, and use the return value to determine whether to backtrack? An array of backtracking points is still needed for greedy quantifiers (not for the lazy -
, as it will never have to give up a match), so that will have to be taken into consideration. Might need to deal with malloc
/realloc
/free
here (unless it's possible to resize a Lua userdata, but the former might be a better idea anyway).
Hmm. The problem with that is that match
currently receives the pattern end solely through the MatchState
. So match
would need to be modified to take an additional argument, that being a pointer to the end of the subpattern (and thus the point at which to return success/failure). This is possible, as all uses of the pattern end outside of match
deal with exceptions from patterns unexpectedly ending, and thus need the "real" pattern end in any case.
Actually, (
and )
currently result in match
calling start_capture
and end_capture
, respectively, and those handle the parenthesis and then recursively call match
to complete the match or fail and backtrack. So it seems that some extra handling of these characters is needed in match
itself to recognize these before beginning the match, yet captures still need to be handled by the *_capture
functions.
Possible solution: change up the syntax for quantifiers on groups. Make all quantified groups automatically non-capturing (see issue #8), and make the quantifier the first character after the (
, followed by a :
. So a greedy repeating group would start with (+:
or (*:
, a lazy repeating group would start with (-:
, and so on. This has the benefit of being able to be generalized to {m,n}
and other multi-character quantifiers (see issue #19). With this proposal, optional groups would start with (?:
, however, which conflicts with the general syntax for non-capturing groups.
Perhaps retain the ?
as the opening character of any special group, followed by the quantifier, followed by :
? This would have the advantage of not conflicting with any regex flavor at all. No notable regex flavor uses (?*
or (??
for any purpose, and (?-
and (?+
are only used as relative subroutine calls (which isn't something this module will ever get into), but in that case are followed by a number and then a )
, never by a :
.
Captures inside the group of course must still be handled, and there is the possibility that they could match multiple times. Most regex flavors return the last match, but that is likely to be complicated. Really, there's no reason to overwrite captures at all. I think we should ban captures inside any quantified group, unless the quantifier is a ?
(including lazy and possessive forms), which can only match once at most.
Then, should a (??:
group be capturing or non-capturing? I think it should non-capturing; if someone wants to capture it, it's easy enough to put the subpattern inside a separate set of parentheses.
Actually, this still has the issue with finding the end of the group that can succeed with zero matches, and with properly counting capture groups that don't participate in the match in that case. The latter could potentially be worked around by not allowing captures even in the case of a (??:
group. The former is tricker, but potentially reusing the matchbalance
machinery (turn matchbalance
into a front for an internal function that works without a MatchState
and can handle both cases) would work.
Really, though, I generally don't like the idea of inventing syntax like this, or banning captures like this. Perhaps it would be best to investigate issue #10 regarding compiling patterns to bytecode instead.
Idea: Modify matchbalance
as described in the previous comment. Instead of start_capture
and end_capture
, call a handlegroup
function when encountering a (
. handlegroup
finds the end of the group with the internal version of matchbalance
, then checks the next character(s) for a quantifier and records the start of the capture (unless non-capturing). Taking the quantifier into account, handlegroup
creates a new MatchState
for the "subpattern", then calls match
with the new MatchState. When match
returns, it saves the capture (or a placeholder if the group succeeds with zero matches) and upon satisfying the quantifier calls match
with another new MatchState
representing the rest of the pattern (if any). (Or can the original MatchState
be reused? Haven't looked at the code in well over a year.)
Lazy and possessive quantifiers should be easy on groups; possessive is non-backtracking and lazy just means trying the rest of the pattern first and between each subgroup match. Greedy is the tricky part, but a Lua table of backtracking data should solve that. Also, if we can determine that the subgroup is constant length (always matches the same number of characters, so no quantifiers at minimum) and what that length is (call it x
), we can avoid a list of backtracking data and just retreat x
characters on each failure (similar to the existing single-character *
and +
quantifiers that retreat by 1 for each failure). Perhaps an enhanced version of the matchbalance
internal for this?
Given #7, handlegroup
could also check for |
(in enhanced patterns) and handle the alternation. Then, since |
could appear outside of a group, handlegroup
becomes the entry point for the entire matching process (it would check whether its first character is (
to see if it is the main call).
A "nice to have" feature would be the ability to apply quantifiers to groups and not just individual characters/classes/sets. Probably not possible without a new MatchState (
MatchStateE
for Enhanced?) that has space to track backtracking points.The current code uses recursive calls over the
match
function to handle backtracking (match
returnsNULL
on no match, which either unwinds the stack all the way to signal no match at all or is caught somewhere down the stack by a caller that performs the backtracking and makes another recursive call). However, the quantifiers handle all their matches with only one recursive call, adding or subtracting another character each timeNULL
is returned. Here, each iteration could match different characters and a different number of bytes, so doing it that way would require a separate recursive call for each match, easily triggering a stack overflow.Need to study the code for Python's
re
module here to understand how the matching works there; that would probably help a lot here.