substring capture - Githubissues

sebfisch commented 14 years ago

Currently it is not possible to identify substrings that matched against (parenthesized) parts of a regular expression. I have ideas how to extend the algorithm to achieve this but did not work them out yet.

sebfisch commented 14 years ago

Chris Kuklevicz has sent me comments about submatch capture via email:

I will jump right in with a small and hard and illuminating case for getting the correct POSIX "re-interpretation" answer.

This is an example my slowly developing OCaml version just spit up, lightly edited:

Pattern: (abc|de|fg|a|bcd|efg|ab|cdef|g|d|cde|cd|ef)* Text: abcdefg

The order of items in the alternation does not matter. Every possible way to match (I count 22), starting at 0, gives these results:

groupCap ((0 7) (5 7)) history ((tagA (0 7 1 5 0)) (repA (0)) (orbitA ((5 3))))
groupCap ((0 7) (4 7)) history ((tagA (0 7 1 4 0)) (repA (0)) (orbitA ((4 3))))
groupCap ((0 7) (6 7)) history ((tagA (0 7 1 6 0)) (repA (0)) (orbitA ((6 4 3))))
groupCap ((0 7) (6 7)) history ((tagA (0 7 1 6 0)) (repA (0)) (orbitA ((6 2))))
groupCap ((0 7) (5 7)) history ((tagA (0 7 1 5 0)) (repA (0)) (orbitA ((5 2))))
groupCap ((0 7) (4 7)) history ((tagA (0 7 1 4 0)) (repA (0)) (orbitA ((4 2))))
groupCap ((0 7) (6 7)) history ((tagA (0 7 1 6 0)) (repA (0)) (orbitA ((6 4 2))))
groupCap ((0 7) (4 7)) history ((tagA (0 7 1 4 0)) (repA (0)) (orbitA ((4 1))))
groupCap ((0 7) (6 7)) history ((tagA (0 7 1 6 0)) (repA (0)) (orbitA ((6 4 1))))
groupCap ((0 6) (4 6)) history ((tagA (0 6 1 4 0)) (repA (0)) (orbitA ((4 3))))
groupCap ((0 6) (2 6)) history ((tagA (0 6 1 2 0)) (repA (0)) (orbitA ((2))))
groupCap ((0 6) (4 6)) history ((tagA (0 6 1 4 0)) (repA (0)) (orbitA ((4 2))))
groupCap ((0 6) (4 6)) history ((tagA (0 6 1 4 0)) (repA (0)) (orbitA ((4 1))))
groupCap ((0 5) (3 5)) history ((tagA (0 5 1 3 0)) (repA (0)) (orbitA ((3))))
groupCap ((0 5) (2 5)) history ((tagA (0 5 1 2 0)) (repA (0)) (orbitA ((2))))
groupCap ((0 4) (3 4)) history ((tagA (0 4 1 3 0)) (repA (0)) (orbitA ((3))))
groupCap ((0 4) (2 4)) history ((tagA (0 4 1 2 0)) (repA (0)) (orbitA ((2))))
groupCap ((0 4) (1 4)) history ((tagA (0 4 1 1 0)) (repA (0)) (orbitA ((1))))
groupCap ((0 3) (0 3)) history ((tagA (0 3 1 0 0)) (repA (0)) (orbitA (())))
groupCap ((0 2) (0 2)) history ((tagA (0 2 1 0 0)) (repA (0)) (orbitA (())))
groupCap ((0 1) (0 1)) history ((tagA (0 1 1 0 0)) (repA (0)) (orbitA (())))
groupCap ((0 0) (-1 -1)) history ((tagA (0 0 1 -1 -1)) (repA (0)) (orbitA (())))

The matches are sorted in decreasing order of preference.

The first groupCap ((0 7) (5 7)) is the correct POSIX result. The overall match is between indexes 0 and 7 and the last use of the parenthesized expression matches between 5 and 7. So the last *-repetition captures "fg".

The worst match is at the bottom, matching nothing at the start.

The special thing to note is: of all the (0 7) captures the ones with (4 7) and (6 7) for the last repetition are not correct. Only (5 7) is correct. Thus the length of the last capture is not enough information.

The "history" is like the "mark", and the value above are actually sorted by a special comparison on the history values. While "history" is an internal implementation specific detail, it may help illustrate what I found was needed when choosing the "best" of two histories:

In tagA: Entries 0 and 1 of tagA are the start and end of the whole match and of the whole -subpattern. Entry 2 of tagA is "1" meaning that the -subpattern is finished and to please go compare the corresponding orbitA list of values. Entry 3 is the index of the most recent start of the parenthesized group (worst match has -1 since it was skipped). The final 4th entry is "0" meaning that the parenthesized group was successfully captured (except the worst match where the last "-1" in tagA means it was not captured).

In repA: the value is "0" since the matching is done. During the match of the -contents repA held a "1". For /ab{4,5}c/ the repA would start at 0 and then count from 1 to 4 or 5 while matching b{4,5} and then reset to 0. Thus "repA" is part of the identity of the state. If b{4,5} would be expanded into 5 copies of b then the repA value indicates which of the 5 copies the state would be associated with. For a -repetition there is only one copy so repA is unneeded here. If there were many nested repetitions then repA would be a longer array.

In orbitA: These are the intermediate positions in the text where the -operator looped, in a linked list stored in reverse order. These are needed to sort the histories properly. There is one list for each repetition operator such as `/a/,/b+/,/c{3}/,/d{7,10}/,/e{6,}/`.

The top line has the *-subpattern spanning (0 7) and the intermediate positions in orbitA are (5 3). Thus it used these repetitions:

(0 3) "abc"
(3 5) "de"
(5 7) "fg"

The lengths of these repetitions are greedy from left to right, constrained by the overall leftmost longest match. This is the winner!

The history comparison function is complicated; the comparison order is:

0th tag : smallest value is preferred
1st tag : largest value is preferred
2nd tag is for the *-orbit, it means go check the orbitA list:

The orbit lists are reversed and compared element by element:
- the largest value is preferred
- or first list to end is preferred
3rd tag : largest value is preferred
4th tag : skipped (used for computing the groupCap)

Side note: Aside from the 0th tag (leftmost) all comparisons will favor the largest index value (longest). The number and order and meaning of the tagA entries is determined by analyzing the regular expression tree.

A careful eye notices that there are two groupCap ((0 7) (5 7)) lines:

groupCap ((0 7) (5 7)) history ((tagA (0 7 1 5 0)) (repA (0)) (orbitA ((5 3))))
groupCap ((0 7) (5 7)) history ((tagA (0 7 1 5 0)) (repA (0)) (orbitA ((5 2))))

The second line is the series of *-repetitions:

(0 2) "ab"
(2 5) "cde"
(5 7) "fg"

The first repetition "ab" was not as greedy as the winning first line's "abc". When the orbitA lists are reversed the head entry of the first list is bigger (3 versus 2) and thus is preferred.

In an optimized matching engine those two possibilities would collide at the same NFA state after "abcde" and the second would be discarded.

The ever-growing list in orbitA is what makes this implementation have unbounded storage. The pre-comparison trick I mentioned before can convert this to bounded-storage.

For a long list of interesting patterns regarding substring captures to use as unit tests you can look at the .txt files inside the regex-posix-unittest package on hackage.

Side note: the 3rd tagA entry and the head orbitA entry (if present) agree merely because the end of the penultimate repetition is also the start of the parenthesized group in the ultimate repetition.

sebfisch commented 14 years ago

It will be interesting to see whether the POSIX preference can be expressed in both our implementation and its specification. After skimming your explanation, my impression is that the most greedy match wins, and it seems that a simple exact explanation of the POSIX preference is hard to give (without going into implementation details).

I shall probably use the unit tests to test the implementation, thanks!

sebfisch / haskell-regexp

substring capture #8