mrabarnett / mrab-regex

Other
430 stars 49 forks source link

Unexpected result for findall, overlapped=True with flexible quantifiers #515

Closed leogott closed 9 months ago

leogott commented 9 months ago

During yesterday's (2023-12-12) advent of code, I tried to be smart and use regex for finding ways to arrange fixed-length, ordered groups in a simple string. (The problem is a version of picross.)

Using this regex module, I wanted to get/count all the ways a pattern could match a string. In the process I encountered a strange behavior of this regex module

MWE

>>> import regex as re
>>> pattern = r"(?:[^#]+?)([^.]{1}(?!#))"  # relaxed match any number of chars then capture the next unless followed by hash
>>> greedy = ... # the above, but with + instead of +?
>>> arg = ".abcde"
>>> re.findall(pattern*2, arg, overlapped=True)
[(a, c), (b, d), (c, e)]  # also expected (a, d), (a, e), (b, e)
>>> re.findall("^"+pattern*2, arg, overlapped=True)
[(a, c)]
>>> re.findall(greedy*2, arg, overlapped=True)
[(c, e), (c, e), (c, e)] # yes, really

While I'm pretty sure I'm doing something wrong, I'm also almost certain I found some kind of bug.


python 3.12.0 hab00c5b_0_cpython conda-forge regex 2023.10.3 py312h98912ed_0 conda-forge

mrabarnett commented 9 months ago

findall normally continues searching from where the current match ended.

An overlapped findall, on the other hand, continues searching 1 character after where the current match started, meaning that the next match might overlap the current one. It does not look for all of the ways that it could match at a certain position.

Below is a shortened description of how your example code is matching.

Lazy pattern, simplified to [^#]+?([^.](?!#))[^#]+?([^.](?!#)).

[^#]+?         matches "."
([^.](?!#))    matches "a" (captured)
[^#]+?         matches "b"
([^.](?!#))    matches "c" (captured)

First match is ('a', 'c').

Advance by 1 character.

[^#]+?         matches "a"
([^.](?!#))    matches "b" (captured)
[^#]+?         matches "c"
([^.](?!#))    matches "d" (captured)

Second match is ('b', 'd').

Greedy pattern, simplified to [^#]+([^.](?!#))[^#]+([^.](?!#)).

[^#]+          matches ".abcde"
([^.](?!#))    fails to match, so backtrack

[^#]+          matches ".abcd"
([^.](?!#))    matches "e" (captured)
[^#]+          fails to match, so backtrack

[^#]+          matches ".abc"
([^.](?!#))    matches "d" (captured)
[^#]+          matches "e"
([^.](?!#))    fails to match, so backtrack

[^#]+          matches ".ab"
([^.](?!#))    matches "c" (captured)
[^#]+          matches "de"
([^.](?!#))    fails to match, so backtrack

[^#]+          matches ".ab"
([^.](?!#))    matches "c" (captured)
[^#]+          matches "d"
([^.](?!#))    matches "e" (captured)

First match is ('c', 'e').

Advance by 1 character.

[^#]+          matches "abcde"
([^.](?!#))    fails to match, so backtrack

[^#]+          matches "abcd"
([^.](?!#))    matches "e" (captured)
[^#]+          fails to match, so backtrack

[^#]+          matches "abc"
([^.](?!#))    matches "d" (captured)
[^#]+          matches "e"
([^.](?!#))    fails to match, so backtrack

[^#]+          matches "ab"
([^.](?!#))    matches "c" (captured)
[^#]+          matches "de"
([^.](?!#))    fails to match, so backtrack

[^#]+          matches "ab"
([^.](?!#))    matches "c" (captured)
[^#]+          matches "d"
([^.](?!#))    matches "e" (captured)

Second match is ('c', 'e').