The caret anchors the pattern to the start of the string. Only "abadalab"
starts at the start of the string.
"findall" normally performs a series of searches, each search starting from
where the previous one ended, so the substrings found won't be overlapping. But
if the "overlapped" flag is turned on, each search starts from one character
beyond where the previous one _started_, allowing you to find overlapping
substrings.
Original comment by re...@mrabarnett.plus.com
on 21 May 2011 at 12:19
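The restart rule described above can be sketched with the standard library alone. This is only an emulation under assumptions, not how the regex module implements its "overlapped" flag: the helper name findall_overlapped is hypothetical, and it returns the whole match text rather than groups.

```python
import re

def findall_overlapped(pattern, text):
    """Emulate an overlapped search: each new search starts one
    character after the point where the previous match *started*."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        match = compiled.search(text, pos)
        if match is None:
            break
        results.append(match.group(0))
        # Plain findall would continue from match.end() instead.
        pos = match.start() + 1
    return results

print(re.findall(r"a.a", "ababa"))          # ['aba']
print(findall_overlapped(r"a.a", "ababa"))  # ['aba', 'aba']
print(findall_overlapped(r"^a.a", "ababa")) # ['aba']
```

The last call shows the caret behaviour: "^" still matches only at the real start of the string, so moving the search position forward cannot produce a second anchored match.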
You're right, what a douche I am; the example that I provided is useless.
Let me try to make my point again. I don't know whether this kind of regular
expression is valid in any regex interpreter. I hope you can clarify this
for me.
Is there any reason why you don't include overlapping matches that start on
_the same_ letter? Let me try with a new example below:
Input string: ' x one something and another something'
I want to get all the 'something's that have an 'x' before and whatever other
stuff in between. Here, I would like to match: 'x one something' and 'x one
something and another something'
I would have hoped regex.findall(r"x.*something", " x one something and another
something", overlapped=True) would produce that result. But like you said, after
the last x.*something match is found, you advance a place and the second match
is not found. I can find the other match if I do
regex.findall(r"x.*?something", ...), but I am toast if there is a third match
in the middle.
Is this achievable with regular expressions at all? Why are the two results
above not considered an overlap?
Thanks for your patience
Original comment by jcerr...@gmail.com
on 22 May 2011 at 4:48
I guess one solution, which works with regex.0.1.20110514 but not with the
default Python re module (or with Perl v5.10.1, for that matter), is to use a
variable-length lookbehind pattern:
regex.findall(r"(?<=x.*)something", ...)
Original comment by jcerr...@gmail.com
on 22 May 2011 at 5:03
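For readers limited to the standard library, here is a rough sketch of the same idea without a variable-length lookbehind. The helper name spans_after_x is hypothetical, and the sketch assumes a single-line input (so that ".*" could have spanned everything between the 'x' and the word).

```python
import re

text = " x one something and another something"

# The standard re module is expected to reject a lookbehind whose
# width is not fixed, so the attempt is guarded.
try:
    re.compile(r"(?<=x.*)something")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

def spans_after_x(text, word="something", marker="x"):
    """Return (start, end) for each `word` with `marker` somewhere before it."""
    first = text.find(marker)
    if first == -1:
        return []
    return [m.span() for m in re.finditer(re.escape(word), text)
            if m.start() > first]

first_x = text.find("x")
for start, end in spans_after_x(text):
    # Rebuild the 'x ... something' substring that ends at this occurrence.
    print(repr(text[first_x:end]))
```

Each span found this way marks one 'something' preceded by an 'x', which is what the lookbehind pattern selects; slicing from the first 'x' recovers the two substrings asked for earlier in the thread.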
A regex supports greedy match ".*" and lazy match ".*?" (lazy match was a later
addition). I don't know of a regex implementation which supports what you're
asking for. There are also the implementation details to work out...
How much demand would there be for it, anyway?
Although it's a form of pattern matching, and regex is pattern matching, it's
not really a regex kind of thing.
Original comment by re...@mrabarnett.plus.com
on 22 May 2011 at 5:20
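A small sketch of the greedy versus lazy distinction mentioned above, using the standard re module on the example string from this thread. Either quantifier yields exactly one match per starting position, which is why neither returns both substrings.

```python
import re

text = " x one something and another something"

# Greedy ".*" extends as far as possible before the final "something".
greedy = re.search(r"x.*something", text).group(0)

# Lazy ".*?" stops at the first "something" it can reach.
lazy = re.search(r"x.*?something", text).group(0)

print(greedy)  # x one something and another something
print(lazy)    # x one something
```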
Yeah, I don't know how much demand there would be for this. And I already
solved what I needed with the variable-length lookbehind, which seems to be
working fine.
I also understand about the additional complexity of the implementation.
Without knowing how it's currently implemented, I can imagine moving forward
one step after every match must simplify the implementation.
Just to be clear, my only problem was that when I saw the availability of the
'overlapped=True' flag, I thought it was reasonable to assume it would also
find overlapping matches that start on the same character. Here is a much
simpler example: take the string 'abb' and the pattern 'a.*b'.
'ab' and 'abb' are both valid, overlapping matches imho.
I'm not pushing hard for any change or implying demand here, just trying to
clarify what my confusion was, in case it helps other potentially confused
users :-)
Original comment by jcerr...@gmail.com
on 22 May 2011 at 5:27
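For what it's worth, the "every match from the same start" behaviour can be approximated by brute force with the standard library. This is a sketch under assumptions, not a feature of any regex engine mentioned here: the helper all_matches is hypothetical, it tries every (start, end) slice, so it costs a quadratic number of match attempts, and anchors or lookarounds in the pattern will see the slice end as the end of the string.

```python
import re

def all_matches(pattern, text):
    """Return every substring that the pattern matches in full,
    including several matches that begin at the same position."""
    compiled = re.compile(pattern)
    found = []
    for start in range(len(text) + 1):
        for end in range(start, len(text) + 1):
            # fullmatch with pos/endpos tests exactly text[start:end].
            if compiled.fullmatch(text, start, end):
                found.append(text[start:end])
    return found

print(all_matches(r"a.*b", "abb"))  # ['ab', 'abb']
print(all_matches(r"x.*something", " x one something and another something"))
```

On short inputs this is fine; on long strings a dedicated scan, such as locating each ending word and slicing back to the start marker, would be far cheaper.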
Original issue reported on code.google.com by
jcerr...@gmail.com
on 20 May 2011 at 11:02