jamadden / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg

regex 0.1.20110514 findall overlapped not working with 'start of string' expression #10

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Apologies if this is again not the right place to post this.

I'm trying to use regex 0.1.20110514 with the overlapped=True feature.

It works great, unless I have the 'start of string' (caret) character in my 
regular expression:

>>> regex.findall(r"a.*b","abadalaba",overlapped=True)
['abadalab', 'adalab', 'alab', 'ab']
>>> regex.findall(r"^a.*b","abadalaba",overlapped=True)
['abadalab']

If I understand correctly, the second regexp should produce the same 
results as the first one, since all the results are at the beginning of the 
string.

Original issue reported on code.google.com by jcerr...@gmail.com on 20 May 2011 at 11:02

GoogleCodeExporter commented 9 years ago
The caret anchors the pattern to the start of the string. Only "abadalab" 
starts at the start of the string.

"findall" normally performs a series of searches, each search starting from 
where the previous one ended, so the substrings found won't be overlapping. But 
if the "overlapped" flag is turned on, each search starts from one character 
beyond where the previous one _started_, allowing you to find overlapping 
substrings.
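
To illustrate the restart rule described above, here is a minimal sketch that emulates it with the standard-library re module. The helper name findall_overlapped is hypothetical, and this is not a claim about how regex implements the flag internally; it only mirrors the behaviour as described. It also returns whole matches only and ignores capture groups.

```python
import re

def findall_overlapped(pattern, string):
    """Hypothetical helper: emulate an overlapped findall with stdlib re.

    Each new search starts one character after the START of the previous
    match, rather than at its end.
    """
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(string):
        match = compiled.search(string, pos)
        if match is None:
            break
        results.append(match.group(0))
        # Restart just past where this match began, so later matches may
        # overlap it.
        pos = match.start() + 1
    return results

print(findall_overlapped(r"a.*b", "abadalaba"))
# ['abadalab', 'adalab', 'alab', 'ab']
print(findall_overlapped(r"^a.*b", "abadalaba"))
# ['abadalab']
```

With the caret, every restarted search begins past index 0, where "^" can no longer match, so only the first result survives.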

Original comment by re...@mrabarnett.plus.com on 21 May 2011 at 12:19

GoogleCodeExporter commented 9 years ago
You're right, what a douche I am, the example that I provided is useless.
Let me try to make my point again. I don't know if this kind of regular 
expression is valid in any regex interpreter. I hope you can clarify this 
for me.

Is there any reason why you don't include overlapping matches that start on 
_the same_ letter? Let me try with a new example below:

Input string: ' x one something and another something'
I want to get every 'something' that has an 'x' before it, with whatever other 
stuff in between. Here, I would like to match 'x one something' and 'x one 
something and another something'.
I would have hoped regex.findall(r"x.*something"," x one something and another 
something",overlapped=True) would produce that result. But like you said, once 
the greedy x.*something match is found, you advance one place and the second 
match is not found. I can find the other match if I do 
regex.findall(r"x.*?something", ...), but I am toast if there is a third match 
in the middle.

Is this achievable with regular expressions at all? Why are the two results 
above not considered an overlap?

Thanks for your patience

Original comment by jcerr...@gmail.com on 22 May 2011 at 4:48

GoogleCodeExporter commented 9 years ago
I guess one solution, which works with regex 0.1.20110514 but not with the 
default Python re module (or with Perl v5.10.1, for that matter), is to use a 
variable-length lookbehind pattern:

regex.findall(r"(?<=x.*)something", ...)
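
For anyone limited to the standard-library re module, a rough equivalent can be sketched without lookbehind. This is only an illustration under assumptions: the input is a single line, the literals "x" and "something" are taken from the example above, and the helper name spans_from_first_x is hypothetical. It locates each 'something' that follows the first 'x' and rebuilds the span from that 'x' to the end of the match.

```python
import re

text = " x one something and another something"

# Check whether stdlib re accepts the variable-length lookbehind at all.
try:
    re.compile(r"(?<=x.*)something")
    stdlib_accepts_lookbehind = True
except re.error:
    stdlib_accepts_lookbehind = False

def spans_from_first_x(string):
    """Hypothetical helper: every span from the first 'x' to the end of
    each later 'something'. Assumes single-line input."""
    first_x = string.find("x")
    if first_x == -1:
        return []
    return [
        string[first_x:match.end()]
        for match in re.finditer("something", string)
        if match.start() > first_x
    ]

print(stdlib_accepts_lookbehind)
print(spans_from_first_x(text))
# ['x one something', 'x one something and another something']
```

Unlike the lookbehind version, which returns only the 'something' matches, this sketch returns the full spans asked for earlier in the thread.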

Original comment by jcerr...@gmail.com on 22 May 2011 at 5:03

GoogleCodeExporter commented 9 years ago
A regex supports greedy match ".*" and lazy match ".*?" (lazy match was a later 
addition). I don't know of a regex implementation which supports what you're 
asking for. There are also the implementation details to work out...

How much demand would there be for it, anyway?

Although it's a form of pattern matching, and regex is pattern matching, it's 
not really a regex kind of thing.

Original comment by re...@mrabarnett.plus.com on 22 May 2011 at 5:20

GoogleCodeExporter commented 9 years ago
Yeah, I don't know how much demand there would be for this. And I already 
solved what I needed with the variable-length lookbehind, which seems to be 
working fine.

I also understand about the additional complexity of the implementation. 
Without knowing how it's currently implemented, I can imagine moving forward 
one step after every match must simplify the implementation.

Just to be clear, my only problem was that when I saw the availability of the 
'overlapped=True' flag, I thought it was reasonable to assume it would also 
find overlapping matches that start on the same character. Here is a much 
simpler example: take the string 'abb' and the pattern 'a.*b'. 
'ab' and 'abb' are both valid, overlapping matches imho.
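
The behaviour I had in mind can be sketched by brute force with the standard-library re module: try every (start, end) pair and keep the ones the pattern matches in full. The helper name findall_every_end is hypothetical, and this is only a sketch of the idea, not something regex or re provides. It makes a quadratic number of match attempts, so it is only reasonable for short strings.

```python
import re

def findall_every_end(pattern, string):
    """Hypothetical helper: collect every substring that the pattern
    matches completely, including matches sharing a start position."""
    compiled = re.compile(pattern)
    found = []
    for start in range(len(string) + 1):
        for end in range(start, len(string) + 1):
            # fullmatch with pos/endpos tests exactly string[start:end]
            # without slicing the string.
            if compiled.fullmatch(string, start, end) is not None:
                found.append(string[start:end])
    return found

print(findall_every_end(r"a.*b", "abb"))
# ['ab', 'abb']
```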

I'm not pushing hard for any change or implying demand here, just trying to 
clarify what my confusion was, in case it helps other potentially confused 
users :-)

Original comment by jcerr...@gmail.com on 22 May 2011 at 5:27