jamadden / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg

Problems fuzzy-matching on long strings #64

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I think I have found a problem with fuzzy matching on long strings.
Below, if I prepend 5000 spaces to a string, matching works correctly. However,
if I prepend 50 million, it fails silently.
Is this a bug, or am I misunderstanding something?

-- What steps will reproduce the problem?

# run the following:
In [183]: a = 'In Out _ __ ___ __builtin__ __builtins__ __doc__ __name__ __package__ _dh _exit_code _i _i1 _i2 _i3 _i4 _i5 _ih _ii _iii _oh _sh exit get_ipython help quit regex'

In [184]: re.findall('(pkage){i<=5}.*', " "*int(5e3)+a, re.U)
Out[184]: ['package']

In [185]: re.findall('(pkage){i<=5}.*', " "*int(5e8)+a, re.U)
Out[185]: []

- Expected
Expected output in both cases is the match: 'package'

- Versions:
Adding regex 0.1.20120209 to easy-install.pth file

Installed 
/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/regex-0.1.20120209-py2.7-macosx-10.5-i386.egg

On Mac OS X Lion

Original issue reported on code.google.com by mhis...@gmail.com on 22 Feb 2012 at 3:57

GoogleCodeExporter commented 9 years ago
Sorry, to avoid confusion, let me spell out something that may be obvious. The
above snippet requires:

import regex as re

Original comment by mhis...@gmail.com on 22 Feb 2012 at 10:33

GoogleCodeExporter commented 9 years ago
Fixed in regex 0.1.20120301.

I'd also like to point out that 5e8 is 500 million, so it's going to take a
while to find a match! :-)

Original comment by re...@mrabarnett.plus.com on 1 Mar 2012 at 9:08
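
For reference, the two sizes from the report can be checked without building the strings. This is only a small arithmetic sketch; the variable names are illustrative and not from the thread.

```python
# Sizes used in the report, converted from float literals to ints.
small = int(5e3)
large = int(5e8)

print(small)           # 5000
print(large)           # 500000000
print(large // small)  # 100000
```

So the failing input is 100,000 times longer than the one that matched.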

GoogleCodeExporter commented 9 years ago
Thanks!  I will test this in the next few days.

5e8 = 500 million, yes.  But I bet the new code is faster than my workaround - 
running regex.search over 1000-byte chunks of 5e8 chars!

Original comment by mhis...@gmail.com on 2 Mar 2012 at 6:12
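
The chunked workaround mentioned above might look roughly like the sketch below. It is not the reporter's actual code: the function name `chunked_search`, the `overlap` parameter, and the use of the standard `re` module as a stand-in for `regex` are all assumptions made for illustration. The overlap is there because a match that straddles a chunk boundary would otherwise be missed.

```python
import re


def chunked_search(pattern, text, chunk_size=1000, overlap=100):
    """Scan `text` in overlapping windows and return the first hit.

    Returns (start, end, matched_text) with offsets relative to the
    full text, or None if no window contains a match.
    Hypothetical helper, not part of re or regex.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    compiled = re.compile(pattern)
    step = chunk_size - overlap
    for offset in range(0, max(len(text), 1), step):
        window = text[offset:offset + chunk_size]
        m = compiled.search(window)
        if m:
            # Translate window-relative offsets back to the full text.
            return offset + m.start(), offset + m.end(), m.group(0)
    return None


text = " " * 5000 + "__name__ __package__ _dh"
print(chunked_search("package", text))
```

Two caveats apply to this kind of workaround: a match longer than `overlap` can still be cut off at a window edge, and an open-ended pattern such as `.*` will only ever see the current window rather than the rest of the string.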

GoogleCodeExporter commented 9 years ago

In my first comment, the re.findall line (using ver 20120209) was:

In [184]: re.findall('(pkage){i<=5}.*', " "*int(5e3)+a, re.U)
Out[184]: ['package']

Now, with ver 20120301, I get:
In [24]: re.findall('(pkage){i<=5}.*', " "*int(5e3)+a, re.U)
Out[24]: [' __package']

So it looks like it was previously anchoring to the first character of the 
string (i.e. not allowing insertions before the first char).

I'm sorry I wasn't able to go back and check this with the 20120209 version; I 
might have made a mistake in my first comment.

Either behavior works for me; I'm just letting you know it seems to have 
changed.

Original comment by mhis...@gmail.com on 2 Mar 2012 at 9:28

GoogleCodeExporter commented 9 years ago
The change is mentioned in the PyPI page.

I'm going to add an option to adjust the match which was found (i.e. attempt to 
improve the fit). The problem I have is thinking of a suitable name for the 
option; it can't be called, say, ADJUSTMATCH or IMPROVEFIT because the letters 
"a" and "i" are already being used...

Original comment by re...@mrabarnett.plus.com on 2 Mar 2012 at 9:49

GoogleCodeExporter commented 9 years ago
I've added the ENHANCEMATCH flag to improve a first-match.

New in regex 0.1.20120303.

Original comment by re...@mrabarnett.plus.com on 3 Mar 2012 at 4:14
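
A minimal sketch of how the new flag could be tried against the pattern from this issue, assuming a version of the third-party `regex` module that has `ENHANCEMATCH` (0.1.20120303 or later, per the comment above). The helper name `compare_fuzzy` is hypothetical; it just runs the same search with and without the flag so the two results can be compared.

```python
def compare_fuzzy(pattern, text):
    """Return (plain, enhanced) findall results for a fuzzy pattern.

    Hypothetical helper for illustration. `regex` is imported lazily
    because it is a third-party package (pip install regex).
    """
    import regex

    plain = regex.findall(pattern, text)
    enhanced = regex.findall(pattern, text, flags=regex.ENHANCEMATCH)
    return plain, enhanced


# Example call (needs the regex package installed):
#   a = "__doc__ __name__ __package__ _dh"
#   plain, enhanced = compare_fuzzy("(pkage){i<=5}.*", " " * 5000 + a)
#   print(plain, enhanced)
```

No expected output is shown here, since what each call returns depends on the installed version of the module.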

GoogleCodeExporter commented 9 years ago
Thank you - good work on this library, it's working great for me.

I'm late to the naming party but
OPTIMIZE-
REFINE-
SEARCH-(IMPROVEFIT)
TRY-(IMPROVEFIT)
all work too - though I just noticed R,S,T already are taken.  

Original comment by mhis...@gmail.com on 3 Mar 2012 at 6:26

GoogleCodeExporter commented 9 years ago
The only reason that the regex library has "TEMPLATE" is that the re module has
it and there might be some code out there that uses it, although I've never
seen any such code, and I'm not sure that the re module really does much with
it either, as far as I recall.

Original comment by re...@mrabarnett.plus.com on 3 Mar 2012 at 6:56