See http://laurikari.net/tre/
Original comment by Bill.Jan...@gmail.com
on 3 Jun 2011 at 9:02
I'm not entirely sure how easy this would be, although I have come up with a
few ideas.
Could you provide me with some test data, including regex(es)?
Original comment by re...@mrabarnett.plus.com
on 8 Jun 2011 at 8:12
The code for TRE is under a "two-clause BSD licence", whatever that means. But
the approximate matching code is in a single C file there. Could it be lifted?
Original comment by ditsta...@gmail.com
on 9 Jun 2011 at 1:28
There are a lot of tests as part of TRE; would they be enough, assuming you
use the TRE syntax?
Bill
/*
 * Approximate matching tests.
 *
 * The approximate matcher always searches for the best match, and returns
 * the leftmost and longest one if there are several best matches.
 */
test_comp("(fou){# ~1}", REG_EXTENDED, 0);
test_comp("(fuu){#}", REG_EXTENDED, 0);
test_comp("(fuu){# ~}", REG_EXTENDED, 0);
test_comp("(anaconda){ 1i + 1d < 1, #1}", REG_EXTENDED, 0);
test_comp("(anaconda){ 1i + 1d < 1 #1 ~10 }", REG_EXTENDED, 0);
test_comp("(anaconda){ #1, ~1, 1i + 1d < 1 }", REG_EXTENDED, 0);
test_comp("(znacnda){ #1 ~3 1i + 1d < 1 }", REG_EXTENDED, 0);
test_exec("molasses anaconda foo bar baz smith anderson ",
          0, REG_NOMATCH);
test_comp("(znacnda){ #1 ~3 1i + 1d < 2 }", REG_EXTENDED, 0);
test_exec("molasses anaconda foo bar baz smith anderson ",
          0, REG_OK, 9, 17, 9, 17, END);
test_comp("(ananda){ 1i + 1d < 2 }", REG_EXTENDED, 0);
test_exec("molasses anaconda foo bar baz smith anderson ",
          0, REG_NOMATCH);
test_comp("(fuu){ +3 -3 ~5}", REG_EXTENDED, 0);
test_exec("anaconda foo bar baz smith anderson",
          0, REG_OK, 9, 10, 9, 10, END);
test_comp("(fuu){ +2 -2 ~5}", REG_EXTENDED, 0);
test_exec("anaconda foo bar baz smith anderson",
          0, REG_OK, 9, 10, 9, 10, END);
test_comp("(fuu){ +3 -3 ~}", REG_EXTENDED, 0);
test_exec("anaconda foo bar baz smith anderson",
          0, REG_OK, 9, 10, 9, 10, END);
test_comp("(laurikari){ #3, 1i + 1d < 3 }", REG_EXTENDED, 0);

/* No cost limit. */
test_comp("(foobar){~}", REG_EXTENDED, 0);
test_exec("xirefoabralfobarxie", 0, REG_OK, 11, 16, 11, 16, END);

/* At most two errors. */
test_comp("(foobar){~2}", REG_EXTENDED, 0);
test_exec("xirefoabrzlfd", 0, REG_OK, 4, 9, 4, 9, END);
test_exec("xirefoabzlfd", 0, REG_NOMATCH);

/* At most two inserts or substitutions and max two errors total. */
test_comp("(foobar){+2#2~2}", REG_EXTENDED, 0);
test_exec("oobargoobaploowap", 0, REG_OK, 5, 11, 5, 11, END);

/* Find best whole word match for "foobar". */
test_comp("\\<(foobar){~}\\>", REG_EXTENDED, 0);
test_exec("zfoobarz", 0, REG_OK, 0, 8, 0, 8, END);
test_exec("boing zfoobarz goobar woop", 0, REG_OK, 15, 21, 15, 21, END);

/* Match whole string, allow only 1 error. */
test_comp("^(foobar){~1}$", REG_EXTENDED, 0);
test_exec("foobar", 0, REG_OK, 0, 6, 0, 6, END);
test_exec("xfoobar", 0, REG_OK, 0, 7, 0, 7, END);
/*
This currently fails.
test_exec("foobarx", 0, REG_OK, 0, 7, 0, 7, END);
*/
test_exec("fooxbar", 0, REG_OK, 0, 7, 0, 7, END);
test_exec("foxbar", 0, REG_OK, 0, 6, 0, 6, END);
test_exec("xoobar", 0, REG_OK, 0, 6, 0, 6, END);
test_exec("foobax", 0, REG_OK, 0, 6, 0, 6, END);
test_exec("oobar", 0, REG_OK, 0, 5, 0, 5, END);
test_exec("fobar", 0, REG_OK, 0, 5, 0, 5, END);
test_exec("fooba", 0, REG_OK, 0, 5, 0, 5, END);
test_exec("xfoobarx", 0, REG_NOMATCH);
test_exec("foobarxx", 0, REG_NOMATCH);
test_exec("xxfoobar", 0, REG_NOMATCH);
test_exec("xfoxbar", 0, REG_NOMATCH);
test_exec("foxbarx", 0, REG_NOMATCH);

/* At most one insert, two deletes, and three substitutions.
   Additionally, deletes cost two and substitutes one, and total
   cost must be less than 4. */
test_comp("(foobar){+1 -2 #3, 2d + 1s < 4}", REG_EXTENDED, 0);
test_exec("3oifaowefbaoraofuiebofasebfaobfaorfeoaro",
          0, REG_OK, 26, 33, 26, 33, END);

/* Partially approximate matches. */
test_comp("foo(bar){~1}zap", REG_EXTENDED, 0);
test_exec("foobarzap", 0, REG_OK, 0, 9, 3, 6, END);
test_exec("fobarzap", 0, REG_NOMATCH);
test_exec("foobrzap", 0, REG_OK, 0, 8, 3, 5, END);

test_comp("^.*(dot.org){~}.*$", REG_EXTENDED, 0);
test_exec("www.cnn.com 64.236.16.20\n"
          "www.slashdot.org 66.35.250.150\n"
          "For useful information, use www.slashdot.org\n"
          "this is demo data!\n",
          0, REG_OK, 0, 120, 93, 100, END);

/* Approximate matching and back referencing cannot be used together. */
test_comp("(foo{~})\\1", REG_EXTENDED, REG_BADPAT);
Original comment by Bill.Jan...@gmail.com
on 9 Jun 2011 at 4:36
I took the TRE test code and turned it into Python. You can adapt it a bit
for regex.
Bill
Original comment by Bill.Jan...@gmail.com
on 9 Jun 2011 at 5:42
I've now re-cast the tests as a function you could drop into test_regex.py. If
the "tre" module is installed, it will use that; if not, it will try to use
regex instead to run the tests.
Original comment by Bill.Jan...@gmail.com
on 11 Jun 2011 at 7:16
Whoops -- found a bug (by inspection) in the regex-test code branch. Updated.
Original comment by Bill.Jan...@gmail.com
on 11 Jun 2011 at 7:19
What do you think about the notation in the {...}? Is it clear?
One idiosyncrasy I don't like is that in this:
1i + 1d < 1
the maximum cost is 1, but it uses "<", not "<=".
On another issue, it looks like a pattern which is going to be used for
approximate matching would need to be specially marked (a flag and/or presence
of {...} notation) because certain tricks to improve the speed of exact
matching aren't possible for approximate matching, and I don't think the user
would want to lose those speedups when most matching is exact. Another
possibility is for an unmarked pattern to be recompiled automatically for
approximate matching on demand.
Original comment by re...@mrabarnett.plus.com
on 16 Jun 2011 at 1:27
> One idiosyncrasy I don't like is that in this:
>
> 1i + 1d < 1
>
> the maximum cost is 1, but it uses "<", not "<=".
Good point. Since it was developed as a C library, I suspect there wasn't a
lot of Pythonic thinking involved.
> What do you think about the notation in the {...}? Is it clear?
I guess the question is, can we improve on it? I've been using TRE for a bit,
so it's clear to me, but perhaps there is a better way of saying these things.
> On another issue, it looks like a pattern which is going to be used for
> approximate matching would need to be specially marked (a flag and/or
> presence of {...} notation) because certain tricks to improve the speed
> of exact matching aren't possible for approximate matching, and I don't
> think the user would want to lose those speedups when most matching is
> exact.
Yep. That seems appropriate to me. Presence of {...} notation would seem to be
the preferable way of doing it.
Bill
Original comment by Bill.Jan...@gmail.com
on 16 Jun 2011 at 4:37
I discovered that it _is_ using "<" correctly.
I've decided to allow both "<" and "<=", with their usual meanings.
This is TRE: "(foobar){+1 -2 #3, 2d + 1s < 4}"
This will be regex: "(foobar){i <= 1, d <= 2, s <= 3, 2d + 1s < 4}"
My latest build passes the tests, although it isn't as fast at approximate
matching, and it's too prone to catastrophic backtracking. I'm going to see
whether I can adapt the existing safeguards used with repeats.
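As a rough illustration (a sketch only - it assumes a build that accepts the
new syntax, and the pattern is the one quoted above):

import regex

# At most 1 insertion, 2 deletions and 3 substitutions, with deletions
# weighted 2, substitutions weighted 1, and the weighted total below 4.
p = regex.compile("(foobar){i <= 1, d <= 2, s <= 3, 2d + 1s < 4}")

# "foxbar" is one substitution from "foobar" (weighted cost 1), so it
# should match; "fxxxxr" needs four substitutions, so it shouldn't.
print(p.search("foxbar"))
print(p.search("fxxxxr"))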
Original comment by re...@mrabarnett.plus.com
on 23 Jun 2011 at 7:37
The regex version certainly seems easier to read... a good Pythonic quality
:-).
Original comment by Bill.Jan...@gmail.com
on 24 Jun 2011 at 4:20
After making some modifications and adding a simple heuristic, I've got fuzzy
matching to work acceptably, but I'd be interested in any further test cases I
can try which are closer to real-world use-cases before the next release.
Restriction: fuzzy matching doesn't work on group references or named lists
(but that's not to say that they never will).
Original comment by re...@mrabarnett.plus.com
on 25 Jun 2011 at 2:57
I downloaded the latest version from PyPI (regex-0.1.20110623a), but I get an
error when I try to use it:
p = regex.compile('(foobar){i <= 1, d <= 2, s <= 3, 2d + 1s < 4}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.5/site-packages/regex.py", line 266, in compile
return _compile(pattern, flags, kwargs)
File "/Library/Python/2.5/site-packages/regex.py", line 373, in _compile
parsed = parse_pattern(source, info)
File "/Library/Python/2.5/site-packages/_regex_core.py", line 300, in parse_pattern
branches = [parse_sequence(source, info)]
File "/Library/Python/2.5/site-packages/_regex_core.py", line 319, in parse_sequence
item = parse_item(source, info)
File "/Library/Python/2.5/site-packages/_regex_core.py", line 354, in parse_item
if not subpattern or min_count == max_count == 1:
NameError: global name 'subpattern' is not defined
>>>
Original comment by Bill.Jan...@gmail.com
on 27 Jun 2011 at 3:47
Fuzzy matching wasn't in that release. You're the only person to have reported
a problem with it! (Definitely a bug, though...)
New release (regex-0.1.20110627), _with_ fuzzy matching, now on PyPI.
If all goes well, I'll be adding fuzzy named lists and group references in the
next release.
Original comment by re...@mrabarnett.plus.com
on 27 Jun 2011 at 4:14
Excellent. OK, let me try it out, and I'll see if I can come up with more test
cases.
This kind of thing is heavily used in text mining applications over OCR'd
document sets. Mainly legal and medical, as far as I know. The legal use case
is scanning a huge discovery collection for specific terms of art, and for
specific named parties (which is where fuzzy named lists would come in).
Unfortunately, it's hard to get real-world data to try things on, because legal
and medical documents are typically privileged info.
Original comment by Bill.Jan...@gmail.com
on 27 Jun 2011 at 5:35
I have a number of lists of real-world addresses obtained via OCR. I'm trying
to figure out how to turn them into a test for the fuzzy matching. Any ideas?
Original comment by Bill.Jan...@gmail.com
on 29 Jun 2011 at 4:20
No, no ideas.
BTW, I'm currently testing fuzzy named lists and group references for the next
release.
Original comment by re...@mrabarnett.plus.com
on 30 Jun 2011 at 12:08
The new release (regex-0.1.20110702) supports fuzzy named lists and group
references.
Original comment by re...@mrabarnett.plus.com
on 2 Jul 2011 at 5:14
I'm back.
One of the uses for fuzzy named lists is finding specific cities, streets, etc.
in OCR'd lists of addresses. I'll whomp up a test case.
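Something along these lines, perhaps (a hypothetical sketch - the list name
and the data are invented):

import regex

# Search OCR'd text for any of the listed city names, allowing one
# error per name to absorb a typical misread character.
text = "123 Main St, Bostn MA"
m = regex.search(r"\b(\L<cities>){e<=1}\b", text,
                 cities=["Boston", "Seattle", "Chicago"])
# "Bostn" is one deletion away from "Boston", so this should match.
print(m.group() if m else "no match")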
Original comment by Bill.Jan...@gmail.com
on 7 Jul 2011 at 7:26
I just tried this interesting feature and would like to ask about the handling
of the fuzzy patterns.
Am I missing something, or is there some kind of context sensitivity in the
matching? Are the starting/ending parts of the string possibly treated in
some special way?
Is there a specific order or hierarchy in which the error types are
considered (e.g. in an unqualified ...{e} pattern)?
I tried to figure something out via some naive trials (I confess, I wasn't
able to comprehend the matching behaviour from the sources...).
>>> regex.findall("F{e}", "Fabcd")
['F', 'a', 'b', 'c', 'd', '']
>>> regex.findall("F{e}", "abFcd")
['F', 'c', 'd', '']
>>> regex.findall("F{e}", "abcdF")
['F', '']
Probably, the presence of a substring matching the pattern exactly limits the
matching of others (preceding it?). Cf. the following with the above examples:
>>> regex.findall("F{e}", "abcd")
['a', 'b', 'c', 'd', '']
It appears that towards the end of the string there is sometimes more chance
to match (even within the same substring - consistent with the above).
>>> regex.findall("F{e}", "abFcd abFcd")
['F', 'F', 'c', 'd', '']
>>>
Would it be possible to specify the matching resolution in more detail in
order to make the fuzziness more verifiable/predictable?
Many thanks for the continuing enhancements; now with the Unicode properties
and fuzzy matching, I believe recursion, code embedding, and maybe case
conversion on sub() are probably the only remaining unsupported features
present in some other implementations :-) - not that I would want to use them
too often (maybe except for the last one...).
Regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 21 Jul 2011 at 12:26
Fuzzy matching applies to the preceding item in the same way as the
quantifiers do, so "F{e}" matches "F" with any number of errors permitted; it
could therefore match _anything_. However, it will always try to find the
best match, the match with the lowest cost.
regex.search("F{e}", "abFcd") will match "F" because that's the best match.
regex.findall performs multiple searches, each starting where the previous
match finished, so with regex.findall("F{e}", "abFcd"):
1. It searches for the best match from the start, finding "F" (exact match,
cost is 0).
2. It searches for the best match after "F", finding "c" (1 substitution, cost
is 1).
3. It searches for the best match after "c", finding "d" (1 substitution, cost
is 1).
4. It searches for the best match after "d", finding "" (1 deletion, cost is 1).
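That stepping can be pictured with a small driver (a conceptual sketch only,
not the module's actual implementation):

import regex

def findall_sketch(pattern, text):
    # Chain searches, each starting where the previous match finished.
    results = []
    p = regex.compile(pattern)
    pos = 0
    while pos <= len(text):
        m = p.search(text, pos)
        if m is None:
            break
        results.append(m.group())
        # Step past the match; advance one extra character after an
        # empty match so that the loop terminates.
        pos = m.end() + (1 if m.end() == m.start() else 0)
    return results

# findall_sketch("F{e}", "abFcd") would reproduce ['F', 'c', 'd', '']
# by exactly the four steps listed above.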
Surprising? Possibly. But the TRE regex engine finds the best match, and this
implementation copies that behaviour.
An alternative behaviour would be to match anything whose cost is within the
limits, but that can have surprising results too.
Given the regex "(cat){e<=1}" and the string "dog cat", it would find " cat"
because " cat" comes before "cat" and has only 1 error.
If regex.findall had the alternative behaviour, then regex.search would have it
too, but usually you want to find the best match, so returning " cat" when
"cat" is better would also be surprising.
Original comment by re...@mrabarnett.plus.com
on 21 Jul 2011 at 2:04
Thanks for the clarification; it was indeed kind of surprising to me that
equally inexact elements were matched (or not) depending on their position
with respect to the best match. I think the alternative behaviour would be
less surprising (to me), but if it is the behaviour of other engines
supporting this feature, it might be better to stick with it.
Normally one would use a more complex pattern, possibly with error weights,
defined word boundaries etc., which would limit the "surprises" quite a bit.
vbr
Original comment by Vlastimil.Brom@gmail.com
on 21 Jul 2011 at 2:51
Just one more thought, as I see the notes on OCR checks and similar, and
given the context-dependent behaviour - is it somehow unexpected or
discouraged to use fuzzy matching with findall() or finditer(), as search()
is supposed to give the first "optimal" result? (In the mentioned OCR case,
one would probably test separate words against dictionary items and not
within a larger text.)
It was indeed unclear to me, when I saw "any number of errors" in the spec
for, say, "F", how this is resolved "in the face of ambiguity".
Would it eventually be possible to match anything within the given
constraints, also checking the overlaps, and maybe expose the results in
ascending order of "cost"? (Maybe a bit similar to the "overlapped" parameter
or match.captures(...), which also expose rather "internal" steps of the
matching.)
But anyway, if the alternatives also have drawbacks (like performance etc.)
and would differ from other implementations, it's probably better to stick
with the usual way.
It's rather me trying to understand the matching rules for use on more or
less known data (in contrast to OCR or spell checking and the like, where
some randomness is inherent and it therefore probably doesn't hurt to have
some kind of irregularity in the matching too).
Regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 21 Jul 2011 at 6:49
Original comment by re...@mrabarnett.plus.com
on 6 Aug 2011 at 3:29
Sorry for another question regarding fuzzy matching; it seems I'm still
missing something topical about this feature.
I tried to make the matches more consistent for the whole text (i.e. in order
not to mask suboptimal valid results preceding a better match - cf. comments
20, 21 here). This can be quite straightforward: simply repeat the search in
the substring before the first match of the previous search and collect the
results (the obvious drawback is the possibly irregular handling of anchors
and lookarounds).
However, I was surprised by another issue - some matches I'd have expected
are also missing after the best match, despite supposedly complying with the
error cost:
>>> import regex
>>> txt = "ab ac ad ae ba bc bd be ca cb cd ce da db dc de ea eb ec ed"
>>> regex.findall(r"\bad{e<=1}\b", txt)
['ad', 'ae']
(with my custom findall function I can also get the "words" preceding "ad":
['ab', 'ac', 'ad', 'ae']
but there are still others missing; the following pattern only emulates
substitutions via alternation, as insertions and deletions aren't relevant in
this sample, given the word boundaries:
>>> regex.findall(r"\ba.|.d\b", txt)
['ab', 'ac', 'ad', 'ae', 'bd', 'cd', 'ed']
>>>
Is there maybe some left-to-right bias in the algorithm, or am I missing
something else?
Possibly related to this, how are empty strings matched? E.g.:
>>> regex.findall(r"b{e<=1}", "abcd")
['b', 'c', 'd', '']
Why are there no empty matches between the characters (all of them would cost
one deletion), but only the last one?
Thanks in advance,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 10 Aug 2011 at 11:09
The regex r"\bad{e<=1}\b" is:
\b boundary
a literal "a"
d{e<=1} literal "d" with maximum of 1 error
\b boundary
In the first example, the best match in the text is "ad", and the best match in
the remainder is "ae". There are no further matches which meet the constraint.
It always looks for the best match in (the remainder of) the text.
I think I'll have to change the current behaviour (it seems to be too
confusing) and add a BEST flag for when you want the best match, unless Bill
(the original requester) or anyone else has an objection.
To summarise the differences:
Example 1:
txt = "ab ac ad ae ba bc bd be ca cb cd ce da db dc de ea eb ec ed"
regex.findall(r"\bad{e<=1}\b", txt)
Current: ['ad', 'ae']
Proposed: ['ab', 'ac', 'ad', 'ae']
Example 2:
regex.findall(r"b{e<=1}", "abcd")
Current: ['b', 'c', 'd', '']
Proposed: ['a', 'b', 'c', 'd', '']
Original comment by re...@mrabarnett.plus.com
on 11 Aug 2011 at 1:16
Thanks for the detailed answer; now I notice that a part of my surprise was
obviously due to the missing parentheses, i.e. I meant:
>>> txt = "ab ac ad ae ba bc bd be ca cb cd ce da db dc de ea eb ec ed"
>>> regex.findall(r"\b(ad){e<=1}\b", txt)
['ad', 'ae', 'bd', 'cd', 'ed']
to be roughly equivalent to:
>>> regex.findall(r"\ba.|.d\b", txt)
['ab', 'ac', 'ad', 'ae', 'bd', 'cd', 'ed']
(given the text and the pattern; with the proposed change to also match before
the "optimal" match).
My naive approach is something like the following (sorry for the style and
coding - there must be some more elegant ways to deal with the duplication of
the function arguments etc.):
# # # # # # # # # # # # # # # # # # #
def fuzzy_findall(pattern, string, flags=0, pos=None, endpos=None,
                  overlapped=False, concurrent=None, **kwargs):
    tmp_results = []
    string = string[0:endpos]  # to replace endpos, which is used in another way
    tmp_end_pos = len(string)
    while tmp_end_pos:
        tmp_results.append(regex.findall(pattern=pattern, string=string,
                                         flags=flags, pos=pos,
                                         endpos=tmp_end_pos,
                                         overlapped=overlapped,
                                         concurrent=concurrent, **kwargs))
        m = regex.search(pattern=pattern, string=string, flags=flags, pos=pos,
                         endpos=tmp_end_pos, overlapped=overlapped,
                         concurrent=concurrent, **kwargs)
        if m:
            tmp_end_pos = m.start()
        else:
            break
    tmp_results.reverse()
    results = [result for sublist in tmp_results for result in sublist]
    return results
# # # # # # # # # # # # # # # # # #
using this function I get e.g.
>>> fuzzy_findall(r"\b(ad){e<=1}\b", txt)
['ab', 'ac', 'ad', 'ae', 'bd', 'cd', 'ed']
but
>>> print fuzzy_findall(r"c{e<=1}", "abcd")
['a', 'b', '', 'c', 'd', '']
with the empty match for the end of the substring in the second trial-match
before "c".
I actually had expected empty matches between all the single characters
['', 'a', '', 'b', '', 'c', '', 'd', '']
or is it again some misunderstanding on my part?
Of course, native behaviour in the regex engine would be much preferable to
such a fragile wrapper function (also because of the mentioned anchors,
lookarounds etc.).
Many thanks for your efforts,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 11 Aug 2011 at 2:02
After the proposed change:
>>> txt = "ab ac ad ae ba bc bd be ca cb cd ce da db dc de ea eb ec ed"
>>> regex.findall(r"\b(ad){e<=1}\b", txt)
['ab', 'ac', 'ad', 'ae', 'bd', 'cd', 'ed']
>>> regex.findall(r"c{e<=1}", "abcd")
['a', 'b', 'c', 'd', '']
In the second example:
matches "a" with 1 error
matches "b" with 1 error
matches "c" with 0 errors
matches "d" with 1 error
matches "" at the end with 1 error
Substitution is tried first, then insertion, then deletion.
Original comment by re...@mrabarnett.plus.com
on 11 Aug 2011 at 2:22
Thanks, the proposed behaviour looks really promising and (at least for me)
much more consistent.
The deletion being found only at the end, and not throughout the whole string
at character boundaries, perhaps shows some asymmetry, but your explanation
makes it clear why, and these empty matches are probably not likely to be
useful anyway.
regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 11 Aug 2011 at 7:35
The regex module now looks for the first match.
Use the BESTMATCH flag to look for the best match.
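For example (a quick sketch, reusing the "dog cat" example from earlier in
the thread):

>>> regex.search(r"(cat){e<=1}", "dog cat").group()   # first match
' cat'
>>> regex.search(r"(cat){e<=1}", "dog cat", flags=regex.BESTMATCH).group()
'cat'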
Original comment by re...@mrabarnett.plus.com
on 27 Sep 2011 at 10:26
Lately, I have used fuzzy matching for some error checking and it proved to
be very helpful; however, I'd like to better understand the fuzzy matching
process and its rules, to be able to reflect them in the patterns.
(Unfortunately, I was not able to learn much from the source, as most of the
"magic" apparently happens in the C code - or was it Python beyond my
competence level? :-)
Comment 28 above clarifies the order of the error types being checked:
"Substitution is tried first, then insertion, then deletion."
Now, I would be especially interested in how the segmentation of the matches
is designed if a large or unlimited number of errors is allowed.
It seems likely that the matches without any error are taken as a base, which
are then tried for the variance defined in the cost pattern.
Is it the case that the match possibilities are determined first, and the
allowed errors don't change/shift these "boundaries" for possible matches?
It seems that, e.g., a quantifier allowing possibly longer matches somehow
"eats up" the possibilities for checking within the error cost:
>>> regex.findall(r"X{e}","abc")
['a', 'b', 'c', '']
>>> regex.findall(r"(X+){e}","abc")
[]
>>> regex.findall(r"(a+){e}","abc")
['a']
>>>
Does the algorithm work this way, or am I misinterpreting these simplified
results?
Just another thing I observed - without parentheses (which seem to be
preferable in most cases anyway), the cost pattern between {} seems to apply
to the preceding "literal" only, e.g. also excluding quantifiers (?):
>>> regex.findall(r"a{e}","abc")
['a', 'b', 'c', '']
>>> regex.findall(r"a{1}{e}","abc")  # the cost is possibly applied to the empty string after the quantifier (?)
[]
>>> regex.findall(r"(a{1}){e}","abc")
['a', 'b', 'c', '']
>>>
As an interesting and probably useless corner case, I found that an empty
pattern matches the empty string at character boundaries without the cost
pattern, but matches nothing when any errors are allowed:
>>> regex.findall(r"","abc")
['', '', '', '']
>>> regex.findall(r"{e}","abc")
[]
Is it because there is maybe no useful text for the error checking to begin
with?
However, a similar zero-width pattern matches identically:
>>> regex.findall(r"Q{0}","abc")
['', '', '', '']
>>> regex.findall(r"(Q{0}){e}","abc")
['', '', '', '']
>>>
This is meant purely as a question about the matching behaviour; in practice
I am very happy with the present functionality. I just feel that I could use
it even better with an appropriate understanding of the "segmentation" of the
possible matches tried. I would simply like to understand how many
"erroneous" cases I can match with this feature (not just in the extreme case
of {e}, which would conceptually match anything, but obviously doesn't).
Many thanks in advance.
vbr
Original comment by Vlastimil.Brom@gmail.com
on 3 Oct 2011 at 9:40
As it tries to match, if a part of the pattern fails to match, for example, "a"
doesn't match "b", it assumes that there's an error, whether a substitution,
insertion or deletion, and then continues. It tries each type of error in
turn, checking whether the error constraints allow it.
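For example (a sketch of that logic using the released syntax):

>>> # Matching "cat" against "cut": 'c' matches, 'a' vs 'u' fails, so a
>>> # substitution (cost 1) is tried and allowed, and then 't' matches.
>>> regex.search(r"(cat){e<=1}", "cut").group()
'cut'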
The error constraint follows the same syntactic rule as the quantifiers,
applying to the preceding item. If there's no preceding item, it's treated as a
literal, for example, r"{e}" is treated the same as r"\{e}":
>>> regex.findall(r"{e}","abc{e}")
['{e}']
I do note that the quantifiers raise an exception if there's no preceding item,
so it could be argued that the error constraint should do the same.
Original comment by re...@mrabarnett.plus.com
on 3 Oct 2011 at 11:30
Thanks for the prompt answer;
the fallback to literal makes the relevant pattern clear, as does the analogy
to quantifiers in the other cases.
However, how do quantifiers and error costs work together?
It now seems to me that the two should be combined rather carefully for the
same subpattern (or is it discouraged altogether?).
In some elementary patterns (below) it seems that the quantifier somehow
hinders matching a (suboptimal) pattern (likely satisfying the general
constraint {e}) which is found if equivalent simple literals are used;
cf. e.g. (Q{1,2}){e} vs (Q){e}, (QQ){e} ...
>>> regex.findall(r"(a+){e}","abc")
['a']
>>> regex.findall(r"(Q+){e}","abc")
[]
>>> regex.findall(r"(Q){e}","abc")
['a', 'b', 'c', '']
>>> regex.findall(r"(QQ){e}","abc")
['ab', 'c', '']
>>> regex.findall(r"(QQQ){e}","abc")
['abc', '']
>>> regex.findall(r"(QQQQ){e}","abc")
['abc', '']
>>> regex.findall(r"(a{1,5}){e}","abc")
['a']
>>> regex.findall(r"(Q{1,1}){e}","abc")
['a', 'b', 'c', '']
>>> regex.findall(r"(Q{1,2}){e}","abc")
[]
I see that such patterns are not very realistic or useful, but I just feel I
should understand the matching possibilities better.
Regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 4 Oct 2011 at 12:04
That's a bug.
It optimises repeats of single-character matches (a character, a set, or a
property). Unfortunately, that doesn't work as you'd expect when using fuzzy
matching. :-(
Fixed in regex 0.1.20111004.
Original comment by re...@mrabarnett.plus.com
on 4 Oct 2011 at 1:26
Thank you very much; with the updated version (regex-0.1.20111004) the fuzzy
matching is a bit more predictable for me. However, I still don't understand,
e.g.:
>>> regex.findall(r"\m(?:[^b]+){e}\M", "abc cde dba")
['cde', 'dba']
>>> regex.findall(r"\m(?:[^b]+){e}\M", "abc dba cde dba")
['abc', 'cde', 'dba']
>>>
I'd think that all of the "words" would match with at most one error;
however, the more difficult aspect is the different handling of "abc" vs.
"dba", which both need one substitution to match.
Even less clear to me is the second example - adding "dba" lets the previous
"abc" match, but then only the second "dba" is matched, and not the one in
the 2nd word position.
Is this a peculiarity of the fuzzy matching or of the word-beginning/end
anchors?
vbr
Original comment by Vlastimil.Brom@gmail.com
on 4 Oct 2011 at 11:11
When it finds a fuzzy match it tries to see whether it can get a better fit (a
match with a lower cost) within that match.
The initial match for:
regex.findall(r"\m(?:[^b]+){e}\M", "abc cde dba")
is actually "abc cde dba" (the cost is 2), but a better fit is "abc cde" (the
cost is 1), and an even better fit is "cde" (the cost is 0).
This results in the final output of ['cde', 'dba'].
If it didn't look for a better fit the final output would be ["abc cde dba"].
Let me give you another example:
>>> regex.findall(r"\m(?:[a-z]+){e}\M", "abc cde dba")
# With "better fit":
['abc', 'cde', 'dba']
# Without "better fit":
['abc cde dba']
If I change your example slightly:
>>> regex.findall(r"\m(?:[^b ]+){e}\M", "abc cde dba")
# With "better fit":
['abc', 'cde', 'dba']
# Without "better fit":
['abc cde dba']
And that's my justification. :-)
Original comment by re...@mrabarnett.plus.com
on 5 Oct 2011 at 12:30
Thanks for the explanation; the recursive readjusting of the match for the
lowest error cost indeed makes most of the cases which weren't very clear to
me obvious.
Now I feel I may even grasp the fuzzy matching rules - sooner or later...
Thanks and regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 5 Oct 2011 at 1:25
Original issue reported on code.google.com by
Bill.Jan...@gmail.com
on 2 Jun 2011 at 8:33