See http://laurikari.net/tre/
Original comment by Bill.Jan...@gmail.com
on 3 Jun 2011 at 9:02
I'm not entirely sure how easy this would be, although I have come up with a
few ideas.
Could you provide me with some test data, including regex(es)?
Original comment by re...@mrabarnett.plus.com
on 8 Jun 2011 at 8:12
The code for TRE is under a "two-clause BSD licence", whatever that means. But
the approximate matching code is in a single C file there. Could it be lifted?
Original comment by ditsta...@gmail.com
on 9 Jun 2011 at 1:28
There are a lot of tests as part of TRE; would they be enough, assuming you
use the TRE syntax?
Bill
/*
 * Approximate matching tests.
 *
 * The approximate matcher always searches for the best match, and returns
 * the leftmost and longest one if there are several best matches.
 */
test_comp("(fou){# ~1}", REG_EXTENDED, 0);
test_comp("(fuu){#}", REG_EXTENDED, 0);
test_comp("(fuu){# ~}", REG_EXTENDED, 0);
test_comp("(anaconda){ 1i + 1d < 1, #1}", REG_EXTENDED, 0);
test_comp("(anaconda){ 1i + 1d < 1 #1 ~10 }", REG_EXTENDED, 0);
test_comp("(anaconda){ #1, ~1, 1i + 1d < 1 }", REG_EXTENDED, 0);
test_comp("(znacnda){ #1 ~3 1i + 1d < 1 }", REG_EXTENDED, 0);
test_exec("molasses anaconda foo bar baz smith anderson ",
          0, REG_NOMATCH);
test_comp("(znacnda){ #1 ~3 1i + 1d < 2 }", REG_EXTENDED, 0);
test_exec("molasses anaconda foo bar baz smith anderson ",
          0, REG_OK, 9, 17, 9, 17, END);
test_comp("(ananda){ 1i + 1d < 2 }", REG_EXTENDED, 0);
test_exec("molasses anaconda foo bar baz smith anderson ",
          0, REG_NOMATCH);
test_comp("(fuu){ +3 -3 ~5}", REG_EXTENDED, 0);
test_exec("anaconda foo bar baz smith anderson",
          0, REG_OK, 9, 10, 9, 10, END);
test_comp("(fuu){ +2 -2 ~5}", REG_EXTENDED, 0);
test_exec("anaconda foo bar baz smith anderson",
          0, REG_OK, 9, 10, 9, 10, END);
test_comp("(fuu){ +3 -3 ~}", REG_EXTENDED, 0);
test_exec("anaconda foo bar baz smith anderson",
          0, REG_OK, 9, 10, 9, 10, END);
test_comp("(laurikari){ #3, 1i + 1d < 3 }", REG_EXTENDED, 0);

/* No cost limit. */
test_comp("(foobar){~}", REG_EXTENDED, 0);
test_exec("xirefoabralfobarxie", 0, REG_OK, 11, 16, 11, 16, END);

/* At most two errors. */
test_comp("(foobar){~2}", REG_EXTENDED, 0);
test_exec("xirefoabrzlfd", 0, REG_OK, 4, 9, 4, 9, END);
test_exec("xirefoabzlfd", 0, REG_NOMATCH);

/* At most two inserts or substitutions and max two errors total. */
test_comp("(foobar){+2#2~2}", REG_EXTENDED, 0);
test_exec("oobargoobaploowap", 0, REG_OK, 5, 11, 5, 11, END);

/* Find best whole word match for "foobar". */
test_comp("\\<(foobar){~}\\>", REG_EXTENDED, 0);
test_exec("zfoobarz", 0, REG_OK, 0, 8, 0, 8, END);
test_exec("boing zfoobarz goobar woop", 0, REG_OK, 15, 21, 15, 21, END);

/* Match whole string, allow only 1 error. */
test_comp("^(foobar){~1}$", REG_EXTENDED, 0);
test_exec("foobar", 0, REG_OK, 0, 6, 0, 6, END);
test_exec("xfoobar", 0, REG_OK, 0, 7, 0, 7, END);
/*
This currently fails.
test_exec("foobarx", 0, REG_OK, 0, 7, 0, 7, END);
*/
test_exec("fooxbar", 0, REG_OK, 0, 7, 0, 7, END);
test_exec("foxbar", 0, REG_OK, 0, 6, 0, 6, END);
test_exec("xoobar", 0, REG_OK, 0, 6, 0, 6, END);
test_exec("foobax", 0, REG_OK, 0, 6, 0, 6, END);
test_exec("oobar", 0, REG_OK, 0, 5, 0, 5, END);
test_exec("fobar", 0, REG_OK, 0, 5, 0, 5, END);
test_exec("fooba", 0, REG_OK, 0, 5, 0, 5, END);
test_exec("xfoobarx", 0, REG_NOMATCH);
test_exec("foobarxx", 0, REG_NOMATCH);
test_exec("xxfoobar", 0, REG_NOMATCH);
test_exec("xfoxbar", 0, REG_NOMATCH);
test_exec("foxbarx", 0, REG_NOMATCH);

/* At most one insert, two deletes, and three substitutions.
   Additionally, deletes cost two and substitutes one, and total
   cost must be less than 4. */
test_comp("(foobar){+1 -2 #3, 2d + 1s < 4}", REG_EXTENDED, 0);
test_exec("3oifaowefbaoraofuiebofasebfaobfaorfeoaro",
          0, REG_OK, 26, 33, 26, 33, END);

/* Partially approximate matches. */
test_comp("foo(bar){~1}zap", REG_EXTENDED, 0);
test_exec("foobarzap", 0, REG_OK, 0, 9, 3, 6, END);
test_exec("fobarzap", 0, REG_NOMATCH);
test_exec("foobrzap", 0, REG_OK, 0, 8, 3, 5, END);

test_comp("^.*(dot.org){~}.*$", REG_EXTENDED, 0);
test_exec("www.cnn.com 64.236.16.20\n"
          "www.slashdot.org 66.35.250.150\n"
          "For useful information, use www.slashdot.org\n"
          "this is demo data!\n",
          0, REG_OK, 0, 120, 93, 100, END);

/* Approximate matching and back referencing cannot be used together. */
test_comp("(foo{~})\\1", REG_EXTENDED, REG_BADPAT);
Original comment by Bill.Jan...@gmail.com
on 9 Jun 2011 at 4:36
I took the TRE test code and turned it into Python. You can adapt it a bit
for regex.
Bill
Original comment by Bill.Jan...@gmail.com
on 9 Jun 2011 at 5:42
I've now re-cast the tests as a function you could drop into test_regex.py. If
the "tre" module is installed, it will use that; if not, it will try to use
regex instead to run the tests.
Original comment by Bill.Jan...@gmail.com
on 11 Jun 2011 at 7:16
Whoops -- found a bug (by inspection) in the regex-test code branch. Updated.
Original comment by Bill.Jan...@gmail.com
on 11 Jun 2011 at 7:19
What do you think about the notation in the {...}? Is it clear?
One idiosyncrasy I don't like is that in this:
1i + 1d < 1
the maximum cost is 1, but it uses "<", not "<=".
On another issue, it looks like a pattern which is going to be used for
approximate matching would need to be specially marked (a flag and/or presence
of {...} notation) because certain tricks to improve the speed of exact
matching aren't possible for approximate matching, and I don't think the user
would want to lose those speedups when most matching is exact. Another
possibility is for an unmarked pattern to be recompiled automatically for
approximate matching on demand.
Original comment by re...@mrabarnett.plus.com
on 16 Jun 2011 at 1:27
> One idiosyncrasy I don't like is that in this:
>
> 1i + 1d < 1
>
> the maximum cost is 1, but it uses "<", not "<=".
Good point. Since it was developed as a C library, I suspect there wasn't a
lot of Pythonic thinking involved.
> What do you think about the notation in the {...}? Is it clear?
I guess the question is, can we improve on it? I've been using TRE for a bit,
so it's clear to me, but perhaps there is a better way of saying these things.
> On another issue, it looks like a pattern which is going to be used for
> approximate matching would need to be specially marked (a flag and/or
> presence of {...} notation) because certain tricks to improve the speed
> of exact matching aren't possible for approximate matching, and I don't
> think the user would want to lose those speedups when most matching is
> exact.
Yep. That seems appropriate to me. Presence of {...} notation would seem to be
the preferable way of doing it.
Bill
Original comment by Bill.Jan...@gmail.com
on 16 Jun 2011 at 4:37
I discovered that it _is_ using "<" correctly.
I've decided to allow both "<" and "<=", with their usual meanings.
This is TRE: "(foobar){+1 -2 #3, 2d + 1s < 4}"
This will be regex: "(foobar){i <= 1, d <= 2, s <= 3, 2d + 1s < 4}"
My latest build passes the tests, although it isn't as fast at approximate
matching, and it's too prone to catastrophic backtracking. I'm going to see
whether I can adapt the existing safeguards used with repeats.
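As a rough illustration (a sketch only - it assumes a build that accepts the
new syntax, and the pattern is the one quoted above):

import regex

# At most 1 insertion, 2 deletions and 3 substitutions, with deletions
# weighted 2, substitutions weighted 1, and the weighted total below 4.
p = regex.compile("(foobar){i <= 1, d <= 2, s <= 3, 2d + 1s < 4}")

# "foxbar" is one substitution from "foobar" (weighted cost 1), so it
# should match; "fxxxxr" needs four substitutions, so it shouldn't.
print(p.search("foxbar"))
print(p.search("fxxxxr"))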
Original comment by re...@mrabarnett.plus.com
on 23 Jun 2011 at 7:37
The regex version certainly seems easier to read... a good Pythonic quality
:-).
Original comment by Bill.Jan...@gmail.com
on 24 Jun 2011 at 4:20
After making some modifications and adding a simple heuristic, I've got fuzzy
matching to work acceptably, but I'd be interested in any further test cases I
can try which are closer to real-world use-cases before the next release.
Restriction: fuzzy matching doesn't work on group references or named lists
(but that's not to say that they never will).
Original comment by re...@mrabarnett.plus.com
on 25 Jun 2011 at 2:57
I downloaded the latest version from PyPI (regex-0.1.20110623a), but I get an
error when I try to use it:
p = regex.compile('(foobar){i <= 1, d <= 2, s <= 3, 2d + 1s < 4}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.5/site-packages/regex.py", line 266, in compile
return _compile(pattern, flags, kwargs)
File "/Library/Python/2.5/site-packages/regex.py", line 373, in _compile
parsed = parse_pattern(source, info)
File "/Library/Python/2.5/site-packages/_regex_core.py", line 300, in parse_pattern
branches = [parse_sequence(source, info)]
File "/Library/Python/2.5/site-packages/_regex_core.py", line 319, in parse_sequence
item = parse_item(source, info)
File "/Library/Python/2.5/site-packages/_regex_core.py", line 354, in parse_item
if not subpattern or min_count == max_count == 1:
NameError: global name 'subpattern' is not defined
>>>
Original comment by Bill.Jan...@gmail.com
on 27 Jun 2011 at 3:47
Fuzzy matching wasn't in that release. You're the only person to have reported
a problem with it! (Definitely a bug, though...)
New release (regex-0.1.20110627), _with_ fuzzy matching, now on PyPI.
If all goes well, I'll be adding fuzzy named lists and group references in the
next release.
Original comment by re...@mrabarnett.plus.com
on 27 Jun 2011 at 4:14
Excellent. OK, let me try it out, and I'll see if I can come up with more test
cases.
This kind of thing is heavily used in text mining applications over OCR'd
document sets. Mainly legal and medical, as far as I know. The legal use case
is scanning a huge discovery collection for specific terms of art, and for
specific named parties (which is where fuzzy named lists would come in).
Unfortunately, it's hard to get real-world data to try things on, because legal
and medical documents are typically privileged info.
Original comment by Bill.Jan...@gmail.com
on 27 Jun 2011 at 5:35
I have a number of lists of real-world addresses obtained via OCR. I'm trying
to figure out how to turn them into a test for the fuzzy matching. Any ideas?
Original comment by Bill.Jan...@gmail.com
on 29 Jun 2011 at 4:20
No, no ideas.
BTW, I'm currently testing fuzzy named lists and group references for the next
release.
Original comment by re...@mrabarnett.plus.com
on 30 Jun 2011 at 12:08
The new release (regex-0.1.20110702) supports fuzzy named lists and group
references.
Original comment by re...@mrabarnett.plus.com
on 2 Jul 2011 at 5:14
I'm back.
One of the uses for fuzzy named lists is finding specific cities, streets, etc.
in OCR'd lists of addresses. I'll whomp up a test case.
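Something along these lines, perhaps (a hypothetical sketch - the list name
and the data are invented):

import regex

# Search OCR'd text for any of the listed city names, allowing one
# error per name to absorb a typical misread character.
text = "123 Main St, Bostn MA"
m = regex.search(r"\b(\L<cities>){e<=1}\b", text,
                 cities=["Boston", "Seattle", "Chicago"])
# "Bostn" is one deletion away from "Boston", so this should match.
print(m.group() if m else "no match")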
Original comment by Bill.Jan...@gmail.com
on 7 Jul 2011 at 7:26
I just tried this interesting feature and would like to ask about the handling
of the fuzzy patterns.
Am I missing something, or is there some kind of context sensitivity in the
matching? Are the starting/ending parts of the string possibly treated in
some special way?
Is there a specific order or hierarchy in which the error types are
considered (e.g. in an unqualified ...{e} pattern)?
I tried to figure something out via some naive trials (I confess, I wasn't
able to comprehend the matching behaviour from the sources...).
>>> regex.findall("F{e}", "Fabcd")
['F', 'a', 'b', 'c', 'd', '']
>>> regex.findall("F{e}", "abFcd")
['F', 'c', 'd', '']
>>> regex.findall("F{e}", "abcdF")
['F', '']
Probably, the presence of a substring matching the pattern exactly limits the
matching of others (preceding it?). Cf. the following with the above examples:
>>> regex.findall("F{e}", "abcd")
['a', 'b', 'c', 'd', '']
It appears that towards the end of the string there is sometimes more chance
to match (even within the same substring - consistent with the above).
>>> regex.findall("F{e}", "abFcd abFcd")
['F', 'F', 'c', 'd', '']
>>>
Would it be possible to specify the matching resolution in more detail in
order to make the fuzziness more verifiable/predictable?
Many thanks for the continuing enhancements; now with the Unicode properties
and fuzzy matching, I believe recursion, code embedding, and maybe case
conversion on sub() are probably the only remaining unsupported features
present in some other implementations :-) - not that I would want to use them
too often (maybe except for the last one...).
Regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 21 Jul 2011 at 12:26
Fuzzy matching applies to the preceding item in the same way as the
quantifiers do, so "F{e}" matches "F" with any number of errors permitted; it
could therefore match _anything_. However, it will always try to find the
best match, the match with the lowest cost.
regex.search("F{e}", "abFcd") will match "F" because that's the best match.
regex.findall performs multiple searches, each starting where the previous
match finished, so with regex.findall("F{e}", "abFcd"):
1. It searches for the best match from the start, finding "F" (exact match,
cost is 0).
2. It searches for the best match after "F", finding "c" (1 substitution, cost
is 1).
3. It searches for the best match after "c", finding "d" (1 substitution, cost
is 1).
4. It searches for the best match after "d", finding "" (1 deletion, cost is 1).
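That stepping can be pictured with a small driver (a conceptual sketch only,
not the module's actual implementation):

import regex

def findall_sketch(pattern, text):
    # Chain searches, each starting where the previous match finished.
    results = []
    p = regex.compile(pattern)
    pos = 0
    while pos <= len(text):
        m = p.search(text, pos)
        if m is None:
            break
        results.append(m.group())
        # Step past the match; advance one extra character after an
        # empty match so that the loop terminates.
        pos = m.end() + (1 if m.end() == m.start() else 0)
    return results

# findall_sketch("F{e}", "abFcd") would reproduce ['F', 'c', 'd', '']
# by exactly the four steps listed above.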
Surprising? Possibly. But the TRE regex engine finds the best match, and this
implementation copies that behaviour.
An alternative behaviour would be to match anything whose cost is within the
limits, but that can have surprising results too.
Given the regex "(cat){e<=1}" and the string "dog cat", it would find " cat"
because " cat" comes before "cat" and has only 1 error.
If regex.findall had the alternative behaviour, then regex.search would have it
too, but usually you want to find the best match, so returning " cat" when
"cat" is better would also be surprising.
Original comment by re...@mrabarnett.plus.com
on 21 Jul 2011 at 2:04
Thanks for the clarification; it was indeed kind of surprising to me that
equally inexact elements were matched (or not) depending on their position
with respect to the best match. I think the alternative behaviour would be
less surprising (to me), but if it is the behaviour of other engines
supporting this feature, it might be better to stick with it.
Normally one would use a more complex pattern, possibly with error weights,
defined word boundaries etc., which would limit the "surprises" quite a bit.
vbr
Original comment by Vlastimil.Brom@gmail.com
on 21 Jul 2011 at 2:51
Just one more thought, as I see the notes on OCR checks and similar, and
given the context-dependent behaviour - is it somehow unexpected or
discouraged to use fuzzy matching with findall() or finditer(), as search()
is supposed to give the first "optimal" result? (In the mentioned OCR case,
one would probably test separate words against dictionary items and not
within a larger text.)
It was indeed unclear to me, when I saw "any number of errors" in the spec
for, say, "F", how this is resolved "in the face of ambiguity".
Would it eventually be possible to match anything within the given
constraints, also checking the overlaps, and maybe expose the results in
ascending order of "cost"? (Maybe a bit similar to the "overlapped" parameter
or match.captures(...), which also expose rather "internal" steps of the
matching.)
But anyway, if the alternatives also have drawbacks (like performance etc.)
and would differ from other implementations, it's probably better to stick
with the usual way.
It's rather me trying to understand the matching rules for use on more or
less known data (in contrast to OCR or spell checking and the like, where
some randomness is inherent and it therefore probably doesn't hurt to have
some kind of irregularity in the matching too).
Regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 21 Jul 2011 at 6:49
Original comment by re...@mrabarnett.plus.com
on 6 Aug 2011 at 3:29
Sorry for another question regarding fuzzy matching; it seems I'm still
missing something topical about this feature.
I tried to make the matches more consistent for the whole text (i.e. in order
not to mask suboptimal valid results preceding a better match - cf. comments
20, 21 here). This can be quite straightforward: simply repeat the search in
the substring before the first match of the previous search and collect the
results (the obvious drawback is the possibly irregular handling of anchors
and lookarounds).
However, I was surprised by another issue - some matches I'd have expected
are also missing after the best match, despite supposedly complying with the
error cost:
>>> import regex
>>> txt = "ab ac ad ae ba bc bd be ca cb cd ce da db dc de ea eb ec ed"
>>> regex.findall(r"\bad{e<=1}\b", txt)
['ad', 'ae']
(with my custom findall function I can also get the "words" preceding "ad":
['ab', 'ac', 'ad', 'ae']
but there are still others missing; the following pattern only emulates
substitutions via alternation, as insertions and deletions aren't relevant in
this sample, given the word boundaries:
>>> regex.findall(r"\ba.|.d\b", txt)
['ab', 'ac', 'ad', 'ae', 'bd', 'cd', 'ed']
>>>
Is there maybe some left-to-right bias in the algorithm, or am I missing
something else?
Possibly related to this, how are empty strings matched? E.g.:
>>> regex.findall(r"b{e<=1}", "abcd")
['b', 'c', 'd', '']
Why are there no empty matches between the characters (all of them would cost
one deletion), but only the last one?
Thanks in advance,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 10 Aug 2011 at 11:09
The regex r"\bad{e<=1}\b" is:
\b boundary
a literal "a"
d{e<=1} literal "d" with maximum of 1 error
\b boundary
In the first example, the best match in the text is "ad", and the best match in
the remainder is "ae". There are no further matches which meet the constraint.
It always looks for the best match in (the remainder of) the text.
I think I'll have to change the current behaviour (it seems to be too
confusing) and add a BEST flag for when you want the best match, unless Bill
(the original requester) or anyone else has an objection.
To summarise the differences:
Example 1:
txt = "ab ac ad ae ba bc bd be ca cb cd ce da db dc de ea eb ec ed"
regex.findall(r"\bad{e<=1}\b", txt)
Current: ['ad', 'ae']
Proposed: ['ab', 'ac', 'ad', 'ae']
Example 2:
regex.findall(r"b{e<=1}", "abcd")
Current: ['b', 'c', 'd', '']
Proposed: ['a', 'b', 'c', 'd', '']
Original comment by re...@mrabarnett.plus.com
on 11 Aug 2011 at 1:16
Thanks for the detailed answer; now I notice that a part of my surprise was
obviously due to the missing parentheses, i.e. I meant:
>>> txt = "ab ac ad ae ba bc bd be ca cb cd ce da db dc de ea eb ec ed"
>>> regex.findall(r"\b(ad){e<=1}\b", txt)
['ad', 'ae', 'bd', 'cd', 'ed']
to be roughly equivalent to:
>>> regex.findall(r"\ba.|.d\b", txt)
['ab', 'ac', 'ad', 'ae', 'bd', 'cd', 'ed']
(given the text and the pattern; with the proposed change to also match before
the "optimal" match).
My naive approach is something like the following (sorry for the style and
coding - there must be some more elegant ways to deal with the duplication of
the function arguments etc.):
# # # # # # # # # # # # # # # # # # #
def fuzzy_findall(pattern, string, flags=0, pos=None, endpos=None,
                  overlapped=False, concurrent=None, **kwargs):
    tmp_results = []
    string = string[0:endpos]  # to replace endpos, which is used in another way
    tmp_end_pos = len(string)
    while tmp_end_pos:
        tmp_results.append(regex.findall(pattern=pattern, string=string,
                                         flags=flags, pos=pos,
                                         endpos=tmp_end_pos,
                                         overlapped=overlapped,
                                         concurrent=concurrent, **kwargs))
        m = regex.search(pattern=pattern, string=string, flags=flags, pos=pos,
                         endpos=tmp_end_pos, overlapped=overlapped,
                         concurrent=concurrent, **kwargs)
        if m:
            tmp_end_pos = m.start()
        else:
            break
    tmp_results.reverse()
    results = [result for sublist in tmp_results for result in sublist]
    return results
# # # # # # # # # # # # # # # # # #
using this function I get e.g.
>>> fuzzy_findall(r"\b(ad){e<=1}\b", txt)
['ab', 'ac', 'ad', 'ae', 'bd', 'cd', 'ed']
but
>>> print fuzzy_findall(r"c{e<=1}", "abcd")
['a', 'b', '', 'c', 'd', '']
with the empty match for the end of the substring in the second trial-match
before "c".
I actually had expected empty matches between all the single characters
['', 'a', '', 'b', '', 'c', '', 'd', '']
or is it again some misunderstanding on my part?
Of course, native behaviour in the regex engine would be much preferable to
such a fragile wrapper function (also because of the mentioned anchors,
lookarounds etc.).
Many thanks for your efforts,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 11 Aug 2011 at 2:02
After the proposed change:
>>> txt = "ab ac ad ae ba bc bd be ca cb cd ce da db dc de ea eb ec ed"
>>> regex.findall(r"\b(ad){e<=1}\b", txt)
['ab', 'ac', 'ad', 'ae', 'bd', 'cd', 'ed']
>>> regex.findall(r"c{e<=1}", "abcd")
['a', 'b', 'c', 'd', '']
In the second example:
matches "a" with 1 error
matches "b" with 1 error
matches "c" with 0 errors
matches "d" with 1 error
matches "" at the end with 1 error
Substitution is tried first, then insertion, then deletion.
Original comment by re...@mrabarnett.plus.com
on 11 Aug 2011 at 2:22
Thanks, the proposed behaviour looks really promising and (at least for me)
much more consistent.
The deletion being found only at the end, and not throughout the whole string
at character boundaries, perhaps shows some asymmetry, but your explanation
makes it clear why, and these empty matches are probably not likely to be
useful anyway.
regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 11 Aug 2011 at 7:35
The regex module now looks for the first match.
Use the BESTMATCH flag to look for the best match.
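For example (a quick sketch, reusing the "dog cat" example from earlier in
the thread):

>>> regex.search(r"(cat){e<=1}", "dog cat").group()   # first match
' cat'
>>> regex.search(r"(cat){e<=1}", "dog cat", flags=regex.BESTMATCH).group()
'cat'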
Original comment by re...@mrabarnett.plus.com
on 27 Sep 2011 at 10:26
Lately, I have used fuzzy matching for some error checking and it proved to
be very helpful; however, I'd like to better understand the fuzzy matching
process and its rules, to be able to reflect them in the patterns.
(Unfortunately, I was not able to learn much from the source, as most of the
"magic" apparently happens in the C code - or was it Python beyond my
competence level? :-)
Comment 28 above clarifies the order of the error types being checked:
"Substitution is tried first, then insertion, then deletion."
Now, I would be especially interested in how the segmentation of the matches
is designed if a large or unlimited number of errors is allowed.
It seems likely that the matches without any error are taken as a base, which
are then tried for the variance defined in the cost pattern.
Is it the case that the match possibilities are determined first, and the
allowed errors don't change/shift these "boundaries" for possible matches?
It seems that, e.g., a quantifier allowing possibly longer matches somehow
"eats up" the possibilities for checking within the error cost:
>>> regex.findall(r"X{e}","abc")
['a', 'b', 'c', '']
>>> regex.findall(r"(X+){e}","abc")
[]
>>> regex.findall(r"(a+){e}","abc")
['a']
>>>
Does the algorithm work this way, or am I misinterpreting these simplified
results?
Just another thing I observed - without parentheses (which seem to be
preferable in most cases anyway), the cost pattern between {} seems to apply
to the preceding "literal" only, e.g. also excluding quantifiers (?):
>>> regex.findall(r"a{e}","abc")
['a', 'b', 'c', '']
>>> regex.findall(r"a{1}{e}","abc")  # the cost is possibly applied to the empty string after the quantifier (?)
[]
>>> regex.findall(r"(a{1}){e}","abc")
['a', 'b', 'c', '']
>>>
As an interesting and probably useless corner case, I found that an empty
pattern matches the empty string at character boundaries without the cost
pattern, but matches nothing when any errors are allowed:
>>> regex.findall(r"","abc")
['', '', '', '']
>>> regex.findall(r"{e}","abc")
[]
Is it because there is maybe no useful text for the error checking to begin
with?
However, a similar zero-width pattern matches identically:
>>> regex.findall(r"Q{0}","abc")
['', '', '', '']
>>> regex.findall(r"(Q{0}){e}","abc")
['', '', '', '']
>>>
This is meant purely as a question about the matching behaviour; in practice
I am very happy with the present functionality. I just feel that I could use
it even better with an appropriate understanding of the "segmentation" of the
possible matches tried. I would simply like to understand how many
"erroneous" cases I can match with this feature (not just in the extreme case
of {e}, which would conceptually match anything, but obviously doesn't).
Many thanks in advance.
vbr
Original comment by Vlastimil.Brom@gmail.com
on 3 Oct 2011 at 9:40
As it tries to match, if a part of the pattern fails to match, for example, "a"
doesn't match "b", it assumes that there's an error, whether a substitution,
insertion or deletion, and then continues. It tries each type of error in
turn, checking whether the error constraints allow it.
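For example (a sketch of that logic using the released syntax):

>>> # Matching "cat" against "cut": 'c' matches, 'a' vs 'u' fails, so a
>>> # substitution (cost 1) is tried and allowed, and then 't' matches.
>>> regex.search(r"(cat){e<=1}", "cut").group()
'cut'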
The error constraint follows the same syntactic rule as the quantifiers,
applying to the preceding item. If there's no preceding item, it's treated as a
literal, for example, r"{e}" is treated the same as r"\{e}":
>>> regex.findall(r"{e}","abc{e}")
['{e}']
I do note that the quantifiers raise an exception if there's no preceding item,
so it could be argued that the error constraint should do the same.
Original comment by re...@mrabarnett.plus.com
on 3 Oct 2011 at 11:30
Thanks for the prompt answer;
the fallback to literal makes the relevant pattern clear, as does the analogy
to quantifiers in the other cases.
However, how do quantifiers and error costs work together?
It now seems to me that the two should be combined rather carefully for the
same subpattern (or is it discouraged altogether?).
In some elementary patterns (below) it seems that the quantifier somehow
hinders matching a (suboptimal) pattern (likely satisfying the general
constraint {e}) which is found if equivalent simple literals are used;
cf. e.g. (Q{1,2}){e} vs (Q){e}, (QQ){e} ...
>>> regex.findall(r"(a+){e}","abc")
['a']
>>> regex.findall(r"(Q+){e}","abc")
[]
>>> regex.findall(r"(Q){e}","abc")
['a', 'b', 'c', '']
>>> regex.findall(r"(QQ){e}","abc")
['ab', 'c', '']
>>> regex.findall(r"(QQQ){e}","abc")
['abc', '']
>>> regex.findall(r"(QQQQ){e}","abc")
['abc', '']
>>> regex.findall(r"(a{1,5}){e}","abc")
['a']
>>> regex.findall(r"(Q{1,1}){e}","abc")
['a', 'b', 'c', '']
>>> regex.findall(r"(Q{1,2}){e}","abc")
[]
I see that such patterns are not very realistic or useful, but I just feel I
should understand the matching possibilities better.
Regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 4 Oct 2011 at 12:04
That's a bug.
It optimises repeats of single-character matches (a character, a set, or a
property). Unfortunately, that doesn't work as you'd expect when using fuzzy
matching. :-(
Fixed in regex 0.1.20111004.
Original comment by re...@mrabarnett.plus.com
on 4 Oct 2011 at 1:26
Thank you very much; with the updated version (regex-0.1.20111004) the fuzzy
matching is a bit more predictable for me. However, I still don't understand,
e.g.:
>>> regex.findall(r"\m(?:[^b]+){e}\M", "abc cde dba")
['cde', 'dba']
>>> regex.findall(r"\m(?:[^b]+){e}\M", "abc dba cde dba")
['abc', 'cde', 'dba']
>>>
I'd think that all of the "words" would match with at most one error;
however, the more difficult aspect is the different handling of "abc" vs.
"dba", which both need one substitution to match.
Even less clear to me is the second example - adding "dba" lets the previous
"abc" match, but then only the second "dba" is matched, and not the one in
the 2nd word position.
Is this a peculiarity of the fuzzy matching or of the word-beginning/end
anchors?
vbr
Original comment by Vlastimil.Brom@gmail.com
on 4 Oct 2011 at 11:11
When it finds a fuzzy match it tries to see whether it can get a better fit (a
match with a lower cost) within that match.
The initial match for:
regex.findall(r"\m(?:[^b]+){e}\M", "abc cde dba")
is actually "abc cde dba" (the cost is 2), but a better fit is "abc cde" (the
cost is 1), and an even better fit is "cde" (the cost is 0).
This results in the final output of ['cde', 'dba'].
If it didn't look for a better fit the final output would be ["abc cde dba"].
Let me give you another example:
>>> regex.findall(r"\m(?:[a-z]+){e}\M", "abc cde dba")
# With "better fit":
['abc', 'cde', 'dba']
# Without "better fit":
['abc cde dba']
If I change your example slightly:
>>> regex.findall(r"\m(?:[^b ]+){e}\M", "abc cde dba")
# With "better fit":
['abc', 'cde', 'dba']
# Without "better fit":
['abc cde dba']
And that's my justification. :-)
Original comment by re...@mrabarnett.plus.com
on 5 Oct 2011 at 12:30
Thanks for the explanation; the recursive readjusting of the match for the
lowest error cost indeed makes most of the cases which weren't very clear to
me obvious.
Now I feel I may even grasp the fuzzy matching rules - sooner or later...
Thanks and regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 5 Oct 2011 at 1:25
Original issue reported on code.google.com by
Bill.Jan...@gmail.com
on 2 Jun 2011 at 8:33