mrabarnett / mrab-regex

Other
446 stars 49 forks source link

= for fuzzy matches #41

Closed mrabarnett closed 12 years ago

mrabarnett commented 12 years ago

Original report by Anonymous.


= operator could be pretty handy for fuzzy matches, finding only erroneous text. For example, in a list of hotmail email accounts, you could search for misspells like '@(hotmail.com){e=1}'. This will save the user an extra "grep -v" for filtering out correct emails in the list of matches.

mrabarnett commented 12 years ago

Original comment by Anonymous.


or even something like this would be more powerful:

'@(hotmail\.com){0<e<3}'

matching only texts with 1 or 2 errors

mrabarnett commented 12 years ago

Original comment by Anonymous.


A regex such as:

@(hotmail\.com){e=1}

can be we written as:

@(?>(hotmail\.com){e<=1})(?<!@hotmail\.com)

although that's not quite as convenient, I admit! :-)

Note that the fuzzy part needs to be in an atomic group in order to stop it backtracking to find a worse match. For example, given the string "@hotmail.comb", the fuzzy part will match "@hotmail.com" with 0 errors, then the negative look-behind will reject it, so the fuzzy part will match "@hotmail.comb" with 1 error.

I'm not sure how easy it'll be to add a lower limit; such a problem could still occur.

mrabarnett commented 12 years ago

Original comment by Anonymous.


I think I've figured out how to do it, but how much demand is for it? You gave an example, but is that a real use case?

mrabarnett commented 12 years ago

Original comment by Anonymous.


I am fixing tags for 25k+ text documents for a web site, so I do have a real (different) use case. That was just an example. But I think it would be a really nice feature for regex module...

mrabarnett commented 12 years ago

Original comment by Anonymous.


Could you provide a few test cases?

mrabarnett commented 12 years ago

Original comment by Anonymous.


here is a real example translated into english

3 servic detection
1 service detect
5 service detecti
46 service detection
1 in service detection

The site has manually entered tags, and their frequencies from 25k+ (non-english) text documents. Most of the time the correct one has a high frequency, and anything that is close enough to a correct one (except itself) should probably get fixed..

mrabarnett commented 12 years ago

Original comment by Anonymous.


What fuzzy regex would you use to match the incorrect strings in your example? Would it be this:

(?:^\d+ service detection$){1<=e<=3}
mrabarnett commented 12 years ago

Original comment by Anonymous.


no no, the first part is the frequency of a tag, not part of it. I would search a match with:

r = compile(r'(?:service detection){0<e<5}')
m = r.match(str)
if m:
mrabarnett commented 12 years ago

Original comment by Anonymous.


or '(?:service detection){0<e<5}$' is also a possibility..

mrabarnett commented 12 years ago

Original comment by Anonymous.


Added in regex 0.1.20120119.

Note that it supports only constraints of the form e<=3 or 1<=e<=3 ("<" is also allowed), but not "=".

mrabarnett commented 12 years ago

Original comment by Anonymous.


thanks ^_^