pombredanne / esmre-legacy

Efficient String Matching Regular Expressions
http://code.google.com/p/esmre
GNU Lesser General Public License v2.1

Invalid regex match #11

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
esmre reports regex matches that the standard re module shows do not exist.

The following unit test should pass, but it fails because of the invalid match that esmre finds.

import unittest
import re
import esmre

class Test(unittest.TestCase):

    def test_match(self):

        # Raw strings keep \s and \? from being read as string escapes.
        pattern1 = r"what\s+cities\s+are\s+located\s+in\s+[a-zA-Z]+\?"
        pattern2 = r"what\s+cities\s+are\s+near\s+[a-zA-Z]+\?"
        question1 = "what cities are located in China?"
        question2 = "what cities are near China?"

        # Build index.
        index = esmre.Index()
        index.enter(pattern1, 1)
        index.enter(pattern2, 2)

        # These pass
        self.assertEqual(re.findall(pattern1, question1), [question1])
        self.assertEqual(re.findall(pattern1, question2), [])
        self.assertEqual(re.findall(pattern2, question2), [question2])
        self.assertEqual(re.findall(pattern2, question1), [])

        # These fail?
        self.assertEqual(index.query(question1), [1])
        self.assertEqual(index.query(question2), [2])

if __name__ == '__main__':
    unittest.main()

Original issue reported on code.google.com by chrisspen@gmail.com on 20 Nov 2012 at 5:17

GoogleCodeExporter commented 8 years ago
Here are some other simple regular expressions that are broken in esmre-0.3.1:

# Parentheses
>>> index = esmre.Index()
>>> index.enter("too(l|th)",1)
>>> index.query("too")
[1]

# Brackets
>>> index = esmre.Index()
>>> index.enter("wa[tv]er",1)
>>> index.query("wa")
[1]
>>> index.query("wage")
[1]

Original comment by quant...@gmail.com on 16 May 2014 at 11:08
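
For comparison, the same patterns and inputs can be checked against the standard re module alone. This is a minimal sketch that does not import esmre; the helper name reference_matches and the CASES list are made up for illustration.

```python
import re

# Patterns and inputs copied from the sessions above.
CASES = [
    (r"too(l|th)", "too"),
    (r"wa[tv]er", "wa"),
    (r"wa[tv]er", "wage"),
]


def reference_matches(cases):
    """Return one bool per (pattern, text) pair, as judged by re.search."""
    return [re.search(pattern, text) is not None for pattern, text in cases]


print(reference_matches(CASES))  # [False, False, False]
```

None of the three inputs matches under re, so each [1] result shown above is a false positive.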

GoogleCodeExporter commented 8 years ago
Is this a bug?

Original comment by wkwl880...@gmail.com on 5 Jun 2015 at 7:38

GoogleCodeExporter commented 8 years ago
Yes, it looks like a major bug in how esmre handles nondeterminism, that is, branching expressions such as alternation and character classes.

Original comment by quant...@gmail.com on 5 Jun 2015 at 9:31
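
One possible workaround, sketched under assumptions: treat whatever the index returns as a list of candidates and confirm each one with re before trusting it. The thread does not say whether esmre intends query to return confirmed matches or only candidates, so this is not a statement about the library's design. The confirm helper and the candidate list [1, 2] are hypothetical; the report does not show the actual output of index.query.

```python
import re


def confirm(candidates, patterns, text):
    """Keep only the candidate IDs whose pattern really matches text.

    candidates: IDs returned by some index lookup (a stand-in list here)
    patterns:   dict mapping each ID to its regex source
    """
    return [cid for cid in candidates if re.search(patterns[cid], text)]


# Patterns from the original report, keyed by the IDs used there.
patterns = {
    1: r"what\s+cities\s+are\s+located\s+in\s+[a-zA-Z]+\?",
    2: r"what\s+cities\s+are\s+near\s+[a-zA-Z]+\?",
}

# Assumed over-inclusive candidate list, not real esmre output.
print(confirm([1, 2], patterns, "what cities are near China?"))  # [2]
```

The confirmation pass costs one re.search per candidate, so it stays cheap as long as the index keeps the candidate list short.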