thinker007 / templatemaker

Automatically exported from code.google.com/p/templatemaker

Python limitation on examples. #4

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Using the two Yelp.com pages given as examples in Adrian's blog post,
templatemaker runs into a limit in Python's _sre module (100 groups per pattern).

Here's my interactive interpreter session:
>>> from templatemaker import Template
>>> import urllib2
>>> t = Template()
>>> t.learn(urllib2.urlopen('http://www.yelp.com/biz/rp17Dfjdh7JR4GGZwj6nqg').read())
>>> t.learn(urllib2.urlopen('http://www.yelp.com/biz/8vFJH_paXsMocmEO_KAa3w').read())
True
>>> t.extract(urllib2.urlopen('http://www.yelp.com/biz/AqgG-1aD6JYj9D6OmBWO3w').read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/templatemaker.py", line 78, in extract
    m = re.search(regex, text)
  File "/usr/lib/python2.5/re.py", line 134, in search
    return _compile(pattern, flags).search(string)
  File "/usr/lib/python2.5/re.py", line 231, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.5/sre_compile.py", line 518, in compile
    "sorry, but this version only supports 100 named groups"
AssertionError: sorry, but this version only supports 100 named groups

Original issue reported on code.google.com by flo...@gmail.com on 12 Jul 2007 at 8:56

GoogleCodeExporter commented 9 years ago
I'm getting this too. I printed out num_holes() just before calling extract(),
and it looks like my template had 254 holes. If I get a chance I'll take a closer
look at the code and try writing a patch that uses a string-splitting approach
(this might actually be faster than re.search, just a hunch).
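
For illustration, here is a rough sketch of what a string-splitting extract
could look like. It is not a patch against templatemaker: the function name
extract_by_splitting and the "list of literal chunks" representation of a
template are assumptions made for this example, and unlike a regex it never
backtracks, so it can reject text that a non-greedy pattern would accept.

```python
def extract_by_splitting(literals, text):
    """Return the hole contents of `text` as a tuple, or None if it doesn't fit.

    `literals` (hypothetical representation) is the fixed text around the
    holes, in order: N holes need N + 1 entries, and the first and last
    entries may be empty strings.
    """
    if len(literals) == 1:
        # No holes at all: the text must match the template exactly.
        return () if text == literals[0] else None

    if not text.startswith(literals[0]):
        return None
    pos = len(literals[0])
    holes = []

    # Each middle literal ends the current hole at its earliest occurrence.
    for literal in literals[1:-1]:
        idx = text.find(literal, pos)
        if idx == -1:
            return None
        holes.append(text[pos:idx])
        pos = idx + len(literal)

    # The last literal must be the suffix; the final hole is what remains.
    last = literals[-1]
    end = len(text) - len(last)
    if end < pos or not text.endswith(last):
        return None
    holes.append(text[pos:end])
    return tuple(holes)
```

Because this only uses str.find, the number of holes has no bearing on any
regex group limit; a 254-hole template is handled the same way as a 2-hole one.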

As an aside, this 100-group limit supposedly made it into the 2.4 final release
by accident (according to a Google search), and has probably been removed since.
Even so, a pattern of this size might still segfault the re engine.
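
A quick way to check whether a given interpreter still has the limit is to try
compiling a pattern with that many groups. This is only a probe sketch: the
helper name can_compile_groups is made up for this example, and the set of
exceptions caught is a guess at what different Python versions might raise
(the traceback above shows AssertionError on 2.5).

```python
import re


def can_compile_groups(n):
    """Return True if this interpreter compiles a pattern with n groups."""
    # n non-greedy groups separated by a literal character.
    pattern = "x".join(["(.*?)"] * n)
    try:
        re.compile(pattern)
    except (AssertionError, re.error, OverflowError, RuntimeError):
        # Assumed failure modes; Python 2.5 raised AssertionError.
        return False
    return True


print(can_compile_groups(254))
```

The result depends on the Python version it is run under, so no expected
output is shown here.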

Original comment by kumar.mcmillan on 17 Aug 2007 at 4:48

GoogleCodeExporter commented 9 years ago
To clarify: I wasn't getting this with the examples, but when building my own
template from a site that lists product results in a table.

Original comment by kumar.mcmillan on 17 Aug 2007 at 4:49

GoogleCodeExporter commented 9 years ago
I get it too in Python 2.5. Any updates?

P.S. num_holes() is the funniest function I've called all week!

Original comment by remarkability@gmail.com on 18 Mar 2008 at 7:25

GoogleCodeExporter commented 9 years ago
I guess if we were clever we could compile in a whole new regex engine...
presumably... ah well...

Original comment by remarkability@gmail.com on 18 Mar 2008 at 7:26