scrapy / scrapely

A pure-python HTML screen-scraping library

ZeroDivisionError when training with zero-length data #49

Open haywhisksoftware opened 10 years ago

haywhisksoftware commented 10 years ago

(Minor bug.) I installed scrapely from pip this morning.

This is a wacky edge case, but I think you could raise a more constructive error.

(Who wants to extract a zero-length string from a document? It's a bit like a magician pulling some atmosphere out of a hat: it's always going to be there...)
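
One way to get a more constructive error would be to check the training data before any matching is attempted. This is only a rough sketch: `check_training_data` is a hypothetical helper, not part of scrapely, and it assumes each field maps to a string or to a list of strings.

```python
def check_training_data(data):
    """Reject empty training values before any fragment matching happens.

    Hypothetical helper for illustration; it is not part of scrapely.
    """
    for field, values in data.items():
        # Assumption: a field holds either one string or a list/tuple of strings.
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            # A value that is empty (or only whitespace) cannot be located
            # meaningfully in a page, so fail early with a readable message.
            if not value.strip():
                raise ValueError(
                    "Cannot train on an empty value for field %r" % field
                )
    return data


try:
    check_training_data({'image': u''})
except ValueError as exc:
    print(exc)
```

Calling something like this at the top of a training routine would turn the ZeroDivisionError into a ValueError that names the offending field.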

Here is how to reproduce it:

In [97]: from scrapely import Scraper

In [98]: s = Scraper()

In [99]: s.train('http://www.google.com', {'image': u''})
- - - - - - - - - - - - - - - - -
ZeroDivisionError                         Traceback (most recent call last)
/home/username/myfolder/<ipython-input-99-233d0ac90e7f> in <module>()
----> 1 s.train('http://www.google.com', {'image': u''})

/usr/local/lib/python2.7/dist-packages/scrapely/__init__.pyc in train(self, url, data, encoding)
     44     def train(self, url, data, encoding=None):
     45         page = url_to_page(url, encoding)
---> 46         self.train_from_htmlpage(page, data)
     47 
     48     def scrape(self, url, encoding=None):

/usr/local/lib/python2.7/dist-packages/scrapely/__init__.pyc in train_from_htmlpage(self, htmlpage, data)
     39                 if isinstance(value, str):
     40                     value = value.decode(htmlpage.encoding or 'utf-8')
---> 41                 tm.annotate(field, best_match(value))
     42         self.add_template(tm.get_template())
     43 

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in annotate(self, field, score_func, best_match)
     31 
     32         """
---> 33         indexes = self.select(score_func)
     34         if not indexes:
     35             raise FragmentNotFound("Fragment not found annotating %r using: %s" % 

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in select(self, score_func)
     46         matches = []
     47         for i, fragment in enumerate(htmlpage.parsed_body):
---> 48             score = score_func(fragment, htmlpage)
     49             if score:
     50                 matches.append((score, i))

/usr/local/lib/python2.7/dist-packages/scrapely/template.pyc in func(fragment, page)
     95         fdata = page.fragment_data(fragment).strip()
     96         if text in fdata:
---> 97             return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
     98         else:
     99             return 0.0

ZeroDivisionError: float division by zero

ironmaniiith commented 8 years ago

This is the reason for the error.

return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
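
To see that line fail in isolation, here is a small sketch. `Fragment` and `FakePage` are made-up stand-ins for scrapely's own objects (an assumption for illustration); only the scoring lines are copied from the traceback in this issue.

```python
from collections import namedtuple

# Stand-ins for scrapely's fragment and page objects (assumptions for
# illustration): only the attributes the scoring code touches are modelled.
Fragment = namedtuple("Fragment", ["start", "end"])


class FakePage(object):
    def __init__(self, body):
        self.body = body

    def fragment_data(self, fragment):
        return self.body[fragment.start:fragment.end]


def best_match(text):
    # Same scoring logic as the lines shown in the traceback.
    def func(fragment, page):
        fdata = page.fragment_data(fragment).strip()
        if text in fdata:
            return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
        else:
            return 0.0
    return func


page = FakePage(u"<p>   </p>")
whitespace = Fragment(start=3, end=6)  # the three spaces between the tags

# u"" is a substring of every string, so the membership test passes even
# though the stripped fragment is empty, which makes len(fdata) zero.
error = None
try:
    best_match(u"")(whitespace, page)
except ZeroDivisionError as exc:
    error = exc
```

In this sketch the failure comes from a fragment whose stripped text is empty being scored against an empty search string.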

If the returned score is inversely proportional to the length of fdata, can we just write this?

def best_match(text):
    def func(fragment, page):
        fdata = page.fragment_data(fragment).strip()
        if text in fdata:
            # Empty fragment: return infinity instead of dividing by zero.
            if not fdata:
                return float("inf")
            return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
        else:
            return 0.0
    return func

moneypython commented 8 years ago

This isn't a wacky edge-case at all.

I got the same error using actual data and had to patch it.

marekyggdrasil commented 4 years ago

Same here, I reproduced this error using regular, non-empty data.

marekyggdrasil commented 4 years ago

The patch has been merged, so I believe this issue can be closed?