scrapy / scrapely

A pure-python HTML screen-scraping library

Specifying integer values in the data dict #16

Closed buzypi closed 12 years ago

buzypi commented 12 years ago

Amazing work! This is really useful.

I ran into a minor issue with the way training data is provided. The documentation does not say that integer values are not allowed, so I ended up passing this data:

In [1]: from scrapely import Scraper

In [2]: s = Scraper()

In [3]: data = {'name': 'scrapy/scrapely', 'url': 'https://github.com/scrapy/scrapely', 'description': 'A pure-python HTML screen-scraping library', 'watchers': 42, 'forks': 9}

In [4]: url = "https://github.com/scrapy/scrapely"

and ran into this exception:

In [5]: s.train(url, data)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

...

/home/ubuntu/scrapely/scrapely/template.py in func(fragment, page)
     93     def func(fragment, page):
     94         fdata = page.fragment_data(fragment).strip()
---> 95         if text in fdata:
     96             return float(len(text)) / len(fdata) - (1e-6 * fragment.start)
     97         else:

TypeError: 'in <string>' requires string as left operand

It took me a while to realize that the issue was the integer values in the data dict.
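The failure can be reproduced without scrapely, since it comes down to Python's `in` operator rejecting a non-string left operand when the right operand is a string. A minimal sketch, assuming a made-up `fdata` value and a hypothetical `contains` helper that only mimics the check on line 95 of template.py:

```python
# Hypothetical fragment text; in scrapely this would come from
# page.fragment_data(fragment).strip().
fdata = "42 watchers"


def contains(text, fdata):
    """Mimic the `text in fdata` check from template.py (illustration only)."""
    return text in fdata


try:
    contains(42, fdata)  # integer on the left of `in` with a string on the right
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# The same check goes through once the value is a string.
ok = contains(str(42), fdata)
```

So any non-string value in the data dict will trip this check as soon as it is compared against fragment text.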

So, you could either convert each value to a unicode string inside that function:

if unicode(text) in fdata:
    return float(len(unicode(text))) / len(fdata) - (1e-6 * fragment.start)

or specify in the documentation that values should all be strings.
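In the meantime, a caller-side workaround is to convert the values before training. This is only a sketch: `stringify_values` is a hypothetical helper, not part of scrapely, and it uses Python 3's `str` (on Python 2 the equivalent would be `unicode`).

```python
def stringify_values(data):
    """Return a copy of `data` with every non-string value converted to a string.

    Hypothetical helper for illustration; not part of scrapely.
    """
    return {
        key: value if isinstance(value, str) else str(value)
        for key, value in data.items()
    }


data = {
    'name': 'scrapy/scrapely',
    'url': 'https://github.com/scrapy/scrapely',
    'description': 'A pure-python HTML screen-scraping library',
    'watchers': 42,
    'forks': 9,
}

clean = stringify_values(data)
# The integer values are now strings, so passing `clean` to s.train(url, clean)
# should avoid the `text in fdata` TypeError shown above.
```

The original dict is left untouched, so the numeric values are still available for anything else that needs them.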