scrapy / scrapely

A pure-python HTML screen-scraping library

Support for passing HTML, not just URLs #17

Closed: dragondave closed this issue 12 years ago

dragondave commented 12 years ago

http://groups.google.com/group/scraperwiki/browse_thread/thread/d750d093ca5220bf ... was posted by someone wanting to use Mechanize to download the HTML [since the data was behind a login] and Scrapely to parse it.

As far as I can see, Scrapely doesn't support that.

I've made https://scraperwiki.com/scrapers/scrapely-hack/ to try to work around that.

The core change is in Scraper._get_page where:

if html:
    body = html.decode(encoding)
else:

is added before

    body = urllib.urlopen(url).read().decode(encoding)

In addition, an optional 'html' parameter is added to Scraper.scrape, Scraper.train and Scraper._get_page [and passed through to _get_page], and the 'url' parameter is made optional.
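To make the shape of that change concrete, here is a minimal standalone sketch, not the actual Scrapely code. The function name get_body is hypothetical, it uses Python 3's urllib.request rather than the Python 2 urllib.urlopen quoted above, and it checks "html is not None" instead of plain truthiness so that an empty document is still accepted.

```python
from urllib.request import urlopen


def get_body(url=None, html=None, encoding="utf-8"):
    """Return the page body as text, preferring caller-supplied HTML.

    Hypothetical helper mirroring the change described above: if `html`
    is given, no network request is made; otherwise `url` is fetched.
    """
    if html is not None:
        # Accept raw bytes (e.g. from Mechanize) or already-decoded text.
        return html.decode(encoding) if isinstance(html, bytes) else html
    if url is None:
        raise ValueError("either url or html must be provided")
    # Fall back to the original behaviour: download and decode.
    with urlopen(url) as response:
        return response.read().decode(encoding)
```

With something like this in place, HTML fetched behind a login by another tool could be handed over directly, e.g. get_body(html=downloaded_bytes), and the URL would only be needed when nothing else is supplied.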

shaneaevans commented 12 years ago

You're right, there should be a way to initialize just from the html (or HtmlPage), certainly not requiring a URL to be downloaded with urllib.

We'll refactor that class; there are a few things we'd like to improve in it.

shaneaevans commented 12 years ago

You can now train using HtmlPage objects, which can be constructed from unicode. Instead of Scraper.train, use Scraper.train_from_htmlpage.

If your framework/library doesn't support decoding of html responses to unicode, you can use the support I recently added to w3lib. See this example of making an HtmlPage from urllib2: https://github.com/scrapy/scrapely/blob/master/scrapely/htmlpage.py#L11
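As a rough illustration of the decoding step mentioned above, the sketch below turns a raw response body into unicode using only the standard library. It is a stand-in under stated assumptions, not w3lib's implementation: the name decode_html is hypothetical, and it only looks at the charset in the Content-Type header, with a fallback encoding when that is missing or unusable.

```python
from email.message import Message


def decode_html(raw_body, content_type=None, default="utf-8"):
    """Decode response bytes to text (hypothetical helper, stdlib only).

    The charset is taken from the Content-Type header value when present;
    otherwise, or if decoding fails, `default` is used.
    """
    charset = None
    if content_type:
        # Let the email package parse "text/html; charset=..." for us.
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
    try:
        return raw_body.decode(charset or default)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or invalid bytes: decode leniently instead.
        return raw_body.decode(default, errors="replace")
```

The resulting unicode string is the kind of value an HtmlPage can be built from before calling Scraper.train_from_htmlpage, as described above.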