dragondave closed 12 years ago
You're right, there should be a way to initialize just from the html (or HtmlPage), certainly not requiring a URL to be downloaded with urllib.
We'll refactor that class; there are a few things we'd like to improve in it.
You can now train using HtmlPage objects, which can be constructed from unicode. Instead of Scraper.train, use Scraper.train_from_htmlpage.
If your framework/library doesn't support decoding of html responses to unicode, you can use the support I recently added to w3lib. See this example of making an HtmlPage from urllib2: https://github.com/scrapy/scrapely/blob/master/scrapely/htmlpage.py#L11
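For anyone landing here with raw bytes from another downloader, here is a rough sketch of the decode-then-train flow described above. It is written for Python 3 and is not taken from scrapely or w3lib: `decode_html` and `train_from_html` are hypothetical helper names, the charset handling is a simple stdlib stand-in for what w3lib does more thoroughly, and the `HtmlPage(url=..., body=...)` keyword arguments are my understanding of the constructor, so check them against your installed version.

```python
from email.message import Message


def decode_html(raw, content_type=None, default="utf-8"):
    """Decode downloaded bytes to unicode, using the Content-Type charset if present.

    Hypothetical helper: a simplified stand-in for proper encoding detection.
    """
    charset = None
    if content_type:
        # Let the stdlib parse "text/html; charset=..." for us.
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
    try:
        return raw.decode(charset or default)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: fall back to the default.
        return raw.decode(default, errors="replace")


def train_from_html(scraper, html, data, url=None):
    """Train a scrapely Scraper from already-decoded HTML (hypothetical helper)."""
    # Imported lazily so decode_html stays usable without scrapely installed.
    from scrapely.htmlpage import HtmlPage

    # Assumption: HtmlPage accepts `url` and a unicode `body` as keyword arguments.
    page = HtmlPage(url=url, body=html)
    scraper.train_from_htmlpage(page, data)
```

With something like this, the bytes can come from Mechanize, a cache, or a file on disk; only the decoded text and an optional URL are handed to scrapely.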
http://groups.google.com/group/scraperwiki/browse_thread/thread/d750d093ca5220bf ... was posted, wanting to use Mechanize to download HTML [since the data was behind a login] and Scrapely to parse it.
As far as I can see, Scrapely doesn't support that.
I've made https://scraperwiki.com/scrapers/scrapely-hack/ to try to work around that.
The core change is in Scraper._get_page, where: is added before , an optional 'html' parameter is added to Scraper.scrape, .train and _get_page [and passed to _get_page], and the 'url' parameter is made optional.
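To illustrate the shape of that workaround, here is a minimal standalone sketch of a page getter that takes an optional 'html' argument and only downloads when none is supplied. It is not the code from the scraper linked above or from scrapely itself: `get_page`, `_download`, and the `fetch` parameter are hypothetical names used only for this example, and it is written for Python 3.

```python
from urllib.request import urlopen


def _download(url):
    """Default fetcher: download the URL and decode it leniently as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def get_page(url=None, html=None, fetch=None):
    """Return page text, downloading from `url` only when no `html` is given."""
    if html is not None:
        # Caller already has the markup (e.g. fetched behind a login): skip the download.
        return html
    if url is None:
        raise ValueError("either url or html is required")
    # `fetch` lets callers plug in their own downloader instead of urlopen.
    return (fetch or _download)(url)
```

Callers that already hold the HTML pass it straight through, for example `get_page(html=browser_html)`, while existing URL-based calls keep working unchanged.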