scrapy / scrapely

A pure-python HTML screen-scraping library

Scraper refactor #19

Closed shaneaevans closed 12 years ago

shaneaevans commented 12 years ago

The Scraper class can now be trained with an HtmlPage instead of requiring a URL. Creating the HtmlPage for training is also more correct now (it handles encoding, headers, etc.).

The InstanceBasedLearningExtractor is no longer re-initialized on each request, improving performance.
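To illustrate the idea (this is only a sketch of the caching pattern, not the code in this change): the extractor is built lazily, reused across requests, and discarded only when a new template is added. `CachingScraper` and `build_extractor` are hypothetical stand-ins, not scrapely's own classes.

```python
class CachingScraper:
    """Hypothetical stand-in for a scraper that caches its extractor."""

    def __init__(self, build_extractor):
        # build_extractor is assumed to be an expensive factory that turns
        # a list of templates into a callable extractor.
        self._build_extractor = build_extractor
        self._templates = []
        self._extractor = None  # built on first use, then reused

    def add_template(self, template):
        self._templates.append(template)
        # Training changes the templates, so the cached extractor is stale.
        self._extractor = None

    def scrape(self, page):
        if self._extractor is None:
            self._extractor = self._build_extractor(list(self._templates))
        return self._extractor(page)


# Record every build so the reuse is visible.
builds = []


def build_extractor(templates):
    builds.append(len(templates))
    return lambda page: [(template, page) for template in templates]


scraper = CachingScraper(build_extractor)
scraper.add_template("t1")
scraper.scrape("page-a")
scraper.scrape("page-b")   # reuses the extractor built for page-a
scraper.add_template("t2")
scraper.scrape("page-c")   # rebuilt once, because training invalidated it
```

With this shape, the cost of building the extractor is paid once per training step rather than once per scraped page.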

A failing test has been fixed and no longer needs to make an HTTP request.

pablohoffman commented 12 years ago

Have you checked that the scrapely command line tool (python -m scrapely.tool) still works after this change?

shaneaevans commented 12 years ago

The change is API compatible, so unless the tool relies on private functions, it should be fine. I checked some basic usage and it was OK (although, really, this should be automated).

shaneaevans commented 12 years ago

I guess we should also 'fix' the tool. It requires users to tell it the encoding or it assumes utf8, whereas it should work out the encoding itself in the default case. I'll work up a patch.
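Roughly, the default case could look something like the following. This is a standard-library sketch under my own assumptions, not the actual patch: `guess_encoding` and `decode_page` are hypothetical names, and the lookup order (explicit encoding, then the Content-Type header, then a meta tag, then utf-8) is just one reasonable choice.

```python
import codecs
import re

# charset=... in a Content-Type header value
_HEADER_CHARSET = re.compile(r"charset=[\"']?([\w.:-]+)", re.I)
# charset=... inside a <meta> tag near the top of the document
_META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w.:-]+)", re.I)


def _normalize(name):
    """Return Python's canonical codec name, or None if it is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def guess_encoding(body, content_type=None, default="utf-8"):
    """Guess the encoding of a raw HTML body (bytes)."""
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            name = _normalize(match.group(1))
            if name:
                return name
    # Only look at the start of the body; declarations belong in <head>.
    match = _META_CHARSET.search(body[:4096])
    if match:
        name = _normalize(match.group(1).decode("ascii", "replace"))
        if name:
            return name
    return default


def decode_page(body, content_type=None, encoding=None):
    """Decode bytes to text; an explicit encoding always wins."""
    chosen = encoding or guess_encoding(body, content_type)
    return body.decode(chosen, errors="replace")
```

Users who pass an encoding keep the current behaviour; everyone else gets a guess instead of a blind utf8 assumption.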

I also noticed that the example in the README is broken: a 0 Scrapy project -n 1 -f author doesn't work for me.