propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License

Create ScrapedPage object #32

Open jeremybmerrill opened 10 years ago

jeremybmerrill commented 10 years ago

A ScrapedPage object is what Scraper#scrape would yield, instead of the HTML, the URL, the instance page's index, etc.

This ScrapedPage object -- which might inherit from Nokogiri::HTML -- would contain the raw HTML, the parsed HTML, the URL, the index page from which the instance page was linked (if present), a reference to the index page's ScrapedPage object, and the instance page's index (i.e. its ordinal position among the pages linked to from the index page).
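A rough sketch of the shape described above, as a plain Struct rather than a Nokogiri subclass. The field names are assumptions for illustration, not Upton's actual API, and `parsed_html` is left unset here so the snippet runs without Nokogiri installed.

```ruby
# Hypothetical sketch: field names are illustrative, not Upton's API.
ScrapedPage = Struct.new(
  :raw_html,       # the HTML string as fetched
  :parsed_html,    # the parsed document, e.g. Nokogiri::HTML(raw_html)
  :url,            # URL the page was fetched from
  :index_page,     # ScrapedPage of the linking index page, or nil
  :instance_index, # ordinal position among pages linked from the index page
  keyword_init: true
)

# An index page has no parent index page, so index_page stays nil.
index = ScrapedPage.new(
  raw_html: "<a href='/a'>A</a>",
  url: "http://example.com/"
)

# An instance page keeps a reference back to the index page that linked it.
page = ScrapedPage.new(
  raw_html: "<h1>A</h1>",
  url: "http://example.com/a",
  index_page: index,
  instance_index: 0
)
```

With this shape, a block receiving `page` can reach the index page's URL via `page.index_page.url` without extra block arguments.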

This would be a breaking change, so it is further away from landing in stable Upton.

jeremybmerrill commented 10 years ago

Implemented in future (for 1.0.0) in https://github.com/propublica/upton/commit/31cbf413583816c138f9228eed3688333096cd9b

Will be minimally breaking, since missing methods on Page are passed through to Nokogiri::HTML.
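For readers unfamiliar with the pass-through idea, here is a minimal sketch of forwarding missing methods to a wrapped document. It is not the code from the linked commit. `FakeDocument` is a hypothetical stand-in for a parsed Nokogiri document so the snippet runs without the gem, and the `Page` constructor arguments are assumptions.

```ruby
# Stand-in for a parsed Nokogiri document (hypothetical, for illustration).
class FakeDocument
  def initialize(html)
    @html = html
  end

  # Returns the text of the <title> element, or nil if there is none.
  def title
    @html[%r{<title>(.*?)</title>}m, 1]
  end
end

class Page
  attr_reader :url, :instance_index

  def initialize(doc, url, instance_index)
    @doc = doc
    @url = url
    @instance_index = instance_index
  end

  # Forward any method Page does not define to the wrapped document.
  def method_missing(name, *args, &block)
    if @doc.respond_to?(name)
      @doc.public_send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    @doc.respond_to?(name) || super
  end
end

page = Page.new(FakeDocument.new("<title>Hello</title>"), "http://example.com/a", 0)
page_title = page.title # forwarded to the wrapped document
```

Code that previously called document methods on the yielded HTML keeps working on `page`, while `page.url` and `page.instance_index` are available as ordinary attributes.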

Maybe I should implement this even-less-breakingly in 0.4.0 by still passing the instance_index, instance_url, etc. attrs through to blk.call?
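One way the 0.4.0 idea could look, as a sketch only: pass the page object first and keep the old attributes as trailing arguments. `LegacyPage` and `scrape_each` are hypothetical names, not Upton methods. This relies on Ruby blocks (non-lambda procs) silently dropping extra arguments.

```ruby
# Hypothetical page object carrying the attributes mentioned above.
LegacyPage = Struct.new(:html, :url, :instance_index)

# Calls the block with the page plus the older positional attributes.
def scrape_each(pages, &blk)
  pages.map { |page| blk.call(page, page.url, page.instance_index) }
end

pages = [LegacyPage.new("<h1>A</h1>", "http://example.com/a", 0)]

# A block that still expects the url and index arguments keeps working.
old_style = scrape_each(pages) { |_page, url, index| [url, index] }

# A block that only wants the page object ignores the extras.
new_style = scrape_each(pages) { |page| page.url }
```

One caveat: a lambda passed with `&` keeps strict arity, so a one-argument lambda would raise ArgumentError here.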