propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License

Helper methods for scraping one page and for scraping multiple #31

Open jeremybmerrill opened 10 years ago

jeremybmerrill commented 10 years ago

That Scraper.new takes EITHER a URL and a selector OR an array of URLs is confusing. We should keep both forms on new for backwards compatibility, but add a helper method for each pattern -- and use those helper methods in the README.

This will hopefully allay some of the confusion in #30 and address the API problems that were mentioned in #5 without such a dramatic refactor.

jeremybmerrill commented 10 years ago

Scraper#index will return a Scraper instance (with the actual fetching perhaps deferred until later) on which a #scrape call will fetch the links that the selector expression finds on the index page. Scraper#instances will return a Scraper instance on which a #scrape call will fetch the links given in the argument to #instances.
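To make the two patterns concrete, here is a minimal sketch of how such helpers could sit on top of a single constructor, with fetching deferred until #scrape. This is not Upton's code: the class name `SketchScraper`, the `fetcher` and `link_finder` callables, and the example.com URLs are all assumptions made up for illustration, and the `link_finder` lambda stands in for real CSS-selector matching.

```ruby
# Hypothetical sketch only; SketchScraper is not part of Upton.
class SketchScraper
  # Pattern 1: an index URL plus a selector that locates the links on it.
  # Nothing is fetched here; the block runs later, inside #scrape.
  def self.index(index_url, selector, fetcher:, link_finder:)
    new(fetcher: fetcher) { link_finder.call(fetcher.call(index_url), selector) }
  end

  # Pattern 2: an explicit list of instance URLs, no index page at all.
  def self.instances(urls, fetcher:)
    new(fetcher: fetcher) { urls }
  end

  # fetcher: any callable mapping a URL to a page body (assumed interface).
  # url_source: a block that yields the list of URLs when called.
  def initialize(fetcher:, &url_source)
    @fetcher = fetcher
    @url_source = url_source
  end

  # Resolves the URL list, fetches each page, and passes (body, url) to the block.
  def scrape
    @url_source.call.map { |url| yield(@fetcher.call(url), url) }
  end
end

# Canned pages instead of real HTTP, so the sketch runs offline.
PAGES = {
  "http://example.com/index"  => "a.html b.html",
  "http://example.com/a.html" => "page A",
  "http://example.com/b.html" => "page B"
}.freeze

requests = []
fetcher = ->(url) { requests << url; PAGES.fetch(url) }
# Stand-in for selector matching: treats the index body as a list of paths.
link_finder = ->(body, _selector) { body.split.map { |path| "http://example.com/#{path}" } }

by_index = SketchScraper.index("http://example.com/index", "a.story", fetcher: fetcher, link_finder: link_finder)
by_list  = SketchScraper.instances(["http://example.com/b.html"], fetcher: fetcher)

by_index.scrape { |body, url| [url, body] }
by_list.scrape  { |body, _url| body }
```

In this sketch `requests` stays empty until the first #scrape call, which is the deferred behavior described above. Both helpers funnel into the same constructor, so the old calling convention could keep working alongside them.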

jeremybmerrill commented 10 years ago

I think for 1.0.0 the Scraper returned by "index" will immediately fetch the index page, so that the Scraper can be added to other scrapers, see #35. For now, it'll still only be fetched on #scrape.

jeremybmerrill commented 10 years ago

I changed my mind in the last 31 minutes.

For 0.4.0 the semantics of #initialize will change. The index page will be scraped immediately. However, the syntax will not change.

jeremybmerrill commented 10 years ago

Hmm, if it makes requests on the first call (e.g. Scraper.new, Scraper.index), when are options set? I guess as a hash on that first call? That'll be a breaking change. So I'll queue that up for 1.0.0.
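A rough sketch of why eager fetching pushes options into the first call: if the constructor itself makes the index request, any option set afterwards arrives too late to affect that request. Everything here is hypothetical (the `EagerSketchScraper` class, the `verbose` and `pause_seconds` option names, and the `fetcher`/`link_finder` callables); it is not Upton's API.

```ruby
# Hypothetical sketch only; EagerSketchScraper and its option names are made up.
class EagerSketchScraper
  DEFAULT_OPTIONS = { verbose: false, pause_seconds: 0 }.freeze

  attr_reader :options, :urls

  # options is a plain hash supplied on the first (and only) setup call.
  def initialize(index_url, fetcher, link_finder, options = {})
    unknown = options.keys - DEFAULT_OPTIONS.keys
    raise ArgumentError, "unknown options: #{unknown.inspect}" unless unknown.empty?

    @options = DEFAULT_OPTIONS.merge(options)
    @fetcher = fetcher
    # Eager: the index page is requested right here, so the options must
    # already be final. A setter called after .new could not influence this.
    @urls = link_finder.call(@fetcher.call(index_url, @options))
  end

  # Fetches each instance page with the same options and yields (body, url).
  def scrape
    @urls.map { |url| yield(@fetcher.call(url, @options), url) }
  end
end

# Canned pages instead of real HTTP, so the sketch runs offline.
pages = { "index" => "a b", "a" => "page A", "b" => "page B" }
seen = []
fetcher = ->(url, opts) { seen << [url, opts[:verbose]]; pages.fetch(url) }
link_finder = ->(body) { body.split }

scraper = EagerSketchScraper.new("index", fetcher, link_finder, { verbose: true })
scraper.scrape { |body, _url| body }
```

Here `seen` already contains the index request, made with `verbose: true`, as soon as `.new` returns. That is the sense in which moving options onto the first call would be a breaking change for anyone who currently sets them afterwards.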

jeremybmerrill commented 10 years ago

Mostly implemented in future (1.0.0) at https://github.com/propublica/upton/commit/a25e84e798d1c9f6175ea6fc7923ae48e43c5fbd

Partially implemented for 0.4.0 at https://github.com/propublica/upton/commit/24cb65ea3eb1e42b681bb479f1c4797eb8959ae5