Open jeremybmerrill opened 10 years ago
Scraper#index will return a Scraper instance (perhaps with the actual fetching deferred until later) on which a #scrape call will fetch the links on the index specified by the selector expression. Scraper#instances will return a Scraper instance on which a #scrape call will fetch the links specified in the argument to #instances.
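A rough sketch of the deferred behavior described above, not Upton's actual code. LazyScraper, its fetcher: argument, the PAGES fixture and the regex standing in for a real selector are all made up for illustration; the only point is that neither constructor makes a request until #scrape is called.

```ruby
# Hypothetical sketch; LazyScraper is NOT Upton's Scraper class.
class LazyScraper
  # A regex stands in for a real CSS/XPath selector (assumption).
  LINK_PATTERN = /href="([^"]+)"/

  # fetcher: any callable that takes a URL and returns the page body.
  def self.index(index_url, link_pattern = LINK_PATTERN, fetcher:)
    # The index page is not fetched here; this block only runs on #scrape.
    new(fetcher) { fetcher.call(index_url).scan(link_pattern).flatten }
  end

  def self.instances(urls, fetcher:)
    new(fetcher) { urls }
  end

  def initialize(fetcher, &url_source)
    @fetcher = fetcher
    @url_source = url_source
  end

  # Resolve the instance URLs, fetch each one, and yield (body, url).
  def scrape
    @url_source.call.map { |url| yield(@fetcher.call(url), url) }
  end
end

# Canned pages so the sketch runs without touching the network.
PAGES = {
  "http://example.com/index" =>
    '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
  "http://example.com/a" => "page A",
  "http://example.com/b" => "page B"
}.freeze

requests = []
fetcher = ->(url) { requests << url; PAGES.fetch(url) }

scraper = LazyScraper.index("http://example.com/index", fetcher: fetcher)
# No request has been made yet; the index is only fetched inside #scrape.
bodies = scraper.scrape { |body, _url| body.upcase }
```

Holding the URL source in a block keeps both constructors on the same code path, which is one way the two patterns could share a single #scrape.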
I think for 1.0.0 the Scraper returned by "index" will immediately fetch the index page, so that the Scraper can be added to other scrapers, see #35. For now, it'll still only be fetched on #scrape.
I changed my mind in the last 31 minutes.
For 0.4.0 the semantics of #initialize will change. The index page will be scraped immediately. However, the syntax will not change.
Hmm, if it makes requests on the first call (e.g. Scraper.new, Scraper.index), when are options set? I guess as a hash on that first call? That'll be a breaking change. So I'll queue that up for 1.0.0.
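To make the ordering problem concrete, here is a minimal sketch, again not Upton's real API. EagerScraper, the fetcher callable and the option names (pause_seconds, verbose) are hypothetical. Because the constructor makes the first request itself, the options have to arrive with that same call.

```ruby
# Hypothetical sketch; EagerScraper and its option names are made up.
class EagerScraper
  DEFAULTS = { pause_seconds: 30, verbose: false }.freeze

  attr_reader :options, :instance_urls

  # fetcher: any callable taking (url, options) and returning the page body.
  def initialize(index_url, fetcher, options = {})
    # Merge options first: the request on the next line already needs them.
    @options = DEFAULTS.merge(options)
    @instance_urls =
      fetcher.call(index_url, @options).scan(/href="([^"]+)"/).flatten
  end
end

seen = []
fetcher = lambda do |url, options|
  # Record which settings were in effect when the request was made.
  seen << [url, options[:pause_seconds]]
  '<a href="http://example.com/a">A</a>'
end

scraper = EagerScraper.new("http://example.com/index", fetcher, { pause_seconds: 0 })
```

A setter called after .new would come too late to affect the index request, which is why passing a hash on the first call ends up being a breaking change.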
Mostly implemented in future (1.0.0) at https://github.com/propublica/upton/commit/a25e84e798d1c9f6175ea6fc7923ae48e43c5fbd
Partially implemented for 0.4.0 at https://github.com/propublica/upton/commit/24cb65ea3eb1e42b681bb479f1c4797eb8959ae5
That Scraper.new takes EITHER a url and a selector OR an array of URLs is confusing. Should keep both on new for backwards compatibility, but add a helper method for each pattern -- and use those helper methods in the README. This will hopefully allay some of the confusion in #30 and address the API problems that were mentioned in #5 without such a dramatic refactor.
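Roughly what I have in mind, as a sketch rather than the real implementation. CompatScraper is a hypothetical stand-in that only mirrors the calling conventions discussed in this issue: the constructor keeps accepting both shapes, and each helper names one of them.

```ruby
# Hypothetical sketch; CompatScraper only mirrors the calling conventions
# discussed in this issue and is NOT Upton's Scraper class.
class CompatScraper
  attr_reader :index_url, :selector, :instance_urls

  # Helper for the "index page plus selector" pattern.
  def self.index(index_url, selector)
    new(index_url, selector)
  end

  # Helper for the "explicit list of URLs" pattern.
  def self.instances(urls)
    new(urls)
  end

  # Still accepts either shape, so existing callers keep working.
  def initialize(index_url_or_urls, selector = nil)
    if index_url_or_urls.is_a?(Array)
      @instance_urls = index_url_or_urls
    else
      @index_url = index_url_or_urls
      @selector = selector
    end
  end
end

from_index = CompatScraper.index("http://example.com/index", "a.story")
from_list  = CompatScraper.instances(["http://example.com/a"])
legacy     = CompatScraper.new(["http://example.com/a"])
```

The helpers add no behavior of their own. They just give each pattern a name that the README can use, while the type check stays in one place inside new.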