nrabinowitz / pjscrape

A web-scraping framework written in Javascript, using PhantomJS and jQuery
http://nrabinowitz.github.io/pjscrape/
MIT License

Early suite exit #7

Open nrabinowitz opened 12 years ago

nrabinowitz commented 12 years ago

Use case:

Let's call http://www.example.com/ "root". "root" contains links to root.1, root.2, root.3 ... root.250 (see hermitageart.com for an actual example, with 260 links!). Each of these 250 pages contains links to other pages. If my feature of interest were found only in root.3 and root.102, then ideally root.4, root.5, ... root.250 would not be accessed, i.e. page.open should not be called on them.

I think this would need to be addressed by setting a flag (maybe on the _pjs.state object?) to end the suite early. The flag could be checked in the page completion callback, which would then empty out the array of still-to-scrape pages. Question: this only affects the current level of recursion. Is that good? Do we need an early exit from the entire suite?
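
A rough sketch of the flag idea, to make the control flow concrete. This is not pjscrape's actual code: runSuite, state.exitEarly, pending and onPageComplete are all hypothetical names, and the "scrape" here is a plain synchronous function rather than a real page.open call.

```javascript
// Hypothetical stand-ins; none of these names are pjscrape internals.
function runSuite(urls, scrapePage) {
    var state = { exitEarly: false };   // flag a scraper can set
    var pending = urls.slice();         // still-to-scrape pages
    var visited = [];                   // pages that were actually opened

    function onPageComplete() {
        // Checked in the page completion callback: if the flag is set,
        // empty the queue so no further pages are opened.
        if (state.exitEarly) {
            pending.length = 0;
        }
        next();
    }

    function next() {
        if (!pending.length) return;
        var url = pending.shift();
        visited.push(url);
        scrapePage(url, state);         // stands in for page.open + scraper
        onPageComplete();
    }

    next();
    return visited;
}

// Usage: the scraper asks for an early exit once it reaches root.3.
var visited = runSuite(
    ['root.1', 'root.2', 'root.3', 'root.4', 'root.5'],
    function (url, state) {
        if (url === 'root.3') state.exitEarly = true;
    }
);
console.log(visited);
```

As in the question above, this only stops the queue of the suite it runs in; any parent level of recursion would carry on with its own queue.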

nrabinowitz commented 12 years ago

Better option here:

It's relatively simple to end the current suite (set its urls array to []) and to end all suites (that, plus setting suites = []). Killing the ancestor of the current suite in a recursive situation might be more difficult. It's worth thinking about whether I'd want/need an actual tree structure to manage the suites if I wanted more fine-grained control.
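
A minimal sketch of those two exits, assuming a runner that holds a list of suites, each with a urls array. createRunner, endCurrentSuite, endAllSuites and makeSuites are hypothetical names for illustration, not part of pjscrape, and the ancestor-suite case is deliberately left out.

```javascript
// Hypothetical runner; the shape of `suites` and `urls` is assumed.
function createRunner(suites) {
    var runner = {
        suites: suites,
        current: null,
        visited: [],
        // End only the suite that is running now.
        endCurrentSuite: function () {
            if (runner.current) runner.current.urls = [];
        },
        // End the current suite and drop every suite still waiting.
        endAllSuites: function () {
            runner.endCurrentSuite();
            runner.suites = [];
        },
        run: function (scrapePage) {
            while (runner.suites.length) {
                runner.current = runner.suites.shift();
                while (runner.current.urls.length) {
                    var url = runner.current.urls.shift();
                    runner.visited.push(url);
                    scrapePage(url, runner);   // stands in for page.open
                }
            }
            runner.current = null;
            return runner.visited;
        }
    };
    return runner;
}

function makeSuites() {
    return [
        { urls: ['a.1', 'a.2', 'a.3'] },
        { urls: ['b.1', 'b.2'] }
    ];
}

// Ending the current suite skips a.2 and a.3 but still runs suite b.
var endedOne = createRunner(makeSuites()).run(function (url, runner) {
    if (url === 'a.1') runner.endCurrentSuite();
});

// Ending all suites stops after the first page.
var endedAll = createRunner(makeSuites()).run(function (url, runner) {
    if (url === 'a.1') runner.endAllSuites();
});

console.log(endedOne, endedAll);
```

Because the suites here form a flat list, there is no notion of a parent suite to kill; that is where a tree structure would come in.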