postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License
800 stars, 109 forks

Limit crawl to links matching pattern #59

Closed: bricemaurin closed this issue 7 years ago

bricemaurin commented 7 years ago

Hi,

I want to crawl a website in less time, so I'm trying to limit the crawl to only the pages I really need. To do so, I'd like to implement two rules:

  1. Limit to 10 pages max
  2. Limit to links whose anchor text matches a regex

Is this possible with Spidr?

Thanks a lot, Brice

bricemaurin commented 7 years ago

I just found the "limit" param, which solves point 1, but I still can't figure out how to limit the pages crawled to those matching a regex.

Any ideas?

robfuller commented 7 years ago

Check out ignore_links (http://spidr.rubyforge.org/docs/Spidr/Filters.html#ignore_links-instance_method), which excludes any link matching one of an array of regexes. You can also set this in the agent options, e.g. agent_options[:ignore_links].

Alternatively, check out visit_links (http://spidr.rubyforge.org/docs/Spidr/Filters.html#visit_links-instance_method) to set an array of regexes so that ONLY matching links are followed. Again, you can also set this in the agent options, e.g. agent_options[:links].
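
For illustration, here is a minimal sketch that combines the "limit" param with the links option. The start URL, the /products/ pattern, and the names PRODUCT_LINK, CRAWL_OPTIONS and crawl are all made up for the example, not part of Spidr. One caveat: as far as I know, these filters are matched against each link's URL, not its anchor text, so this only covers rule 2 if the pages you need can be recognised from their URLs.

```ruby
# Hypothetical pattern: only follow links whose URL contains /products/.
PRODUCT_LINK = %r{/products/}

# Options handed to Spidr: stop after 10 pages, follow only matching links.
# Swap :links for :ignore_links to exclude matching links instead.
CRAWL_OPTIONS = {
  limit: 10,
  links: [PRODUCT_LINK]
}.freeze

# Crawls the site at start_url and returns the URLs of the pages visited.
# (crawl is a helper name invented for this sketch.)
def crawl(start_url, options = CRAWL_OPTIONS)
  # Loaded here so the file still parses where the gem is missing;
  # install it with: gem install spidr
  require 'spidr'

  visited = []
  Spidr.site(start_url, **options) do |spider|
    # every_page runs once for each page the spider fetches.
    spider.every_page { |page| visited << page.url.to_s }
  end
  visited
end
```

Calling crawl('http://example.com/') would then return at most ten URLs. Keeping the options in a plain hash makes it easy to try different regexes without touching the crawl code.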

bricemaurin commented 7 years ago

Thanks!