postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

How can I 'ignore everything except' a set of links #42

Closed DHarls17 closed 8 years ago

DHarls17 commented 8 years ago

In the Examples section, I've got the 'Do not spider certain links' operation to work.

In my case, Spidr.site('http://www.parkers.co.uk/', :ignore_links => [/vans/])

correctly displays all URLs except those starting with /vans/.
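One thing worth checking here: /vans/ is an unanchored regular expression, so it matches "vans" anywhere in the string it is tested against, not only at the start. The sketch below uses plain Ruby and made-up sample paths (they are not taken from the site) to show the difference between the unanchored pattern and one anchored to the start of the path.

```ruby
# Hypothetical sample paths for illustration only.
paths = ['/vans/ford/transit/', '/caravans/reviews/', '/cars/vans-vs-estates/']

unanchored = /vans/         # matches "vans" anywhere in the string
anchored   = %r{\A/vans/}   # matches only when the string starts with /vans/

unanchored_hits = paths.select { |path| path.match?(unanchored) }
anchored_hits   = paths.select { |path| path.match?(anchored) }

puts "unanchored: #{unanchored_hits.inspect}"
puts "anchored:   #{anchored_hits.inspect}"
```

With these samples the unanchored pattern matches all three paths, while the anchored one matches only the first.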

However, I really need to also 'ignore' all links EXCEPT for /vans/, so that ONLY /vans/ urls are displayed.

Is this possible?

There are too many possibilities to just add them all to the 'ignore_links' list; I really need to 'ignore everything except' /vans/.

thanks!

robfuller commented 8 years ago

Newbie here, but I think using "links" instead of "ignore_links" will do it for you.

DHarls17 commented 8 years ago

Hi robfuller

thanks for the reply, but the issue I'm having is that I want to ignore many different sets of links, way too many to individually 'ignore_links' them.

So rather than list them all (50+) using the 'ignore_links' option, I'm trying to find an equivalent of an 'ignore all links except'.

robfuller commented 8 years ago

Maybe I am not understanding:

require 'spidr'

url  = "http://www.hartehanks.com/about-harte-hanks"
opts = { :links => [/about-harte-hanks/] }

# Only follow links matching the :links patterns and print each page visited
Spidr.start_at(url, opts) do |spider|
  spider.every_page do |page|
    puts page.url.inspect
  end
end

Output:

#<URI::HTTP http://www.hartehanks.com/about-harte-hanks>
#<URI::HTTP http://www.hartehanks.com/about-harte-hanks/leadership>
#<URI::HTTP http://www.hartehanks.com/about-harte-hanks/digital>
#<URI::HTTP http://www.hartehanks.com/about-harte-hanks/community-service>

Wherever you start needs to match the links list; otherwise there is nothing to scan.
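A quick way to catch that situation before crawling is to test the start URL against the same patterns. This is only a sketch in plain Ruby. It assumes the patterns are compared against the full URL string, which I have not verified against Spidr's internals.

```ruby
# Assumption: each pattern is tested against the full URL string.
start_url = 'http://www.hartehanks.com/about-harte-hanks'
links     = [/about-harte-hanks/]

# True when at least one pattern matches the starting URL
start_matches = links.any? { |pattern| start_url.match?(pattern) }

warn 'start URL matches none of the :links patterns' unless start_matches
```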

postmodern commented 8 years ago

Oh, so I ended up implementing the accept/reject filters kind of weird. https://github.com/postmodern/spidr/blob/00f0ff1864d56bad1352aa43a925bc645f777351/lib/spidr/rules.rb#L41-L47
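For anyone reading along, here is a simplified model of what "kind of weird" could mean. It is not the Spidr source. It assumes (from my reading of the linked lines, which may not be exact) that when any accept rules are present, only those are consulted and the reject rules are skipped. The RuleSketch class and the sample URLs are made up for illustration.

```ruby
# Simplified stand-in for an accept/reject rule set; NOT the Spidr source.
class RuleSketch
  def initialize(accept: [], reject: [])
    @accept = accept
    @reject = reject
  end

  # Assumed behaviour: if accept rules exist, the reject rules are never checked.
  def accept?(data)
    if @accept.empty?
      @reject.none? { |rule| matches?(rule, data) }
    else
      @accept.any? { |rule| matches?(rule, data) }
    end
  end

  private

  # A rule may be a Proc, a Regexp, or a literal value to compare against.
  def matches?(rule, data)
    case rule
    when Proc   then rule.call(data) == true
    when Regexp then data.to_s.match?(rule)
    else data == rule
    end
  end
end

rules = RuleSketch.new(accept: [/vans/], reject: [/transit/])

transit_ok = rules.accept?('http://www.parkers.co.uk/vans/ford/transit/')
cars_ok    = rules.accept?('http://www.parkers.co.uk/cars/')

puts "transit accepted: #{transit_ok}, cars accepted: #{cars_ok}"
```

Under that assumption the transit URL is accepted even though a reject rule matches it, which is why accept and reject patterns do not combine the way you might expect.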

To solve your immediate problem, I would do something like:

require 'spidr'

Spidr.site("http://www.parkers.co.uk/vans/") do |agent|
  # Ignore every URL whose path does not begin with /vans/
  agent.ignore_urls_like do |url|
    !url.path.start_with?('/vans/')
  end
end
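The predicate in that block can be tried on its own without crawling anything. The sketch below applies the same test to a few URI objects using only the standard library. The sample URLs and the outside_vans name are hypothetical.

```ruby
require 'uri'

# Same test as the block above: true means "ignore this URL".
outside_vans = ->(url) { !url.path.start_with?('/vans/') }

# Hypothetical sample URLs for illustration only.
samples = %w[
  http://www.parkers.co.uk/vans/ford/transit/
  http://www.parkers.co.uk/cars/
  http://www.parkers.co.uk/caravans/
].map { |s| URI(s) }

# Drop everything the predicate would ignore
kept = samples.reject(&outside_vans).map(&:to_s)

puts kept
```

Because start_with? checks the beginning of the path, /caravans/ is dropped along with /cars/ and only the /vans/ URL remains.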

Ideally, it should be possible to specify links: [%r{^/vans/}], ignore_links: [/./] together. I will probably add that to the spidr 1.0 roadmap.