postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

unable to ignore links #60

Closed vanegomez closed 7 years ago

vanegomez commented 7 years ago

cool gem!

I'm trying to ignore /partners/ and everything under it on my site (e.g. www.mysite.com/partners/resellers), but the spider is still going to those links.

    root = args[:url]

    url_map = Hash.new { |hash, key| hash[key] = [] }

    spider = Spidr.site(root, ignore_links_like: [%{^/partners/}]) do |spider|
      spider.every_url { |url| puts url }
      spider.every_failed_url { |url| puts "Failed url #{url}" }
      spider.every_link do |origin, dest|
        url_map[dest] << origin
      end
    end

    spider.failures.each do |url|
      puts "Broken link #{url} found in:"

      url_map[url].each { |page| puts "  #{page}".red }
    end
postmodern commented 7 years ago

In spidr, links are the String version of the full URL. You appear to want to ignore links based on the path. Maybe something like:

spider.ignore_urls_like { |url| url.path.start_with?('/partners/') }

I should probably add ignore_paths_like to cover that use-case.
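For reference, a minimal sketch of the path-matching predicate behind that suggestion, tested against plain `URI` objects (the `ignore_partners` lambda and the example URLs are ours, not part of spidr):

```ruby
require 'uri'

# Predicate matching the suggestion above: ignore any URL whose parsed
# path begins with /partners/ (matching on the path, not the full URL
# string, which is why a bare ^/partners/ pattern never matched).
ignore_partners = ->(url) { url.path.start_with?('/partners/') }

# Inside the Spidr.site block from the question this would be wired up as:
#
#   spider.ignore_urls_like { |url| url.path.start_with?('/partners/') }

puts ignore_partners.call(URI('http://www.mysite.com/partners/resellers'))  # true
puts ignore_partners.call(URI('http://www.mysite.com/about'))               # false
```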

vanegomez commented 7 years ago

@postmodern Thank you so much for answering.

Is it possible to follow external links and check if they are broken?

postmodern commented 7 years ago

You would have to explicitly call spider.get_page and check the responses, since the spider won't automatically follow off-site links.
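One way to sketch that: classify each destination link as off-site, then fetch it explicitly. The `external?` helper below is ours (not part of spidr), and the commented block is an assumption about how it would plug into the `every_link` callback from the original snippet:

```ruby
require 'uri'

# Returns true when `url` points off-site relative to `site_host`.
# (Helper name and signature are ours, for illustration.)
def external?(url, site_host)
  URI(url.to_s).host != site_host
end

# Sketch of the approach described above (not run here): inside the
# Spidr.site block, fetch each off-site link explicitly, since the
# spider itself stays on-site. get_page may return nil on failure,
# so check both the nil case and the response code.
#
#   spider.every_link do |origin, dest|
#     if external?(dest, URI(root).host)
#       page = spider.get_page(dest)
#       puts "Broken external link #{dest} on #{origin}" if page.nil? || page.code >= 400
#     end
#   end

puts external?('http://other.example.com/x', 'www.mysite.com')  # true
puts external?('http://www.mysite.com/x', 'www.mysite.com')     # false
```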

vanegomez commented 7 years ago

thank you!