postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

`ignore_links` not working. #64

Closed: vwochnik closed this issue 6 years ago

vwochnik commented 6 years ago

Hello. I am loving this library! But I have an issue.

I am collecting the URLs of already-scraped pages in an array so that, when resuming the process later, I can pass them to `ignore_links` to skip them.

However, it's not working. The URLs are collected via `page.url` and are later fed into `ignore_links` as absolute URL strings. The page I am scraping references its content through relative links.

```ruby
linkregs = [] # regexes for the links: rule; these work fine
ignore = []   # URLs of already-scraped pages, read from a file
Spidr.start_at("http://example.com", links: linkregs, ignore_links: ignore) do |spidr|
  spidr.every_page do |page|
    if ignore.include?(page.url.to_s)
      # this is the problem: the page was visited despite being in ignore_links
      puts "Error!!"
    end
    ignore.push(page.url.to_s)
  end
end
# save ignore to file
```
vwochnik commented 6 years ago

Fixed by no longer using `ignore_links` and instead passing a single Proc as the `links:` rule, so my custom logic decides whether or not to crawl each link. It should be mentioned somewhere, though, that once an accept rule is truthy, all reject rules are ignored.
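For reference, a minimal sketch of that workaround (the `ignore.txt` resume file and the `linkregs` pattern are illustrative assumptions, not taken from the code above):

```ruby
require 'spidr'

# Hypothetical resume file holding the URLs of already-scraped pages.
ignore = File.exist?("ignore.txt") ? File.readlines("ignore.txt", chomp: true) : []

linkregs = [%r{\Ahttp://example\.com/}] # illustrative accept patterns

# A single Proc as the links: rule. All accept/reject logic lives in one
# place, so the "accept rule wins" behavior cannot bypass the ignore list.
link_rule = proc do |url|
  url = url.to_s
  linkregs.any? { |re| re.match?(url) } && !ignore.include?(url)
end

Spidr.start_at("http://example.com", links: [link_rule]) do |spidr|
  spidr.every_page do |page|
    ignore.push(page.url.to_s)
  end
end

File.write("ignore.txt", ignore.join("\n")) # persist for the next run
```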

postmodern commented 6 years ago

@vwochnik do you think spidr should check the reject rules as well, even when one of the accept rules matches?

vwochnik commented 6 years ago

I don't know how this should work. In the edge case where both an accept and a reject rule match, the result could legitimately be either true or false, depending on whether the user wants the accept or the reject rule to take priority. So either fix a priority between the two, or add a setting that controls whether accept or reject rules win.

postmodern commented 6 years ago

@vwochnik the relevant code: https://github.com/postmodern/spidr/blob/44fa099e80e4cb3a8604c5dffe7612a4c2999fcb/lib/spidr/rules.rb#L41-L47
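For context, a sketch of the reject-priority variant being discussed (the `@accept`/`@reject` instance variables and the `test_data` helper are assumed to match the linked `Rules` class; this is illustrative, not an actual patch):

```ruby
module Spidr
  class Rules
    # Sketch: check reject rules first, so an ignore_links match always
    # wins, even when one of the accept rules also matches.
    def accept?(data)
      return false if @reject.any? { |rule| test_data(data, rule) }

      # With no accept rules configured, everything not rejected passes.
      return true if @accept.empty?

      @accept.any? { |rule| test_data(data, rule) }
    end
  end
end
```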