postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Any way to automatically obey robots.txt? #11

Closed: cbmeeks closed this issue 8 years ago

cbmeeks commented 14 years ago

I guess I could load the robots.txt file of a site myself, but is there a way to turn this on so that it will always follow the rules?

postmodern commented 14 years ago

I suppose I could add optional support for http://github.com/parolkar/obey_robots_dot_txt/.

postmodern commented 14 years ago

Or some similar robots.txt parser.
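
For reference, a robots.txt parser of this sort exposes a small allow/deny query interface. A minimal sketch using the fizx/robots gem that the next comment builds on (the User-Agent and URLs here are placeholders):

require 'robots'

# The parser fetches and caches the site's robots.txt on the first
# query, then answers allow/deny per URL for the given User-Agent.
robots = Robots.new('MyBot/1.0')                # placeholder User-Agent
robots.allowed?('http://example.com/')          # => true unless disallowed
robots.allowed?('http://example.com/private/')  # => false if "Disallow: /private/" applies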

perplexes commented 14 years ago

We've got this code for ours (using http://github.com/fizx/robots):

require 'spidr'
require 'robots'
require 'logger'

UA = 'MyBot/1.0'             # placeholder; substitute your crawler's User-Agent
logger = Logger.new($stderr)

# robots.txt parser from the fizx/robots gem, keyed to our User-Agent
robots = Robots.new(UA)

# url is the site to spider, defined elsewhere in our code
spidr = Spidr.site(url, :user_agent => UA, :ignore_exts => %w(js css jpg jpeg gif png txt)) do |s|
  # Veto any URL that robots.txt disallows for this User-Agent.
  s.visit_urls_like do |url|
    if robots.allowed?(url)
      true
    else
      logger.error "Robots.txt disallowed #{url}"
      false
    end
  end

  #...etc...
end
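
Worth noting for anyone finding this closed issue later: recent Spidr releases appear to ship built-in robots.txt support, which would reduce the manual filter above to a single option. A minimal sketch, assuming a Spidr version with the robots: option (check your version's README before relying on it):

require 'spidr'

# Assumes a Spidr release with a built-in robots: option; on older
# versions, use the manual visit_urls_like filter shown above.
Spidr.site('http://example.com/', robots: true)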