postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

Adds optional support for obeying robots.txt #39

Closed · buren closed this 8 years ago

buren commented 9 years ago

Usage:

# To enable this, add gem 'robots', '~> 0.1' to the Gemfile
Spidr.site(
  'http://matasano.com/',
  :robots => true
)
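
For reference, the dependency mentioned in the comment above would look something like this in the Gemfile (gem name and version constraint taken from that comment; the rest is illustrative):

# Gemfile
source 'https://rubygems.org'

gem 'spidr'
gem 'robots', '~> 0.1'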

Closes #11

postmodern commented 8 years ago

Eh, I prefer to ignore robots.txt and spider everything. However, I do like the idea of providing the option. Instead of adding it to allowed_links, I think it should be added to Spidr::Agent#visit?.
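
A rough sketch of what moving the check into Spidr::Agent#visit? could look like; the robot_allowed? helper name is hypothetical, and the surrounding filter calls are abbreviated from the existing method rather than copied from the patch:

class Spidr::Agent
  def visit?(url)
    !visited?(url) &&
      visit_scheme?(url.scheme) &&
      visit_host?(url.host) &&
      visit_port?(url.port) &&
      visit_link?(url.to_s) &&
      visit_ext?(url.path) &&
      robot_allowed?(url.to_s)  # hypothetical helper wrapping the robots.txt check
  end

  # Hypothetical helper: returns true when robots.txt support is disabled,
  # otherwise defers to the robots gem.
  def robot_allowed?(url)
    @robots ? @robots.allowed?(url) : true
  end
end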

buren commented 8 years ago

Something like e2c2883? It's not finished; I haven't added any tests yet. Just wanted to get your feedback first :)

postmodern commented 8 years ago

Yeah, but get rid of the NoRobots null object. Just do a @robots && @robots.allowed?(uri). Note that Robots#allowed? accepts URIs. Also, should robots be all or nothing, or configurable per host?

buren commented 8 years ago

I removed the null object and changed the check to @robots ? @robots.allowed?(uri) : true, so it returns true when options[:robots] is falsy.
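
A minimal sketch of how the option could be wired up when the agent is constructed, assuming the robots gem mentioned earlier; the option handling here is illustrative, not the exact patch:

def initialize(options={})
  if options[:robots]
    begin
      require 'robots'
    rescue LoadError
      raise(ArgumentError, ":robots option given but the 'robots' gem is not installed")
    end

    # Robots.new takes the User-Agent string used when fetching robots.txt
    @robots = Robots.new(options.fetch(:user_agent, 'Ruby'))
  end

  # ... rest of the existing initialization ...
end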

My guess is that in most cases you would either respect it for all hosts or just ignore robots.txt completely. What do you think?