I suppose I could add optional support for http://github.com/parolkar/obey_robots_dot_txt/, or some similar robots.txt parser. A rough sketch of what that might look like is below.
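Just to sketch what "optional support" might look like (nothing like this exists in Spidr today; the `:robots` option, the `RobotsOption` module, and the fallback User-Agent string are all made up for illustration), one could prepend a wrapper around `Spidr.site` that reuses the robots gem:

```ruby
require 'spidr'
require 'robots'

# Hypothetical sketch only: Spidr has no :robots option. This wraps
# Spidr.site so that passing :robots => true filters every URL through
# a robots.txt check (via the robots gem) before the agent visits it.
module RobotsOption
  def site(url, options = {}, &block)
    if options.delete(:robots)
      robots = Robots.new(options[:user_agent] || 'Spidr')

      super(url, options) do |agent|
        # reject any URL that robots.txt disallows for our User-Agent
        agent.visit_urls_like { |u| robots.allowed?(u) }

        block.call(agent) if block
      end
    else
      super
    end
  end
end

Spidr.singleton_class.prepend(RobotsOption)

# usage:
#   Spidr.site('http://example.com/', :robots => true, :user_agent => 'MyBot')
```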
We've got this code in ours (using http://github.com/fizx/robots):
```ruby
require 'spidr'
require 'robots'

robots = Robots.new(UA)  # UA is our User-Agent string

spidr = Spidr.site(url, :user_agent => UA,
                        :ignore_exts => %w(js css jpg jpeg gif png txt)) do |s|
  # filter every candidate URL through robots.txt before visiting it
  s.visit_urls_like do |url|
    if robots.allowed?(url)
      true
    else
      # logger comes from the surrounding application
      logger.error "robots.txt disallowed #{url}"
      false
    end
  end

  # ...etc...
end
```
I guess I could load the robots.txt file of a site myself, but is there a way to turn this on so that it always follows the rules?
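If it helps, the robots gem in the snippet above already handles loading robots.txt for you: as far as I remember it downloads and caches the file per host the first time you call `allowed?` on a URL for that host, so there's no separate loading step. A minimal check (the User-Agent string here is just a placeholder):

```ruby
require 'robots'

robots = Robots.new('MyBot/1.0')  # example User-Agent, use your own

# the gem fetches and caches http://example.com/robots.txt on the first
# check against that host; later calls reuse the cached rules
robots.allowed?('http://example.com/some/page.html')  # => true or false
```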