Closed: chr1s1 closed this issue 8 years ago.
Just as you can blacklist a set of IPs, you can apply the same logic to whitelist IPs, or even ranges. See https://github.com/kickstarter/rack-attack/wiki/Advanced-Configuration#blacklisting-from-railscache for more information.
You could keep the IPs or IP ranges in a persistent store (a file or a database) and, during app boot, write the values to the cache store (flagging them for persistence where the store allows it, as Redis does).
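A minimal sketch of that boot-time step, assuming one IP or CIDR range per line in a file (`blocked_ips.txt`, `load_blocked_ips`, and `blocked?` are made-up names for illustration):

```ruby
require 'ipaddr'

# Load IPs/ranges from a persistent file into memory at boot. With Redis
# as the cache store you could write each entry there instead, so the
# list survives restarts.
def load_blocked_ips(path)
  File.readlines(path, chomp: true)
      .reject(&:empty?)
      .map { |line| IPAddr.new(line) } # accepts both '1.2.3.4' and '10.0.0.0/8'
end

# True when the given request IP falls inside any loaded entry.
def blocked?(list, ip)
  list.any? { |net| net.include?(ip) }
end

# In a Rack::Attack initializer the loaded list could then back a blacklist:
#   LIST = load_blocked_ips('blocked_ips.txt')
#   blacklist('persisted blocklist') { |req| blocked?(LIST, req.ip) }
```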
I think you can tell Google and Yahoo to stay below your limits; that way they'll never get blocked.
Also, I'm pretty sure they'll automatically reduce their crawl rate if they receive a 429.
https://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive
https://support.google.com/webmasters/answer/48620?hl=en
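For reference, a Crawl-delay entry in robots.txt looks like the sketch below. Note that (per the links above) the directive is honored by crawlers such as Yahoo's Slurp and Bing, while Google ignores it; for Googlebot you set the crawl rate through the Webmaster Tools / Search Console setting linked above.

```
# robots.txt -- ask well-behaved crawlers to pause between requests
User-agent: Slurp
Crawl-delay: 10
```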
Hi @chr1s1! Good question. @pisaacs & @sandstrom have good suggestions for effectively allowing crawlers.
To your specific point about whitelisting a range, here's how:
```ruby
require 'ipaddr'

BOTS = IPAddr.new('127.0.0.1/24') # a made-up netblock
whitelist('some bots') { |req| BOTS.include?(req.ip) }
```
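For several crawlers at once, the same pattern extends to a list of netblocks. A sketch; the ranges below are examples only, so look up the crawlers' published ranges and verify them (e.g. via reverse DNS) before trusting them:

```ruby
require 'ipaddr'

# Example netblocks -- placeholders, verify against the crawlers'
# published ranges before using.
TRUSTED_BOTS = [
  IPAddr.new('66.249.64.0/19'), # example Googlebot range (verify!)
  IPAddr.new('72.30.0.0/16'),   # example Yahoo Slurp range (verify!)
].freeze

# True when the request IP falls inside any trusted netblock.
def trusted_bot?(ip)
  TRUSTED_BOTS.any? { |net| net.include?(ip) }
end

# Wired into Rack::Attack the same way as above:
#   whitelist('trusted bots') { |req| trusted_bot?(req.ip) }
```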
Good Day!
I really love your gem and implemented it after we had our first outage due to an aggressive crawler.
The only thing I miss so far is an easier way to whitelist Yahoo and the well-known Google bot. Is there any way to whitelist an IP range, for example?
Thanks a lot and keep on rocking!
Chris