rack / rack-attack

Rack middleware for blocking & throttling
MIT License

Whitelist Google and Bing Bots? #143

Closed chr1s1 closed 8 years ago

chr1s1 commented 9 years ago

Good Day!

I really love your gem and implemented it after we had our first outage due to an aggressive crawler.

The only thing I miss so far is an easier way to whitelist Yahoo and the well-known Googlebot. Is there any way to whitelist an IP range, for example?

Thanks a lot and keep on rocking!

Chris

pisaacs commented 9 years ago

Just as you can blacklist a set of IPs, you could use the same logic to whitelist IPs, or perhaps even ranges. See https://github.com/kickstarter/rack-attack/wiki/Advanced-Configuration#blacklisting-from-railscache for more information.

You could keep the IPs or IP ranges in a persistent store (file, DB) and, during app bootup, write the values to the cache store (and also flag them for persistence, where the store allows it, as Redis does).
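
A minimal sketch of the file-backed variant, using only Ruby's standard library. The file format (one IP or CIDR range per line, `#` for comments) and the helper names `load_allowed_ranges` and `allowed_ip?` are assumptions made for illustration, not part of rack-attack. Writing the parsed values to a cache store is left out here.

```ruby
require 'ipaddr'

# Hypothetical helper: read one IP or CIDR range per line,
# ignoring blank lines and anything after a '#'.
def load_allowed_ranges(path)
  File.readlines(path, chomp: true)
      .map { |line| line.sub(/#.*/, '').strip }
      .reject(&:empty?)
      .map { |entry| IPAddr.new(entry) }
end

# Hypothetical helper: true if the IP string falls inside any loaded range.
# Unparseable input is treated as "not allowed" rather than raising.
def allowed_ip?(ranges, ip)
  addr = IPAddr.new(ip)
  ranges.any? { |range| range.include?(addr) }
rescue IPAddr::InvalidAddressError
  false
end

# Assumed wiring at boot (not executed here; needs the rack-attack gem):
#   ALLOWED = load_allowed_ranges('config/allowed_ranges.txt')
#   Rack::Attack.whitelist('allowed ranges') { |req| allowed_ip?(ALLOWED, req.ip) }
```

Parsing once at boot keeps the per-request work down to a few integer comparisons. The path `config/allowed_ranges.txt` is only a placeholder.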

sandstrom commented 8 years ago

I think you can tell Google and Yahoo to keep below your limits, that way they'll never get blocked.

Also, I'm pretty sure that they'll automatically reduce the crawl rate if they receive a 429.

https://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive
https://support.google.com/webmasters/answer/48620?hl=en

ktheory commented 8 years ago

Hi @chr1s1! Good question. @pisaacs & @sandstrom have good suggestions for effectively allowing crawlers.

To your specific point about whitelisting a range, here's how:

require 'ipaddr'

BOTS = IPAddr.new('127.0.0.1/24') # a made-up netblock

# Whitelisted requests skip the throttle and blacklist checks.
Rack::Attack.whitelist('some bots') { |req| BOTS.include?(req.ip) }
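
To see what that netblock actually covers, here is a small standalone check that runs without rack-attack. The helper name `bot_ip?` is made up for this sketch. `IPAddr` masks the host bits of `'127.0.0.1/24'`, so the range spans the whole `/24`, and `include?` accepts a plain string such as the value returned by `req.ip`. As far as I know, newer rack-attack releases call the method `safelist` instead of `whitelist`, so check the version you have installed.

```ruby
require 'ipaddr'

BOTS = IPAddr.new('127.0.0.1/24') # same made-up netblock as above

# Hypothetical stand-in for the whitelist block body.
def bot_ip?(ip)
  BOTS.include?(ip)
end

range = BOTS.to_range
puts range.first            # 127.0.0.0
puts range.last             # 127.0.0.255
puts bot_ip?('127.0.0.200') # true
puts bot_ip?('127.0.1.1')   # false
```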