sul-dlss / was-pywb

Configuration for Stanford's pywb instance
https://swap.stanford.edu

robots.txt #194

Open edsu opened 1 year ago

edsu commented 1 year ago

Users have noticed that swap.stanford.edu can become unresponsive when under load. Some investigation of the logs showed that this can happen when we have seen sustained attention from bots (e.g. Yandex). In the most recent case Yandex was doing about 1.5 requests per second from about 8 IP addresses, which caused swap to be pretty much unusable because all the CPUs were at 100% utilization.
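For future incidents, a quick way to confirm this kind of bot pressure is to tally matching requests per second from the access log. The function below is a sketch: the combined log format (timestamp in field 4) and the pattern/path arguments are assumptions, not taken from the actual swap configuration.

```shell
#!/bin/sh
# requests_per_second LOGFILE PATTERN
# Prints per-second request counts for log entries whose line matches
# PATTERN (case-insensitive), busiest seconds first. Assumes combined
# log format, i.e. a timestamp like [12/Jan/2024:10:15:03 +0000] in
# whitespace-separated field 4.
requests_per_second() {
  grep -i "$2" "$1" \
    | awk '{ gsub(/\[/, "", $4); counts[$4]++ }
           END { for (t in counts) print counts[t], t }' \
    | sort -rn
}
```

For example, `requests_per_second access.log yandex | head` would show the ten busiest seconds of Yandex traffic.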

Even though Yandex does not respect the crawl-delay directive in robots.txt files, we think it would be good to instruct the crawlers that do (Google, Bing, Facebook, etc.) with a:

User-agent: *
Crawl-delay: 10

If we continue to run into performance problems, we should consider:

See https://github.com/sul-dlss/puppet/pull/9614 for the robots.txt change.

If Yandex is going to be blocked in perpetuity it would be preferable to do that in the robots.txt rather than at the IP level, which is what we are doing currently: https://github.com/sul-dlss/puppet/pull/9619
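If the block moves to robots.txt, the entry would be a per-agent disallow alongside the crawl-delay rule, along these lines (a sketch; this of course only works if Yandex honors it, which the traffic above suggests it may not):

User-agent: Yandex
Disallow: /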