Closed: jeancochrane closed this pull request 5 years ago
The canonical way to block these bots would be to define this behavior in /robots.txt. I'm curious whether we should add robots.txt in this PR as well?
That is a valid point. I discarded that thought in the moment, but robots.txt would be the fairer way to do this. Perhaps there is a canned robots.txt file somewhere for CKAN installs?
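For reference, a robots.txt along these lines could express the same policy. The bot names below are illustrative, not the list from this PR, and note that robots.txt is only advisory: badly behaved crawlers can ignore it, which is one argument for keeping the Nginx-level block too.

```text
# Disallow specific aggressive crawlers site-wide (names are examples).
User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

# All other crawlers may access everything.
User-agent: *
Disallow:
```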
Not really sure what the best way to test this PR is. Perhaps shell into the server and compare it to the live Nginx config at /etc/nginx/sites-available/ckan?
My plan was to render the template with Ansible and then compare that against the production file. Doesn't feel great, but provides a good degree of confidence.
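A rough sketch of that comparison step. In practice the two inputs would come from rendering the Ansible template (e.g. with ansible-playbook --check --diff) and from fetching /etc/nginx/sites-available/ckan off the server; the files created below are stand-ins so the diff itself is concrete, and all paths are hypothetical.

```shell
# Stand-ins for (a) the config fetched from the server and
# (b) the config rendered locally from the Ansible template.
mkdir -p live rendered
printf 'server { listen 80; }\n' > live/ckan
printf 'server { listen 80; }\n' > rendered/ckan

# A clean diff means the template reproduces production exactly.
if diff -u live/ckan rendered/ckan; then
    echo "configs match"
fi
```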
Also, generating stats against the logs was a good idea. Good to see what the expected impact will be.
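A rough sketch of that kind of audit, assuming combined-format Nginx access logs (the sample lines and the assumption that the user agent is the final quoted field are mine, not taken from the actual logs):

```python
import re
from collections import Counter

# In the combined log format the User-Agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"$')

def user_agent_counts(lines):
    """Count requests per User-Agent across access-log lines."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines standing in for the real access log.
sample = [
    '1.2.3.4 - - [25/Nov/2018:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl)"',
    '5.6.7.8 - - [25/Nov/2018:10:00:01 +0000] "GET /dataset HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '1.2.3.4 - - [25/Nov/2018:10:00:02 +0000] "GET /about HTTP/1.1" 200 256 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl)"',
]
counts = user_agent_counts(sample)
print(counts.most_common(1))  # the heaviest user agent and its request count
```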
Good instinct on the canned robots.txt; this looks like what we want: https://github.com/ckan/ckan/blob/master/ckan/templates/robots.txt. I'll investigate how to configure it.
Overview
During the incident described in #92, we adjusted the Nginx configuration directly on the server to block aggressive web crawlers. Persist that config change in the codebase so that it will be replicated the next time the app gets provisioned.
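A minimal sketch of what an Nginx user-agent block like this can look like. The bot names, port, and server name are illustrative assumptions, not the actual production config; the map directive must sit at the http level of the config.

```nginx
# Flag requests from aggressive crawlers by User-Agent (names are examples).
map $http_user_agent $blocked_bot {
    default        0;
    ~*SemrushBot   1;
    ~*AhrefsBot    1;
}

server {
    listen 80;
    server_name example.org;

    location / {
        # Reject flagged crawlers outright.
        if ($blocked_bot) {
            return 403;
        }
        proxy_pass http://localhost:8080;
    }
}
```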
Notes
The canonical way to block these bots would be to define this behavior in /robots.txt. I'm curious whether we should add robots.txt in this PR as well?

While making this change, I did a quick audit of bot behavior in the logs to confirm that we want to block these bots. As a result, I removed one bot from the original list defined in #91, qwant, which is a crawler for a legitimate search engine that only sent requests once per 30 seconds or so and stopped sending requests on 11/29.

Blocking SemRushBot alone would have eliminated roughly half of the traffic between 11/25/18 and 12/3/18:
Testing Instructions
Not really sure what the best way to test this PR is. Perhaps shell into the server and compare it to the live Nginx config at /etc/nginx/sites-available/ckan?
Checklist
Resolves #91