Closed: jeancochrane closed this pull request 5 years ago
The canonical way to block these bots would be to define this behavior in /robots.txt. I'm curious whether we should add robots.txt in this PR as well?
That is a valid point. I discarded that thought in the moment, but robots.txt would be the fairer way to do this. Perhaps there is a canned robots.txt file somewhere for CKAN installs?
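For reference, a robots.txt along these lines could express the same policy. The bot names below are illustrative, not the list from this PR, and note that robots.txt is only advisory: badly behaved crawlers can ignore it, which is one argument for keeping the Nginx-level block too.

```text
# Disallow specific aggressive crawlers site-wide (names are examples).
User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

# All other crawlers may access everything.
User-agent: *
Disallow:
```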
Not really sure what the best way to test this PR is. Perhaps shell into the server and compare it to the live Nginx config at /etc/nginx/sites-available/ckan?
My plan was to render the template with Ansible and then compare that against the production file. Doesn't feel great, but provides a good degree of confidence.
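A rough sketch of that comparison step. In practice the two inputs would come from rendering the Ansible template (e.g. with ansible-playbook --check --diff) and from fetching /etc/nginx/sites-available/ckan off the server; the files created below are stand-ins so the diff itself is concrete, and all paths are hypothetical.

```shell
# Stand-ins for (a) the config fetched from the server and
# (b) the config rendered locally from the Ansible template.
mkdir -p live rendered
printf 'server { listen 80; }\n' > live/ckan
printf 'server { listen 80; }\n' > rendered/ckan

# A clean diff means the template reproduces production exactly.
if diff -u live/ckan rendered/ckan; then
    echo "configs match"
fi
```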
Also, generating stats against the logs was a good idea. Good to see what the expected impact will be.
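A rough sketch of that kind of audit, assuming combined-format Nginx access logs (the sample lines and the assumption that the user agent is the final quoted field are mine, not taken from the actual logs):

```python
import re
from collections import Counter

# In the combined log format the User-Agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"$')

def user_agent_counts(lines):
    """Count requests per User-Agent across access-log lines."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines standing in for the real access log.
sample = [
    '1.2.3.4 - - [25/Nov/2018:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl)"',
    '5.6.7.8 - - [25/Nov/2018:10:00:01 +0000] "GET /dataset HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '1.2.3.4 - - [25/Nov/2018:10:00:02 +0000] "GET /about HTTP/1.1" 200 256 "-" "Mozilla/5.0 (compatible; SemrushBot/2~bl)"',
]
counts = user_agent_counts(sample)
print(counts.most_common(1))  # the heaviest user agent and its request count
```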
Good instinct on the canned robots.txt; this looks like what we want: https://github.com/ckan/ckan/blob/master/ckan/templates/robots.txt. I'll investigate how to configure it.
Overview
During the incident described in #92, we adjusted the Nginx configuration directly on the server to block aggressive web crawlers. Persist that config change in the codebase so that it will be replicated the next time the app gets provisioned.
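A minimal sketch of what an Nginx user-agent block like this can look like. The bot names, port, and server name are illustrative assumptions, not the actual production config; the map directive must sit at the http level of the config.

```nginx
# Flag requests from aggressive crawlers by User-Agent (names are examples).
map $http_user_agent $blocked_bot {
    default        0;
    ~*SemrushBot   1;
    ~*AhrefsBot    1;
}

server {
    listen 80;
    server_name example.org;

    location / {
        # Reject flagged crawlers outright.
        if ($blocked_bot) {
            return 403;
        }
        proxy_pass http://localhost:8080;
    }
}
```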
Notes
The canonical way to block these bots would be to define this behavior in /robots.txt. I'm curious whether we should add robots.txt in this PR as well?

While making this change, I did a quick audit of bot behavior in the logs to confirm that we want to block these bots. As a result, I removed one bot from the original list defined in #91, qwant, which is a crawler for a legitimate search engine that only sent requests once per 30 seconds or so and stopped sending requests on 11/29.

Blocking SemRushBot alone would have eliminated roughly half of the traffic between 11/25/18 and 12/3/18:
Testing Instructions
Not really sure what the best way to test this PR is. Perhaps shell into the server and compare it to the live Nginx config at /etc/nginx/sites-available/ckan?
Checklist
Resolves #91