opendataphilly / opendataphilly-ckan

Port of OpenDataPhilly to CKAN

Persist Nginx config to block web crawlers #95

Closed. jeancochrane closed this 5 years ago.

jeancochrane commented 5 years ago

Overview

During the incident described in #92, we adjusted the Nginx configuration directly on the server to block aggressive web crawlers. Persist that config change in the codebase so that it will be replicated the next time the app gets provisioned.
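
For reference, the change amounts to an Nginx rule that rejects requests based on the User-Agent header. The snippet below is only a sketch of that approach, not the exact directives applied on the server; SemrushBot is taken from the log counts in the Notes, and the 403 response is an assumption:

  # Sketch only (inside the server block). SemrushBot comes from the logs
  # below; returning 403 is an assumption, not the deployed config.
  if ($http_user_agent ~* "SemrushBot") {
      return 403;
  }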

Notes

$ cat ckan_default.custom.log.1 | grep semrush | wc -l
  199198
$ cat ckan_default.custom.log.1 | wc -l
  425253
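
In other words, roughly 47% of the requests in that log (199,198 of 425,253) matched "semrush".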

Testing Instructions

Checklist

Resolves #91

hectcastro commented 5 years ago

The canonical way to block these bots would be to define this behavior in /robots.txt. I'm curious whether we should add robots.txt in this PR as well?

That is a valid point. I dismissed the idea at the time, but robots.txt would be the fairer way to do this. Perhaps there is a canned robots.txt file somewhere for CKAN installs?

I'm not really sure what the best way to test this PR is. Perhaps shell into the server and compare the template against the live Nginx config at /etc/nginx/sites-available/ckan?

My plan was to render the template with Ansible and then compare the output against the production file. It doesn't feel great, but it provides a good degree of confidence.
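
Roughly, that would look something like the following; the playbook name, host alias, and local paths are placeholders, since they aren't shown in this thread:

  # Run the playbook in check mode and show a diff between the rendered
  # template and the file currently on the server (names are placeholders).
  $ ansible-playbook --check --diff -l production site.yml

  # Alternatively, pull down the live config and diff it against a locally
  # rendered copy.
  $ scp deploy@opendataphilly:/etc/nginx/sites-available/ckan /tmp/ckan.live
  $ diff /tmp/ckan.rendered /tmp/ckan.live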

Also, generating stats from the logs was a good idea. It's good to see how much of an impact we can expect.

jeancochrane commented 5 years ago

Good instinct on the canned robots.txt; this looks like what we want: https://github.com/ckan/ckan/blob/master/ckan/templates/robots.txt. I'll investigate how to configure it.
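
If we end up adding entries for specific crawlers, the addition would presumably look something like the lines below. This is an illustrative robots.txt fragment, not the contents of the CKAN template, and unlike the Nginx rule it only helps if the crawler chooses to honor it:

  User-agent: SemrushBot
  Disallow: /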