ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
MIT License

Appsec: block crawlers from DataHub and Find MoJ Data #760

Open jemnery opened 2 months ago

jemnery commented 2 months ago

Add a robots.txt file and noindex / nofollow directives (meta tag or response header) to prevent crawlers from indexing our services.

Research the current best practice here.
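For reference, a minimal sketch of what a disallow-all robots.txt looks like and how a compliant crawler would interpret it, using Python's standard-library parser. The `ROBOTS_TXT` constant, the `is_allowed` helper, and the example URL are illustrative assumptions, not anything that exists in this repository.

```python
from urllib.robotparser import RobotFileParser

# Assumed policy: ask every crawler to stay away from every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honours robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL, used only to exercise the parser.
    print(is_allowed("Googlebot", "https://example.test/search"))
```

This only models crawlers that choose to follow the file; it does nothing to stop one that ignores it.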

MatMoore commented 2 months ago

https://developers.google.com/search/docs/crawling-indexing/block-indexing
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta/name#other_metadata_names

Seems like we should go with <meta name="robots" content="noindex, nofollow" /> on every page.
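The same directive can also be sent as an X-Robots-Tag response header, which covers non-HTML responses too. Below is a rough sketch of a generic WSGI middleware that adds it to every response. The class name `NoIndexMiddleware` is hypothetical, and the sketch does not assume how this service is actually wired up. Note that a crawler only sees either form of the directive if it is allowed to fetch the page, so blocking the same URLs in robots.txt can stop noindex from being read.

```python
# Header equivalent of <meta name="robots" content="noindex, nofollow" />.
ROBOTS_HEADER = ("X-Robots-Tag", "noindex, nofollow")


class NoIndexMiddleware:
    """WSGI middleware (hypothetical name) that adds X-Robots-Tag to all responses."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so ours is the only one sent.
            headers = [h for h in headers if h[0].lower() != "x-robots-tag"]
            headers.append(ROBOTS_HEADER)
            return start_response(status, headers, exc_info)

        return self.app(environ, patched_start_response)
```

Wrapping is a one-liner, for example `application = NoIndexMiddleware(application)`, where `application` stands for whatever WSGI callable the service exposes.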

tom-webber commented 2 months ago

As a note, robots.txt is advisory rather than an enforceable rule. Some crawler operators respect it, but others don't. If we want to actually block crawlers, we'll need to block the IP ranges they publish for their crawlers.
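A minimal sketch of the IP-based check, assuming the published ranges have already been collected into a list of CIDR blocks. The ranges below are reserved documentation blocks standing in for real crawler ranges, and `is_blocked` is a hypothetical helper. In practice this kind of rule may sit better at the ingress or WAF layer than in application code.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges), NOT real crawler IPs.
CRAWLER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def is_blocked(ip: str, ranges=CRAWLER_RANGES) -> bool:
    """Return True if the client IP falls inside any listed crawler range."""
    addr = ipaddress.ip_address(ip)
    # Membership is simply False when the address and network versions differ.
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    for client_ip in ("192.0.2.10", "203.0.113.5", "2001:db8::1"):
        print(client_ip, is_blocked(client_ip))
```

The list would need refreshing whenever an operator updates its published ranges, and crawlers that do not publish ranges are not covered by this approach.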