Open garethrees opened 7 months ago
🤔 Similar requests should already be disallowed for indexing https://github.com/mysociety/alaveteli/blob/0.44.0.0/public/robots.txt#L19
It's not `/similar/`, it's `/similar?page=4&utm_campaign=alaveteli-experiments-87&utm_content=sidebar_similar_requests&utm_medium=link&utm_source=whatdotheyknow`
The `*` should include anything after `*/similar/*`. I can see the issue though; it should be `*/similar*`.
Only if there is a `/` after the `similar`. See Google's documentation (search for `/fish/`): https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
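To make the difference concrete, here is a rough sketch of the matching rules as Google documents them: `*` matches any run of characters, a trailing `$` anchors the end, and otherwise the rule only needs to match a prefix of the path plus query string. This is a simplified matcher written for illustration, not Google's parser; the helper name `robots_pattern_matches` and the example request slug are made up.

```python
import re


def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Return True if a Google-style robots.txt path pattern matches `path`.

    `*` matches any run of characters, a trailing `$` anchors the end of the
    URL, and otherwise the pattern only has to match a prefix of the path
    (query string included).
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Treat every piece literally, then join the pieces with `.*` for each `*`.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, giving prefix semantics.
    return re.match(regex, path) is not None


# Hypothetical URL modelled on the one reported above.
url = "/request/some_request/similar?page=4&utm_source=whatdotheyknow"

print(robots_pattern_matches("*/similar/*", url))  # False: no "/" after "similar"
print(robots_pattern_matches("*/similar*", url))   # True
```

Under these rules the current `*/similar/*` rule never sees the paginated URL, while `*/similar*` covers both `/similar` and `/similar?page=N`.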
Might as well cover the actions noted in https://github.com/mysociety/alaveteli/issues/8216 as part of this, since it seems pretty easy to do:
- `/request/SLUG/annotate`
- `/request/SLUG/similar`
The main things we want indexed are record pages themselves (info request pages, user pages, authority pages, etc).
Snippets of request content often appear on list pages, which creates a whack-a-mole situation when unhappy users find that external search engines have indexed a list page (e.g. `/body/foo?page=12`) that contains a cached snippet of PII that we've removed from the request page itself.

We should stop indexing of:
- list pages (`/list/all`, `/list/successful`, etc.) with a `page=` query param
- `/request/:url_title/similar`
- `/body/:url_name?page=N`
- `/user/:url_name?page=N`
- `/user/:url_name/wall`

We might be able to do this via `robots.txt`, or could set it via the `X-Robots-Tag` header depending on the page number: