Reduce external search indexing of request list pages

garethrees commented 7 months ago

The main things we want indexed are record pages themselves (info request pages, user pages, authority pages, etc).

Snippets of request content often appear on list pages, and create a whack-a-mole situation when unhappy users find that external search engines have indexed a list page (e.g. /body/foo?page=12) that contains a cached snippet of PII that we've removed from the request page itself.

We should stop indexing of:

[x] Request list pages (/list/all, /list/successful, etc) with a page= query param
[x] Similar requests page (/request/:url_title/similar)
[x] Body pages with a page= query param (/body/:url_name?page=N)
[x] User pages with a page= query param (/user/:url_name?page=N)
[x] User "wall" page (/user/:url_name/wall)
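To make the checklist above concrete, here is a minimal sketch of a classifier for the listed paths. The method name `noindex_candidate?`, both constant names, and the regular expressions are hypothetical and only approximate the route shapes quoted in the list; they are not taken from the Alaveteli codebase.

```ruby
# Hypothetical patterns approximating the routes in the checklist.
# These paths are only excluded when a page= query param is present.
NOINDEX_WHEN_PAGINATED = [
  %r{\A/list/[^/]+\z},   # /list/all, /list/successful, etc
  %r{\A/body/[^/]+\z},   # /body/:url_name
  %r{\A/user/[^/]+\z}    # /user/:url_name
].freeze

# These paths are excluded regardless of query params.
NOINDEX_ALWAYS = [
  %r{\A/request/[^/]+/similar\z}, # /request/:url_title/similar
  %r{\A/user/[^/]+/wall\z}        # /user/:url_name/wall
].freeze

# path:  the URL path without the query string
# query: a Hash of query params with String keys
def noindex_candidate?(path, query = {})
  return true if NOINDEX_ALWAYS.any? { |re| re.match?(path) }

  query.key?('page') && NOINDEX_WHEN_PAGINATED.any? { |re| re.match?(path) }
end
```

For example, `noindex_candidate?('/body/foo', 'page' => '12')` is true, while `noindex_candidate?('/body/foo')` is false.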

We might be able to do this via robots.txt, or could set it via the X-Robots-Tag header depending on the page number:

before_action :set_no_crawl_headers, if: -> { params[:page].to_i > 1 }
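As a rough sketch of what the condition above evaluates to, the plain-Ruby module below mimics the decision outside of Rails. The module name `NoCrawlHeaders`, its methods, and the header value `noindex, nofollow` are assumptions for illustration; the issue does not say what `set_no_crawl_headers` actually sets.

```ruby
# Hypothetical stand-in for the controller callback, written as plain Ruby
# so the page-number logic can be checked without a Rails app.
module NoCrawlHeaders
  # Assumed header value; the real callback may use something different.
  NO_CRAWL_VALUE = 'noindex, nofollow'

  # Mirrors the `params[:page].to_i > 1` condition from the before_action.
  # nil.to_i and non-numeric strings both give 0, so they count as page 1.
  def self.paginated?(params)
    params[:page].to_i > 1
  end

  # Returns the extra response headers for a request with these params.
  def self.headers_for(params)
    paginated?(params) ? { 'X-Robots-Tag' => NO_CRAWL_VALUE } : {}
  end
end
```

Note that with this condition a URL carrying `page=1` would still be indexable, since only page numbers above 1 trigger the header.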

garethrees commented 7 months ago

🤔 Similar requests should already be disallowed for indexing https://github.com/mysociety/alaveteli/blob/0.44.0.0/public/robots.txt#L19

HelenWDTK commented 7 months ago

It's not /similar/, it's /similar?page=4&utm_campaign=alaveteli-experiments-87&utm_content=sidebar_similar_requests&utm_medium=link&utm_source=whatdotheyknow

garethrees commented 7 months ago

The * should include anything after */similar/* – I can see the issue though; should be */similar*

HelenWDTK commented 7 months ago

Only if there is a / after the similar. See Google's documentation (search for /fish/ on https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt)
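The difference between the two patterns can be checked with a simplified matcher. This is a sketch under assumptions: it treats `*` as "any sequence of characters", a trailing `$` as an end anchor, and everything else as a literal prefix match, which is my reading of the wildcard rules in the Google documentation linked above. The method names are hypothetical, and treating a leading `*` like any other wildcard is also an assumption.

```ruby
# Convert a robots.txt path pattern into a Regexp (simplified model).
def robots_pattern_to_regexp(pattern)
  anchored = pattern.end_with?('$')
  body = anchored ? pattern[0...-1] : pattern
  # Escape the literal parts and join them with ".*" for each wildcard.
  source = body.split('*', -1).map { |part| Regexp.escape(part) }.join('.*')
  Regexp.new('\A' + source + (anchored ? '\z' : ''))
end

# True when the path (including any query string) matches the pattern.
def robots_match?(pattern, path)
  robots_pattern_to_regexp(pattern).match?(path)
end

path = '/request/foo/similar?page=4'
robots_match?('*/similar/*', path) # false: no "/" follows "similar"
robots_match?('*/similar*', path)  # true
```

Under this model `*/similar/*` only matches when a `/` follows `similar`, so the `?page=4` URL is missed, while `*/similar*` covers it.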

garethrees commented 5 months ago

Might as well cover the actions noted in https://github.com/mysociety/alaveteli/issues/8216 as part of this since it seems pretty easy to do:

[ ] The annotation page (/request/SLUG/annotate)
[ ] The similar requests page (/request/SLUG/similar)
[ ] Any links that always require a sign-in (reply, report, status update, request ZIP download)

mysociety / alaveteli

Reduce external search indexing of request list pages #8132