mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org

Reduce external search indexing of request list pages #8132

Open garethrees opened 7 months ago

garethrees commented 7 months ago

The main things we want indexed are record pages themselves (info request pages, user pages, authority pages, etc).

Snippets of request content often appear on list pages, and create a whack-a-mole situation when unhappy users find that external search engines have indexed a list page (e.g. /body/foo?page=12) that contains a cached snippet of PII that we've removed from the request page itself.

We should stop indexing of:

We might be able to do this via robots.txt, or we could set the X-Robots-Tag header depending on the page number:

before_action :set_no_crawl_headers, if: -> { params[:page].to_i > 1 }
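
A minimal sketch of what the callback's decision could look like, written as plain Ruby so it runs outside Rails. The helper name `no_crawl_headers` and the `noindex` value are assumptions for illustration, not something the issue specifies; in a controller, the `set_no_crawl_headers` callback would presumably assign the same value to `response.headers` instead of returning a Hash.

```ruby
# Hypothetical helper: returns the extra response headers for a list page.
# `params` is a plain Hash standing in for Rails' params object.
def no_crawl_headers(params)
  # nil.to_i and "".to_i are both 0, so a missing page counts as page 1
  page = params[:page].to_i
  return {} unless page > 1

  # Assumed header value; `noindex` asks crawlers not to index the page
  { 'X-Robots-Tag' => 'noindex' }
end

no_crawl_headers({})              # => {}
no_crawl_headers(page: '1')       # => {}
no_crawl_headers(page: '12')      # => {"X-Robots-Tag"=>"noindex"}
```

With this condition the first page of each list stays indexable and only deeper pages such as /body/foo?page=12 get the header.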

garethrees commented 7 months ago

🤔 Similar requests should already be disallowed for indexing https://github.com/mysociety/alaveteli/blob/0.44.0.0/public/robots.txt#L19

HelenWDTK commented 7 months ago

It's not /similar/, it's /similar?page=4&utm_campaign=alaveteli-experiments-87&utm_content=sidebar_similar_requests&utm_medium=link&utm_source=whatdotheyknow

garethrees commented 7 months ago

The * should include anything after */similar/* – I can see the issue though; should be */similar*

HelenWDTK commented 7 months ago

Only if there is a / after similar. See Google's documentation (search for /fish/ on https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt).
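
To make the difference concrete, here is a rough sketch of the matching rules as Google documents them (prefix match, `*` as a wildcard, optional trailing `$`). The helper names are hypothetical and this is not how any crawler is actually implemented; it only shows why `*/similar/*` misses `/similar?page=4` while `*/similar*` catches it.

```ruby
# Convert a robots.txt path pattern into a Regexp:
# - matching starts at the beginning of the path (prefix match)
# - `*` matches any run of characters
# - a trailing `$` anchors the match to the end of the URL
def robots_pattern_to_regexp(pattern)
  anchored = pattern.end_with?('$')
  body = anchored ? pattern[0...-1] : pattern
  source = body.split('*', -1).map { |part| Regexp.escape(part) }.join('.*')
  Regexp.new("\\A#{source}#{anchored ? '\\z' : ''}")
end

# Hypothetical helper: does `pattern` cover `path` (path plus query string)?
def robots_match?(pattern, path)
  robots_pattern_to_regexp(pattern).match?(path)
end

robots_match?('*/similar/*', '/similar?page=4') # => false
robots_match?('*/similar*',  '/similar?page=4') # => true
robots_match?('/fish/',      '/fish')           # => false
robots_match?('/fish/',      '/fish/?id=1')     # => true
```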

garethrees commented 5 months ago

Might as well cover the actions noted in https://github.com/mysociety/alaveteli/issues/8216 as part of this since it seems pretty easy to do: