nationalarchives / ds-caselaw-public-ui

Public frontend to the National Archives Find Case Law service
https://caselaw.nationalarchives.gov.uk
MIT License

Homepage blocked for non-Google and non-Microsoft bots, and other `robots.txt` oddness #985

Closed: symroe closed this issue 1 year ago

symroe commented 1 year ago

The current `robots.txt` file allows Googlebot and bingbot to crawl the home page, but disallows all other bots from doing the same:

(code block snipped to highlight lines)

User-Agent: Googlebot
User-Agent: bingbot
Allow: /
...
Disallow: /$

User-Agent: *
Disallow: /
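The effect of the lines quoted above can be checked with Python's stdlib `urllib.robotparser` (a sketch using only the lines shown; the real file has more rules where the snip is, and note the stdlib parser does not implement the `$` end-anchor extension, so `Disallow: /$` is treated as a literal path prefix here):

```python
import urllib.robotparser

# Parse only the robots.txt lines quoted above (snipped rules omitted).
rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-Agent: Googlebot
User-Agent: bingbot
Allow: /
Disallow: /$

User-Agent: *
Disallow: /
""".splitlines())

home = "https://caselaw.nationalarchives.gov.uk/"
print(rp.can_fetch("Googlebot", home))    # True: Googlebot may crawl the home page
print(rp.can_fetch("DuckDuckBot", home))  # False: every other bot is blocked
```

This reproduces the reported behaviour: only the two named crawlers can fetch `/`.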

This is somewhat confusing, especially the seemingly contradictory `Disallow: /$` line for Googlebot and bingbot.

I think the intent is to allow all robots to access the home page (`/`) and the list of allowed pages. All other pages should be disallowed. If that's true, I think the file can be simplified to:

User-Agent: *
Allow: /
Allow: /transactional-licence-form
Allow: /about-this-service
Allow: /how-to-use-this-service
Allow: /accessibility-statement
Allow: /open-justice-licence
Allow: /terms-of-use
Disallow: /*/
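One caveat with the proposal above: the `/*/` pattern relies on the wildcard extension (now standardised in RFC 9309), which some parsers, including Python's stdlib one, don't implement. To sanity-check the intent, here is a minimal sketch of RFC 9309's matching semantics (longest matching pattern wins, Allow wins ties), with the rule list transcribed from the proposal:

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern":
    # '*' matches any run of characters; a trailing '$' anchors the end
    # of the path (the two special characters in RFC 9309).
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path: str) -> bool:
    # rules: list of (allow, pattern). Per RFC 9309 the most specific
    # (longest) matching pattern wins, Allow wins a tie, and a path
    # matching no rule is allowed.
    best = None
    for allow, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), allow)
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# The proposed rule set from the comment above.
rules = [
    (True, "/"),
    (True, "/transactional-licence-form"),
    (True, "/about-this-service"),
    (True, "/how-to-use-this-service"),
    (True, "/accessibility-statement"),
    (True, "/open-justice-licence"),
    (True, "/terms-of-use"),
    (False, "/*/"),
]

print(is_allowed(rules, "/"))                    # True: home page
print(is_allowed(rules, "/terms-of-use"))        # True: listed page
print(is_allowed(rules, "/ewca/civ/2023/1360"))  # False: judgment page matches /*/
```

Under longest-match semantics the proposal does what's intended; under the older first-match or literal-path semantics, `Disallow: /*/` would be ignored and everything would be crawlable, so it's worth checking which crawlers you care about support the extension.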

As an aside, when looking at this I found two other URLs that are indexed:

dragon-dxw commented 1 year ago

The assets subdomain's `robots.txt` is now:

User-Agent: *
Disallow: *

dragon-dxw commented 1 year ago

`Disallow: *` is not the right syntax -- it should be `Disallow: /`. Put both in assets.

jacksonj04 commented 1 year ago

> `Disallow: *` is not the right syntax -- it should be `Disallow: /`. Put both in assets.

Was just about to say the same! Just to confirm:

timcowlishaw commented 1 year ago

Another one that seems to be indexed by Google:

https://caselaw.nationalarchives.gov.uk/ewca/civ/2023/1360

See this search query - it's the top result.

timcowlishaw commented 1 year ago

It might be worth checking the Common Crawl URL index too at some point (when it's working; it's giving a 500 this morning): http://urlsearch.commoncrawl.org/

symroe commented 1 year ago

This is a somewhat disappointing outcome from my point of view!

We've gone from blocking the home page on some search engines to blocking it on all search engines.

This means that not only can I not find the home page (using DuckDuckGo), soon no one will be able to.

I understand this is something your team will have decided to do, but it's a little confusing that this is the decision.

Good to see you've fixed the assets subdomain. The api subdomain still doesn't seem to have a `robots.txt`, though.