Closed: symroe closed this 1 year ago
assets robots.txt is now:

```
User-Agent: *
Disallow: *
```

`Disallow: *` is not the right syntax -- it should be `Disallow: /`. Put both in assets.
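For what it's worth, the difference is easy to demonstrate with Python's standard-library parser (a sketch; the URL is just an example). Under the original robots.txt rules a `*` in a path is matched literally, so `Disallow: *` blocks nothing:

```python
import urllib.robotparser

def blocked(rules: str, url: str) -> bool:
    """True if the given robots.txt rules block the URL for every agent."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return not rp.can_fetch("*", url)

url = "https://assets.caselaw.nationalarchives.gov.uk/some/file.pdf"
print(blocked("User-Agent: *\nDisallow: *", url))  # False: "*" is a literal path, so nothing is blocked
print(blocked("User-Agent: *\nDisallow: /", url))  # True: "/" is a prefix of every path
```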
Was just about to say the same! Just to confirm:

- api.caselaw has `User-Agent: *` and `Disallow: *`
- assets.caselaw has `User-Agent: *` and `Disallow: *`
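For reference, here's a quick sketch (standard library only; the hostnames are the two from this thread) that fetches and prints what each subdomain is actually serving:

```python
import urllib.error
import urllib.request

# Print each subdomain's live robots.txt so the rules can be checked directly.
for host in ("api.caselaw.nationalarchives.gov.uk",
             "assets.caselaw.nationalarchives.gov.uk"):
    url = f"https://{host}/robots.txt"
    try:
        with urllib.request.urlopen(url) as resp:
            print(f"--- {url} ---\n{resp.read().decode()}")
    except urllib.error.HTTPError as exc:
        print(f"--- {url} --- HTTP {exc.code}")  # e.g. 404 if the file is missing
```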
Another one that seems to be indexed by Google:
https://caselaw.nationalarchives.gov.uk/ewca/civ/2023/1360
See this search query - it's the top result.
It might be worth checking the Common Crawl URL index too at some point (when it's working - it's returning a 500 this morning): http://urlsearch.commoncrawl.org/
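If anyone wants to script that check, the newer Common Crawl index servers expose a CDX-style query API; here's a sketch (the collection name "CC-MAIN-2023-50" is an assumption - pick a current one from the list at index.commoncrawl.org):

```python
import json
import urllib.parse
import urllib.request

# Query one Common Crawl collection for captures under the caselaw domain.
params = urllib.parse.urlencode({
    "url": "caselaw.nationalarchives.gov.uk/*",  # prefix match over the whole site
    "output": "json",
    "limit": "10",
})
index = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
with urllib.request.urlopen(f"{index}?{params}") as resp:
    for line in resp:
        record = json.loads(line)
        print(record["status"], record["url"])
```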
This is a somewhat disappointing outcome from my point of view!
We've gone from blocking the home page on some search engines to blocking it on all search engines.
This means that not only can I not find the home page (using DuckDuckGo), but soon no one will be able to.
I understand this is a decision your team will have made, but it's a little confusing that this is the decision.
Good to see you've fixed the `assets` subdomain. The `api` subdomain still doesn't seem to have a robots.txt though.
The current robots.txt file allows Googlebot and bingbot to crawl the home page, but disallows other bots from doing the same thing:

(code block snipped to highlight lines)

This is somewhat confusing, especially the seemingly contradictory `Disallow: /$` line for Googlebot and bingbot. I think the intent is to allow all robots to access the home page (`/`) and the list of allowed pages, and to disallow all other pages. If that's true, I think it can be simplified to something like the sketch below.
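The suggested replacement block itself was snipped from this extract, so this is only a sketch of the intent described above, checked with Python's standard-library parser. Two caveats: `/judgments` is a hypothetical stand-in for whatever pages are actually allowed, and `urllib.robotparser` implements the original robots.txt rules, so the `$` end-anchor that Googlebot and Bingbot use to match exactly `/` is not honoured:

```python
import urllib.robotparser

# Sketch of the simplified file. "/judgments" is a hypothetical allowed
# path; the real file would list the actual allowed pages. In robotparser
# the first matching rule wins.
simplified = """\
User-agent: *
Allow: /$
Allow: /judgments
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(simplified.splitlines())

base = "https://caselaw.nationalarchives.gov.uk"
print(rp.can_fetch("*", f"{base}/"))                    # False here: robotparser has no "$" support,
                                                        # but Googlebot/Bingbot would allow the home page
print(rp.can_fetch("*", f"{base}/judgments/search"))    # True: matches the Allow prefix
print(rp.can_fetch("*", f"{base}/ewca/civ/2023/1360"))  # False: everything else is disallowed
```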
As an aside, when looking at this I found two other URLs that are indexed:

- api.caselaw.nationalarchives.gov.uk isn't excluded from crawling
- assets.caselaw.nationalarchives.gov.uk has a robots.txt file excluding `/`, but results are showing for it on a Google `site:` search. I think this is meant to be `Disallow: *` for all bots?