ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0

Allow search engines to index archived website #96

Open nicolabingham opened 2 years ago

nicolabingham commented 2 years ago

Please can we modify the public website robots.txt file so that this archived website can be indexed by search engines: https://www.webarchive.org.uk/act/wayback/archive/20190313122106/http://www.europeandialogue.org/ ? This is a permission-cleared website which no longer exists on the live web, and the content owners would like it to be discoverable. I will submit it to Google afterwards so that they can find it.

anjackson commented 2 years ago

Implemented, will roll-out with the other updates to the website.

nicolabingham commented 2 years ago

Ah I'm so sorry, I've pasted the wrong URL into here, it should be https://www.webarchive.org.uk/wayback/archive/*/http://www.europeandialogue.org/

anjackson commented 2 years ago

No worries, I realized that.

nicolabingham commented 1 year ago

Sorry, @anjackson can you help with the verification step for Google please? I submitted the URL (https://www.webarchive.org.uk/wayback/archive/*/http://www.europeandialogue.org/) to Google for indexing, but Google requires a verification step which I'm struggling to complete. Is it possible to download the code provided by Google and upload it to the website? Alternatively, could you complete one of the other verification methods? Thanks

anjackson commented 1 year ago

We should be able to use the Google Analytics option, but it doesn't like the way the analytics have been installed. I'm rolling a ukwa/ukwa-pywb:2.6.7.3 release with the analytics code in the <head> of the page.

anjackson commented 1 year ago

Er, in trying to fix this, ended up doing the registration. Tried to give you access too!

anjackson commented 1 year ago

Is this all done now?

nicolabingham commented 1 year ago

No, sorry, it hasn't been indexed.

anjackson commented 1 year ago

That addition to robots.txt got lost when we switched over to the new site system. I'll look into deploying it.

anjackson commented 1 year ago

I used the Google robots.txt tester on the BETA version, and at least that part is working.

[Screenshot: 2023-01-13-robots-txt-tester]

anjackson commented 1 year ago

Okay, https://www.webarchive.org.uk/robots.txt is now updated. Looking it up in the search console the item is crawled but not indexed: https://search.google.com/search-console/inspect?resource_id=https%3A%2F%2Fwww.webarchive.org.uk%2F&id=Ab4uuoLvcNyEFYNN6aDV3w&hl=en

The page was crawled by Google but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.

Not sure what this means. Perhaps it already picked up the change to robots.txt but hasn't got to indexing it yet. Worth re-trying in a day or two.

anjackson commented 1 year ago

Ah, I was missing some subtleties in the robots.txt tester (the test path has to start at e.g. /wayback/... rather than wayback/...), and having two separate Allow sections seems to confuse things. Updating to fix that and also allow sub-paths of this site to be indexed:

Allow: /wayback/archive/*/http://www.europeandialogue.org/*
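
To illustrate the two subtleties mentioned above, here is a minimal Python sketch of Google-style robots.txt path matching (`*` wildcard, optional `$` end anchor, longest matching rule wins, Allow wins ties). It is not the code Google or the tester runs. The `Disallow: /wayback/archive/` rule and the names `RULES`, `winning_rule` and `is_allowed` are assumptions made for this example, since the actual contents of the site's robots.txt are not shown in this thread.

```python
import re

# Hypothetical rule set: only the Allow line comes from this thread.
# The Disallow line is an assumption about what blocks the archive by default.
RULES = [
    ("Disallow", "/wayback/archive/"),
    ("Allow", "/wayback/archive/*/http://www.europeandialogue.org/*"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def winning_rule(rules, path):
    """Return the (directive, pattern) that decides `path`, or None."""
    best_key, best_rule = None, None
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        # re.match anchors at the start, so patterns are prefix matches.
        if pattern_to_regex(pattern).match(path):
            # Longer pattern wins; on equal length, Allow beats Disallow.
            key = (len(pattern), directive.lower() == "allow")
            if best_key is None or key > best_key:
                best_key, best_rule = key, (directive, pattern)
    return best_rule


def is_allowed(rules, path):
    """A path that no rule matches is allowed by default."""
    rule = winning_rule(rules, path)
    return True if rule is None else rule[0].lower() == "allow"


if __name__ == "__main__":
    base = "/wayback/archive/20190313122106/"
    for path in (
        base + "http://www.europeandialogue.org/",
        base + "http://www.europeandialogue.org/about.html",
        base + "http://example.org/",
        # No leading slash: no rule applies, so the test says nothing useful.
        "wayback/archive/20190313122106/http://www.europeandialogue.org/",
    ):
        print(is_allowed(RULES, path), winning_rule(RULES, path), path)
```

In this sketch a test path without the leading slash matches no rule at all, so a tester would fall back to the default rather than exercise the new Allow line. That is one plausible reading of why the earlier test looked fine while the rule was not actually being applied.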

anjackson commented 1 year ago

Okay, finally able to request indexing. Should hopefully turn up at https://search.google.com/search-console/inspect?resource_id=https%3A%2F%2Fwww.webarchive.org.uk%2F&id=v-0hCfCrroprt3LcRIOgdw&alt_id=8syXgdFjnRsyHzfUNoQsfQ&hl=en before too long...

anjackson commented 1 year ago

Hmm, it's taking a while. Perhaps we need to encourage it by linking to it from somewhere?

anjackson commented 1 year ago

Tried blogging to encourage indexing (https://anjackson.net/2023/03/09/letting-search-engines-into-the-archive/), and it may help but it's not done much yet.

So, I'm looking at ensuring the ukwa-site sitemap gets indexed and adding a page intended for search engines that links to the specific sites we want indexed: https://github.com/ukwa/ukwa-site/commit/dd457d5c014780c65a5b74142a0864764afd197a

nicolabingham commented 1 year ago

Thanks for pursuing this one. Fingers crossed it will get indexed.

anjackson commented 1 year ago

BTW, I've also started some changes that allow us to link such sites into the main site's sitemap. These are part of the changes on BETA, so when we're okay to move forward with that, we can see if that helps this issue.

anjackson commented 1 year ago

One unexpected outcome from the IIPC conference was Daniel from PWA telling me that we should very much NOT do this! Apparently, the PWA got blocked as a dangerous website by Chrome, because the clever URL mashing that PyWB does for playback sets off some kind of alarm when crawled by Google. These lists of bad sites get passed around, and it seems they only managed to get off the list because they have a good relationship with Malwarebytes.

We should perhaps consider whether the right approach is to resurrect the idea of each Target having a public web page on the site. We then allow that to be indexed, which lets people find a link to the website rather than the site itself.

Or, perhaps it is possible to offer a different version of the website to crawlers, which does not do fancy re-writing.

nicolabingham commented 1 year ago

Ah crikey! Good that you found this out from Daniel. I would favour the approach of each Target having a public web page on the site, if that's possible.