ukwa / w3act

w3act is an annotation and curation tool for building web archive collections
Apache License 2.0

Problems with crawling https://www2.gov.scot/ in ACT #615

Closed (nicolabingham closed this issue 5 years ago)

nicolabingham commented 5 years ago

From Eilidh:

Hello, I'm having problems with this URL: https://www2.gov.scot/. I agreed to set this to collect next week, but ACT will not validate it (it had validated before).

emacglone commented 5 years ago

Just to give a little background: neither my crawls of the www.gov.scot site nor those of the transition site, www2.gov.scot, have completely worked for some time, though I've done what I can.

At the moment, Scottish Government information is spread over two sites as they move to a new CMS. Access to older material is maintained through www2, while pages are edited to meet new criteria and then moved to the new www.gov.scot. As they move, they rely more on UKWA picking up out-of-date publications for future reference, since they aren't carrying over material that is no longer current.

In both domains, ACT's crawler does not always reach into the Resource folder, where PDFs are stored: https://www2.gov.scot/Resource/. NRS (Archive-It) have a similar, but not identical, problem:

https://www.webarchive.org.uk/wayback/archive/20181225192646/https:/www2.gov.scot/Topics/farmingrural/SRDP/SRDP2014-2020RDOC/AIR2017

https://webarchive.nrscotland.gov.uk/*/https://www.webarchive.org.uk/wayback/archive/20181225192610/https:/www2.gov.scot/Topics/farmingrural/SRDP/SRDP2014-2020RDOC

In the UKWA example, the page has been captured, but the PDFs have not. As a new crawler is now in place, I thought I'd try running these again next week. However, I can't edit these targets, since the www2 URL is not valid in ACT. At the moment SG are managing to cope by redirecting to both.

Scottish Government - older content (www2): https://webarchive.org.uk/act/targets/72277

Scottish Government: https://webarchive.org.uk/act/targets/3152

anjackson commented 5 years ago

This is related to #616 - we need to simplify the URL validation.
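
As an illustration of what a simpler validation rule might look like, here is a minimal Python sketch. It is not w3act's actual validator: the function name `is_plausible_seed_url` and the specific checks (an http/https scheme and a dotted hostname) are assumptions made for this example.

```python
from urllib.parse import urlsplit


def is_plausible_seed_url(url: str) -> bool:
    """Hypothetical, deliberately permissive check for a crawl seed URL.

    Accepts any http(s) URL whose hostname contains at least one dot,
    so hosts such as www2.gov.scot are not rejected for their prefix.
    """
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs
        # (for example, unbalanced IPv6 brackets).
        return False
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname or ""
    return "." in host and not host.startswith(".") and not host.endswith(".")
```

Under these assumed rules, `is_plausible_seed_url("https://www2.gov.scot/")` is accepted, while a bare `www2.gov.scot` with no scheme is not.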

emacglone commented 4 years ago

I'm not sure whether this is a new issue, so I will append it here. Since this URL has been in use for a while now, I would like to see a result appear for the target https://www.webarchive.org.uk/act/targets/72277 in the Scottish Government collection.

At the moment, I think the system treats the www2 as subordinate to the www, which is why no result is listed. However, I want to create a flag telling the reader that the mirror is there, as content has been moving between here and the www for some time, giving it a greater importance in the timeline (2015-).

If there is a better way to do this, I am happy to be persuaded!

anjackson commented 4 years ago

Hm, this is rather difficult. The older, more common practice of www# hosts being different servers serving the same content is baked pretty deeply into all the web archiving tools. As gov.scot have used www2 to host different content, it all gets a bit awkward.

I'll ask other web archives if/how they do it.
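
To make the www# point concrete, here is a minimal Python sketch of the kind of host canonicalization described above, where a leading `www` or `www` plus digits is dropped before hosts are compared. The regular expression and the names `canonical_host` and `treated_as_same_site` are assumptions for illustration, not the actual rule used by ACT or any particular tool.

```python
import re
from urllib.parse import urlsplit

# Assumed rule: a leading "www" optionally followed by digits is treated
# as a mirror prefix and removed before hosts are compared.
_WWW_PREFIX = re.compile(r"^www\d*\.")


def canonical_host(url: str) -> str:
    """Return the lowercased hostname with any www/wwwN prefix removed."""
    host = (urlsplit(url).hostname or "").lower()
    return _WWW_PREFIX.sub("", host)


def treated_as_same_site(url_a: str, url_b: str) -> bool:
    """True when both URLs collapse to the same canonical host."""
    return canonical_host(url_a) == canonical_host(url_b)
```

Under a rule like this, `https://www2.gov.scot/` and `https://www.gov.scot/` both collapse to `gov.scot`, which would explain why www2 does not surface as a separate result even though it hosts different content.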