nicolabingham closed this issue 5 years ago
Just to give a little background: neither my crawls of the www.gov.scot site nor those of the transition site, www2.gov.scot, have completely worked for some time, though I've done what I can.
At the moment, Scottish Government information is spread over two sites as they move to a new CMS; access to older material is maintained through www2, as pages are edited to new criteria then moved to the new www.gov.scot. As they move, they rely more on UKWA picking up out-of-date publications for future reference - they aren't taking over material which isn't current.
In both domains, ACT's crawler does not always reach into the Resource folder, where PDFs are stored: https://www2.gov.scot/Resource/. NRS (Archive-It) have a similar, but not identical, problem:
In the UKWA example, the page has been captured, but the PDFs haven't. As a new crawler is now in place, I thought I'd try running these again next week. However, I can't edit these targets, since the www2 URL is not accepted as valid in ACT. At the moment SG are managing to cope by redirecting to both.
Scottish Government - older content (www2): https://webarchive.org.uk/act/targets/72277
Scottish Government: https://webarchive.org.uk/act/targets/3152
This is related to #616 - we need to simplify the URL validation.
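To make "simplify the URL validation" concrete, here is a minimal sketch of what a more permissive check could look like. It is not ACT's actual validation code, which isn't shown in this thread; the function name `is_plausible_target_url` and the acceptance rules are assumptions for illustration only. The idea is to accept any http(s) URL with a dotted hostname and not treat `www`, `www2`, etc. specially.

```python
from urllib.parse import urlsplit


def is_plausible_target_url(url: str) -> bool:
    """Hypothetical simplified check (not ACT's real validator).

    Accepts any http/https URL whose hostname has at least two
    non-empty labels, with no special handling of www-style prefixes.
    """
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname or ""
    except ValueError:
        # urlsplit can raise on malformed input such as a broken IPv6 literal
        return False
    if parts.scheme not in ("http", "https"):
        return False
    labels = host.split(".")
    # e.g. "www2.gov.scot" -> ["www2", "gov", "scot"]
    return len(labels) >= 2 and all(labels)
```

Under these assumed rules, both `https://www.gov.scot/` and `https://www2.gov.scot/` would pass, while a non-web scheme or a string with no hostname would be rejected.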
I'm not sure if this is a new issue, so I will append it here. Since this URL has been in use for a while now, I would like to see a result appear for the target: https://www.webarchive.org.uk/act/targets/72277 in the Scottish Government collection.
At the moment, I think the system treats the www2 as subordinate to the www, which is why no result is listed. However, I want to create a flag telling the reader that the mirror is there, as content has been moving between here and the www for some time, giving it a greater importance in the timeline (2015-).
If there is a better way to do this, I am happy to be persuaded!
Hm, this is rather difficult. The older, more common practice of www# hosts being different servers serving the same content is baked pretty deeply into all the web archiving tools. As gov.scot have used it to host different content, it all gets a bit awkward.
I'll ask other web archives if/how they do it.
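As a rough illustration of the www# convention described above, the sketch below shows how a lookup key that strips a leading `www` plus optional digits makes the two Scottish Government hosts indistinguishable. This is an assumed simplification for illustration, not the code of ACT or of any particular archiving tool; the name `lookup_key` and the regular expression are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Matches a leading "www", "www2", "www3", ... label (illustrative rule only)
_WWW_PREFIX = re.compile(r"^www\d*\.")


def lookup_key(url: str) -> str:
    """Hypothetical host key that treats www# hosts as mirrors of one site."""
    host = (urlsplit(url).hostname or "").lower()
    return _WWW_PREFIX.sub("", host)


# Both hosts collapse to the same key, so content that differs between
# them cannot be told apart at this level.
same = lookup_key("https://www.gov.scot/") == lookup_key("https://www2.gov.scot/")
```

With a rule like this, `same` is true, which is the awkward case here: the two hosts are treated as one site even though they serve different content.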
From Eilidh:
Hello, I'm having problems with this URL: https://www2.gov.scot/. I agreed to set this to collect next week, but ACT will not validate it (it had validated before).