ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0

Some earlier instances missing in Wayback #81

Open nicolabingham opened 5 years ago

nicolabingham commented 5 years ago

NLW report that these websites should all have instances continually from 2004 onwards:

http://www.bloc.org.uk/
http://www.enlli.org/
http://www.morfablog.com/
http://www.cymruarywe.org/
http://www.waleswatch.welshnet.co.uk/
http://academi.org/
http://www.eglwysfair.org/
http://www.eisteddfod.org.uk/
https://mennaelfyn.co.uk/
http://www.fortunecity.com/business/pencil/1572/
http://www.grahamedavies.com/
https://www.iwa.wales/
http://www.ewrop.com/
http://gwleidydd.blogspot.com/

There are instances from 2008 onwards in QA Wayback but the earlier instances are missing.

anjackson commented 5 years ago

Many of these cases appear to be down to the fact that the early records are stored under http://hostname/index.html rather than http://hostname/. Will continue to investigate as time allows.
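To illustrate the kind of check involved, here is a minimal sketch that compares the earliest capture year for the two URL variants. It assumes CDX lines have already been fetched for each variant and that the second whitespace-separated field is a 14-digit timestamp. The function name and the sample lines are made up for illustration and are not real index data.

```python
def earliest_capture_year(cdx_lines):
    """Return the year of the earliest capture in some CDX lines, or None."""
    timestamps = []
    for line in cdx_lines:
        fields = line.split()
        # Assumption: field 2 is a 14-digit timestamp (YYYYMMDDhhmmss).
        if len(fields) >= 2 and len(fields[1]) == 14 and fields[1].isdigit():
            timestamps.append(fields[1])
    # Fixed-width digit strings sort chronologically, so min() is the earliest.
    return int(min(timestamps)[:4]) if timestamps else None


# Hypothetical, shortened sample lines (not taken from the real index).
root_lines = [
    "org,example)/ 20080312101500 http://example.org/ text/html 200",
]
index_lines = [
    "org,example)/index.html 20060101120000 http://example.org/index.html text/html 200",
    "org,example)/index.html 20040517093000 http://example.org/index.html text/html 200",
]

print("/ starts in", earliest_capture_year(root_lines))
print("/index.html starts in", earliest_capture_year(index_lines))
```

If the /index.html variant starts years earlier than /, that would be consistent with the explanation above.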

anjackson commented 5 years ago

The www.eglwysfair.org site may refer to http://lyndafis2.users.btopenworld.com/eglwysfair/mair_c.html which we have copies of from 2004-2006.

anjackson commented 2 years ago

Had another look at this, and tried using OutbackCDX's alias feature. I created this alias file:

@alias https://www.iwa.org.uk/index.html https://www.iwa.org.uk/

and then added the alias to the production index:

curl -X POST --data-binary @test.cdx http://cdx2:8080/data-heritrix

And now the timelines for both / and /index.html are the same, going back to 2004. Presumably, given the service is now at iwa.wales, we could add that alias in as well and have a single timeline across all three.
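If we do go on to alias the other sites, the alias lines could be generated rather than written by hand. The following is only a sketch: `build_alias_lines` is a hypothetical helper, and it assumes that every site needs the same /index.html to / mapping as the iwa.org.uk example, which would need checking per site.

```python
from urllib.parse import urlsplit, urlunsplit


def build_alias_lines(root_urls, index_name="index.html"):
    """Build one '@alias <root>index.html <root>' line per root URL."""
    lines = []
    for url in root_urls:
        parts = urlsplit(url)
        # Make sure the root path ends with a slash before appending index.html.
        path = parts.path if parts.path.endswith("/") else parts.path + "/"
        root = urlunsplit((parts.scheme, parts.netloc, path, "", ""))
        lines.append(f"@alias {root}{index_name} {root}")
    return lines


if __name__ == "__main__":
    # Two of the URLs discussed in this issue, used here only as examples.
    for line in build_alias_lines(["https://www.iwa.org.uk/", "http://academi.org/"]):
        print(line)
```

The output could be saved to a file and posted to the index in the same way as the curl command above.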

If this looks about right we could look into aliasing all these examples. Unfortunately, it seems to be a bit difficult to tell what the original seed URLs were. The WCT database did not keep records of old seed URLs over time (and come to think of it, neither does W3ACT). But it should be possible to infer the seeds from the WARCs.

anjackson commented 2 years ago

As an example of a difficult case, bloc.org.uk doesn't seem to go back that far. Exploring the WCT database, I wonder if it used to be www.writers-bloc.org.uk/index.html? Are these different iterations of the same site, or different sites?

EDIT: It seems they are different sites. Looking at the WCT data, there are no Target Instances associated with bloc.org.uk, so as far as I can tell, either that particular entry is correct, or some data got lost from PANDAS maybe?