Open nicolabingham opened 5 years ago
Many of these cases appear to be down to the fact that the early records are stored under http://hostname/index.html rather than http://hostname/. Will continue to investigate as time allows.
The www.eglwysfair.org site may refer to http://lyndafis2.users.btopenworld.com/eglwysfair/mair_c.html which we have copies of from 2004-2006.
Had another look at this, and tried using OutbackCDX's alias feature. I created this alias file:
@alias https://www.iwa.org.uk/index.html https://www.iwa.org.uk/
and then added the alias to the production index:
curl -X POST --data-binary @test.cdx http://cdx2:8080/data-heritrix
And now the timelines for both / and /index.html are the same, going back to 2004. Presumably, given the service is now at iwa.wales, we could add that alias in as well and have a single timeline across all three.
If this looks about right we could look into aliasing all these examples. Unfortunately, it seems to be a bit difficult to tell what the original seed URLs were. The WCT database did not keep records of old seed URLs over time (and come to think of it, neither does W3ACT). But it should be possible to infer the seeds from the WARCs.
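If the aliasing approach is adopted, the alias file for a batch of sites could be generated rather than written by hand. The following is a minimal sketch only: the `@alias <source> <target>` line format is copied from the example above, while the function name `index_alias_lines` and the `index_name` parameter are hypothetical. It also assumes that each site's early captures really are stored under /index.html, which would need checking per site before anything is posted to the index.

```python
from urllib.parse import urlsplit, urlunsplit


def index_alias_lines(seed_urls, index_name="index.html"):
    """Return one '@alias <root>/index.html <root>/' line per root-level seed.

    Hypothetical helper: seeds with a deeper path are skipped, because the
    index.html-vs-root mismatch described above only concerns site roots.
    """
    lines = []
    for url in seed_urls:
        parts = urlsplit(url)
        if parts.path not in ("", "/"):
            continue  # not a root-level seed, leave it for manual review
        # Rebuild the root URL with a trailing slash and no query/fragment.
        root = urlunsplit((parts.scheme, parts.netloc, "/", "", ""))
        lines.append(f"@alias {root}{index_name} {root}")
    return lines


if __name__ == "__main__":
    seeds = [
        "https://www.iwa.org.uk/",
        "http://academi.org",
        "http://www.fortunecity.com/business/pencil/1572/",
    ]
    for line in index_alias_lines(seeds):
        print(line)
```

The output could be written to a file and posted with the same curl command as above. Seeds that are skipped, and sites that have changed hostname, would still need aliases added manually.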
As an example of a difficult case, bloc.org.uk doesn't seem to go back that far. Exploring the WCT database, I wonder if it used to be www.writers-bloc.org.uk/index.html? Are these different iterations of the same site, or different sites?
EDIT: Seems they are different sites, but looking at the WCT data, there are no Target Instances associated with bloc.org.uk, so as far as I can tell, that particular entry is either correct, or some data got lost from PANDAS maybe?
NLW report that these websites should all have instances continually from 2004 onwards:
http://www.bloc.org.uk/
http://www.enlli.org/
http://www.morfablog.com/
http://www.cymruarywe.org/
http://www.waleswatch.welshnet.co.uk/
http://academi.org/
http://www.eglwysfair.org/
http://www.eisteddfod.org.uk/
https://mennaelfyn.co.uk/
http://www.fortunecity.com/business/pencil/1572/
http://www.grahamedavies.com/
https://www.iwa.wales/
http://www.ewrop.com/
http://gwleidydd.blogspot.com/
There are instances from 2008 onwards in QA Wayback but the earlier instances are missing.