ukwa / ukwa-pywb

GNU General Public License v3.0

Unexpected redirect to different URL #122

Closed: crarugal closed this issue 9 months ago

crarugal commented 9 months ago

The first capture of http://leeds2023.co.uk/ redirects to a completely different URL: http://www.slunglow.org/. All other captures for leeds2023 seem fine; only this first capture is affected.

I tried this through QA Wayback in the production and dev versions, and both have the same behaviour.

Accessing the first capture: https://www.webarchive.org.uk/act/wayback/archive/20150924131417/http://leeds2023.co.uk/

Will take the user to a capture of http://www.slunglow.org/ instead. [screenshot]

Looking at the CDX entry for http://leeds2023.co.uk/:
https://www.webarchive.org.uk/act/wayback/archive/cdx?output=json&url=http%3A%2F%2Fleeds2023.co.uk%2F
{"urlkey": "uk,co,leeds2023)/", "timestamp": "20150924131417", "url": "http://leeds2023.co.uk/", "mime": "text/html", "status": "302", "digest": "DL6U7LX4C2BWBS3ZO6BJYH74O2KBFCR3", "redirect": "-", "robotflags": "-", "length": "508", "offset": "790228315", "filename": "/heritrix/output/warcs/dc1-20150827/BL-20150924124957230-05165-22754/~crawler04/~8444.warc.gz", "load_url": "", "source": "archive", "source-coll": "archive"}
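For anyone wanting to repeat this check, here is a minimal Python sketch of the lookup above. It builds the same CDX query URL and flags captures whose status is a 3xx redirect. The helper names (`cdx_query_url`, `parse_cdx_lines`, `redirect_captures`) are made up for illustration and are not part of pywb. The sample record is trimmed from the entry quoted above, and the sketch assumes the endpoint returns one JSON object per line, as it appears to here.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the issue; treat it as an assumption for other deployments.
CDX_ENDPOINT = "https://www.webarchive.org.uk/act/wayback/archive/cdx"


def cdx_query_url(url, endpoint=CDX_ENDPOINT):
    """Build a CDX query URL of the same shape as the one used in the issue."""
    return endpoint + "?" + urlencode({"output": "json", "url": url})


def parse_cdx_lines(text):
    """Parse CDX output that has one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def redirect_captures(records):
    """Return only the captures whose HTTP status is a 3xx redirect."""
    return [r for r in records if r.get("status", "").startswith("3")]


# Trimmed copy of the CDX entry quoted above (some fields omitted for brevity).
SAMPLE = (
    '{"urlkey": "uk,co,leeds2023)/", "timestamp": "20150924131417", '
    '"url": "http://leeds2023.co.uk/", "mime": "text/html", '
    '"status": "302", "length": "508"}\n'
)

records = parse_cdx_lines(SAMPLE)
for rec in redirect_captures(records):
    print(rec["timestamp"], rec["url"], rec["status"])
```

To run this against the live service, the text passed to `parse_cdx_lines` would come from fetching `cdx_query_url("http://leeds2023.co.uk/")`; the fetch is left out here so the sketch runs offline.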

Looking into the WARC file: [screenshots]

Opening the WARC file entry, it looks as though the page attempts to redirect to "http://www.slunglow.org/". I'm not sure whether this means the live page had a redirect to slunglow.org at the time, or whether something went wrong at crawl time or during processing. [screenshot]
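One way to confirm where a stored response points is to read the `Location` header from the HTTP payload of the WARC response record. The sketch below uses only the Python standard library. The sample payload is a hypothetical reconstruction based on the 302 status and the redirect target described above, not the actual bytes from the WARC, and `redirect_target` is an illustrative name.

```python
import io
from http.client import parse_headers


def redirect_target(http_payload: bytes):
    """Return (status, Location) from the raw HTTP response stored in a WARC record.

    Location is None when the response carries no such header.
    """
    fp = io.BytesIO(http_payload)
    # First line is the status line, e.g. "HTTP/1.1 302 Found".
    status_line = fp.readline().decode("iso-8859-1").strip()
    status = int(status_line.split(" ", 2)[1])
    # parse_headers reads header lines up to the blank line that ends them.
    headers = parse_headers(fp)
    return status, headers.get("Location")


# Hypothetical payload: only the 302 status and the target URL come from the issue.
SAMPLE_PAYLOAD = (
    b"HTTP/1.1 302 Found\r\n"
    b"Location: http://www.slunglow.org/\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)

status, location = redirect_target(SAMPLE_PAYLOAD)
print(status, location)
```

If the header in the real record matches, the redirect was recorded from the server's response at crawl time rather than introduced at replay.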

This is the incorrect (or maybe correct) capture we're being directed to: https://www.webarchive.org.uk/wayback/en/archive/20150828172150/http://www.slunglow.org/ [screenshot]

Looking at the CDX entry for that capture:
https://www.webarchive.org.uk/wayback/en/archive/cdx?url=http%3A%2F%2Fwww.slunglow.org%2F&output=json
{"urlkey": "org,slunglow)/", "timestamp": "20150828172150", "url": "http://www.slunglow.org/", "mime": "text/html", "status": "200", "digest": "FTX2MLPS6WIW52Q3EQ6ASDCWZO6H4FLP", "redirect": "-", "robotflags": "-", "length": "10504", "offset": "212135411", "filename": "/heritrix/output/warcs/dc3-20150827/BL-20150828171634982-00220-8100/~crawler02/~8446.warc.gz", "load_url": "", "source": "archive", "source-coll": "archive", "access": "block"}

Looking at the WARC file: [screenshot]

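WARC comparison:

http://leeds2023.co.uk/ captured September 24, 2015 at 13:14:17 - dc1-20150827/BL-20150924124957230-05165-22754/~crawler04/~8444.warc.gz

http://www.slunglow.org/ captured August 28, 2015 at 17:21:50 (per the CDX timestamp 20150828172150) - dc3-20150827/BL-20150828171634982-00220-8100/~crawler02/~8446.warc.gz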

Internet Archive doesn't have any copies of leeds2023.co.uk from 2015: https://web.archive.org/web/20150101000000*/http://leeds2023.co.uk

The leeds2023 website was registered in 2015. [screenshot]

I'm not sure if this is a bug or expected behaviour. What made me suspicious is that the WARC directories carry the same date (20150827), although leeds2023 is in DC1 (dc1-20150827) and slunglow.org is in DC3 (dc3-20150827).
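To compare the two WARC paths more carefully, here is a small sketch that splits each path into its directory label, directory date, and the date embedded in the WARC file name. How these fields are interpreted is an assumption: the 17-digit number is taken to be the time the WARC file was started, and the directory date is taken to be a job-level date shared by the whole crawl. The regular expression and the `describe_warc` helper are illustrative only.

```python
import re

# Assumed layout: .../<label>-<directory date>/<prefix>-<17-digit timestamp>-...
WARC_PATH_RE = re.compile(
    r"/(?P<label>[a-z]+\d+)-(?P<dir_date>\d{8})/"
    r"(?P<prefix>[A-Za-z]+)-(?P<file_ts>\d{17})-"
)


def describe_warc(path):
    """Split a WARC path into directory label, directory date and file date."""
    m = WARC_PATH_RE.search(path)
    if m is None:
        return None
    return {
        "label": m["label"],
        "dir_date": m["dir_date"],
        # Keep only YYYYMMDD from the 17-digit file timestamp.
        "file_date": m["file_ts"][:8],
    }


# Paths copied from the two CDX entries quoted in this issue.
LEEDS = "/heritrix/output/warcs/dc1-20150827/BL-20150924124957230-05165-22754/~crawler04/~8444.warc.gz"
SLUNGLOW = "/heritrix/output/warcs/dc3-20150827/BL-20150828171634982-00220-8100/~crawler02/~8446.warc.gz"

for path in (LEEDS, SLUNGLOW):
    print(describe_warc(path))
```

Under those assumptions, the matching 20150827 in both directories would only be a shared job date, while the dates in the file names (20150924 and 20150828) line up with the two capture timestamps in the CDX entries.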

crarugal commented 9 months ago

It looks like the two sites were linked: the slunglow website mentions Leeds 2023. [screenshot]