ukwa / ukwa-pywb

GNU General Public License v3.0

Unknown QA PyWb playback issue #60

Closed crarugal closed 3 years ago

crarugal commented 3 years ago

Occasionally, QA PyWb will have difficulty playing back an instance. The error will look like this in QA PyWb: image

The text output is:

{'args': {'coll': 'archive', 'type': 'replay', 'index_paths': 'cdx+http://cdx.api.wa.bl.uk/data-heritrix', 'archive_paths': 'webhdfs://hdfs.api.wa.bl.uk'}, 'error': '{"message": "Self Redirect www.boris-johnson.com/?post_type=feedback&p=16511 -> www.boris-johnson.com?post_type=feedback&p=16511", "errors": {"WARCPathLoader": "Self Redirect www.boris-johnson.com/?post_type=feedback&p=16511 -> www.boris-johnson.com?post_type=feedback&p=16511"}}'}
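For anyone triaging similar reports, here is a minimal sketch of how the text output above can be unpacked. It assumes the output is a Python dict literal whose 'error' value is itself a JSON string, which is how it appears in this report; the helper name unpack_error is hypothetical and not part of PyWb.

```python
import ast
import json

# Text output copied verbatim from the report above.
RAW = r"""{'args': {'coll': 'archive', 'type': 'replay', 'index_paths': 'cdx+http://cdx.api.wa.bl.uk/data-heritrix', 'archive_paths': 'webhdfs://hdfs.api.wa.bl.uk'}, 'error': '{"message": "Self Redirect www.boris-johnson.com/?post_type=feedback&p=16511 -> www.boris-johnson.com?post_type=feedback&p=16511", "errors": {"WARCPathLoader": "Self Redirect www.boris-johnson.com/?post_type=feedback&p=16511 -> www.boris-johnson.com?post_type=feedback&p=16511"}}'}"""


def unpack_error(raw):
    """Hypothetical helper: return (args, message, per-loader errors)."""
    outer = ast.literal_eval(raw)       # outer layer is a Python dict literal
    inner = json.loads(outer["error"])  # 'error' holds a JSON document as a string
    return outer["args"], inner["message"], inner.get("errors", {})


args, message, loader_errors = unpack_error(RAW)
print(message)
for loader, detail in loader_errors.items():
    print(f"{loader}: {detail}")
```

This makes it easier to see that the only loader reporting a failure is WARCPathLoader, and that the failure is a "Self Redirect" rather than a missing WARC.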

This example relates to:
http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html
https://www.webarchive.org.uk/act/targets/75291
image

There are currently 13 captures, of which only 2005 and 2010 render correctly.
https://www.webarchive.org.uk/act/wayback/archive/*/http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html
image

In QA PyWb, the captures for 2005 and 2010 work fine: image (6)

But when viewing them on site within a Legal Deposit Library (LDL), you'll be presented with this for all 13 captures: image (5)

The issue we'd like to understand is the unknown error that's causing this in QA PyWb playback: image

Performing a CDX lookup:
http://cdx.api.wa.bl.uk/data-heritrix?sort=reverse&url=http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html

com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20171001165824 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html warc/revisit 0 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 0 831627754 /heritrix/output/warcs/quarterly/20171001020057/BL-20171001162717327-00203-62~ukwa-h3-pulse-quarterly~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20170703141726 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 0 982885537 /heritrix/output/warcs/dc2-20170515/BL-20170703141353704-08566-3882~crawler04.bl.uk~8445.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20170701193738 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html warc/revisit 0 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 0 18882433 /heritrix/output/warcs/quarterly/20170701010128/BL-20170701193627114-00237-63~ukwa-h3-pulse-quarterly~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20170401103815 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 0 345229347 /heritrix/output/warcs/quarterly/20170401010022/BL-20170401103259718-00419-63~ukwa-h3-pulse-quarterly~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20170215051613 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 0 50473077 /heritrix/output/warcs/quarterly/20170214221843/BL-20170215051540005-00297-62~ukwa-h3-pulse-quarterly~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20170106051835 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html warc/revisit 0 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 0 722944987 /heritrix/output/warcs/quarterly/20170104102611/BL-20170106030652322-00235-14322~crawler03~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20161003191107 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 0 33572067 /heritrix/output/warcs/quarterly/20161001111030/BL-20161003190834173-00691-2025~194.66.232.91~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20160521182234 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 404 4XQ3ZP4JHZCRAH24GOHMRHMYPKB3OBZD - - 0 628907995 /heritrix/output/warcs/quarterly/20160520112913/BL-20160521040215286-00938-10949~crawler03~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20160402191116 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 404 3DQALPMMQEWXJHZ3TN2PAQ4AVMMPKHTH - - 0 469352371 /heritrix/output/warcs/quarterly/20160401111424/BL-20160402185546891-00186-31312~crawler03~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20160224040019 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 404 GV564KL4FF7X7TGKLKKCUKT76YPN4JNJ - - 0 667051381 /heritrix/output/warcs/quarterly/20160222175239/BL-20160223144742333-00985-19911~crawler03~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20151219222802 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 400 NPP6XSSKZMW2LAC64DNZHYELI4GHCZRT - - 0 900431492 /heritrix/output/warcs/dc2-20150827/BL-20151219215304429-18655-15367~crawler04~8445.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20100516212647 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html - 0 - - 0 274900 /data/102006/102014/WARCS/BL-102014-0.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20050119120000 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 200 524447327526744433272525345436766665537564====== - - 0 286881 /data/102006/102014/WARCS/BL-102014-0.warc.gz
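To compare captures at a glance, the lookup output can be reduced to year, MIME type and status. The following is a rough sketch only: it assumes whitespace-separated CDX lines laid out as in the lookup above, uses four of those lines as sample data, and the function name parse_cdx_line is hypothetical.

```python
from collections import Counter

# Four of the index lines from the lookup above, chosen to cover each kind of result.
SAMPLE = """\
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20171001165824 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html warc/revisit 0 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 0 831627754 /heritrix/output/warcs/quarterly/20171001020057/BL-20171001162717327-00203-62~ukwa-h3-pulse-quarterly~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20170703141726 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 0 982885537 /heritrix/output/warcs/dc2-20170515/BL-20170703141353704-08566-3882~crawler04.bl.uk~8445.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20160521182234 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 404 4XQ3ZP4JHZCRAH24GOHMRHMYPKB3OBZD - - 0 628907995 /heritrix/output/warcs/quarterly/20160520112913/BL-20160521040215286-00938-10949~crawler03~8443.warc.gz
com,boris-johnson)/archives/2004/12/tsunami_disaste.html 20050119120000 http://www.boris-johnson.com/archives/2004/12/tsunami_disaste.html text/html 200 524447327526744433272525345436766665537564====== - - 0 286881 /data/102006/102014/WARCS/BL-102014-0.warc.gz
"""


def parse_cdx_line(line):
    """Hypothetical helper: pick out the fields needed for triage."""
    fields = line.split()
    # Only the leading fields and the trailing filename are read, because
    # one line in the lookup has fewer middle columns than the others.
    return {
        "urlkey": fields[0],
        "timestamp": fields[1],
        "url": fields[2],
        "mime": fields[3],
        "status": fields[4],
        "filename": fields[-1],
    }


records = [parse_cdx_line(line) for line in SAMPLE.splitlines() if line.strip()]
status_counts = Counter(record["status"] for record in records)

for record in records:
    print(record["timestamp"][:4], record["mime"], record["status"])
```

Run over the full lookup, a listing like this makes the pattern clearer: the recent captures are 301 responses or revisit records sharing one digest, preceded by 404 and 400 responses, with only the 2005 capture recorded as 200.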

anjackson commented 3 years ago

I think this might be a manifestation of https://github.com/webrecorder/pywb/issues/591

anjackson commented 3 years ago

Ah, sadly not that. It seems a crawl got interrupted before resolving a redirect, and this gets reported as a loop. I just popped the URL in the queue, but looking at the crawl-time index, it seems this is now 404. :-(

com,boris-johnson)/?p=17067&post_type=feedback 20201211212738 http://www.boris-johnson.com/?post_type=feedback&p=17067 text/html 301 SHAYDLSL2F7HXTO6XGGG75SD6PSRRSDDLA====== - - 1142 52519761 BL-NPLD-WEBRENDER-frequent-npld-20201005094358-20201211212825939-06150-iy5brqfo.warc.gz
com,boris-johnson)/?p=17067&post_type=feedback 20201211212739 https://www.boris-johnson.com/?post_type=feedback&p=17067 text/html 404 SHAXNORNRTBQM3SE6TIUWZUQJBTSGSVO4M====== - - 5164 52755645 BL-NPLD-WEBRENDER-frequent-npld-20201005094358-20201211212825939-06150-iy5brqfo.warc.gz
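
The two index lines above share one urlkey even though one URL is http and the other https. As a rough illustration of why a redirect whose target was never captured could resolve back to itself, here is a simplified sketch of that kind of key. It is an assumption-laden approximation written for this thread, not the canonicalizer PyWb or the CDX server actually uses; simple_urlkey is a hypothetical name.

```python
from urllib.parse import urlsplit


def simple_urlkey(url):
    """Simplified, hypothetical SURT-style key: no scheme, no 'www.', sorted query."""
    parts = urlsplit(url.lower())
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    # Reverse the host labels: boris-johnson.com -> com,boris-johnson
    surt_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    key = f"{surt_host}){path}"
    if parts.query:
        key += "?" + "&".join(sorted(parts.query.split("&")))
    return key


http_key = simple_urlkey("http://www.boris-johnson.com/?post_type=feedback&p=17067")
https_key = simple_urlkey("https://www.boris-johnson.com/?post_type=feedback&p=17067")
print(http_key)
print(http_key == https_key)
```

Because the scheme is dropped from the key, a 301 from the http URL to its https twin points at the same index entry. If the crawl stopped before the https target was fetched, the only capture available under that key is the 301 itself, which would be consistent with the "Self Redirect" error reported earlier in the thread.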