Target | Website | Example instance or latest instance | Comment
-- | -- | -- | --
https://www.webarchive.org.uk/act/targets/128627 | https://www.signatureaviation.com/ | https://www.webarchive.org.uk/act/wayback/archive/20210824094106/https://www.signatureaviation.com/ | seems ok now
https://www.webarchive.org.uk/act/targets/3706 | http://www.crawleyobserver.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20210402101105/http://www.crawleyobserver.co.uk/ | seems ok now
https://www.webarchive.org.uk/act/targets/136007 | https://www.teachwire.net/ | https://www.webarchive.org.uk/act/wayback/archive/20220106111103/https://www.teachwire.net/ | seems ok now
https://www.webarchive.org.uk/act/targets/147300 | https://www.schuh.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220721100444/https://www.schuh.co.uk/ | still not crawling
https://www.webarchive.org.uk/act/targets/155587#crawlpolicy | https://cilexjournal.org.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220220090651/https://cilexjournal.org.uk/ | still not crawling
https://www.webarchive.org.uk/act/targets/149261 | https://teamnnuh.co.uk/ | | no captures, no info in logs
https://www.webarchive.org.uk/act/targets/156010 | https://hospicefoundation.ie/ | https://www.webarchive.org.uk/act/wayback/archive/20220715091752/https://hospicefoundation.ie/ | still not crawling
https://www.webarchive.org.uk/act/targets/156865 | https://www.odeon.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220730103903/https://www.odeon.co.uk/ | still an issue, cloudflare
https://www.webarchive.org.uk/act/targets/157334 | https://muslimcharity.org.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220723100028/https://muslimcharity.org.uk/ | still an issue, cloudflare
https://www.webarchive.org.uk/act/targets/159206 | https://www.greencoat-renewables.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220722094820/https://www.greencoat-renewables.com/ | still an issue
https://www.webarchive.org.uk/act/targets/158590 | https://www.diehardia.com/ | | no captures, no info in logs
https://www.webarchive.org.uk/act/targets/157211 | https://www.poferries.com/ | | not crawling since March 2022, -5000, -5002
https://www.webarchive.org.uk/act/targets/3851 | https://www.thetimes.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220801103716/https://www.thetimes.co.uk/ | still an issue, cloudfront
https://www.webarchive.org.uk/act/targets/160154 | https://www.techagainstterrorism.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220728094007/https://www.techagainstterrorism.org/ | still an issue, cloudflare
https://www.webarchive.org.uk/act/targets/160474 | https://www.riverstonellc.com/ | | not crawling since May 2022, -5002
https://www.webarchive.org.uk/act/targets/161338 | https://www.missguided.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220621091058/https://www.missguided.co.uk/ | still an issue, captcha
https://www.webarchive.org.uk/act/targets/10645 | https://www.fortnumandmason.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220627094005/https://www.fortnumandmason.com/ | still an issue, cloudflare
https://www.webarchive.org.uk/act/targets/161938 | https://www.amnh.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220705102408/https://www.amnh.org/research/darwin-manuscripts/?__cf_chl_rt_tk=AXI6j2vIje19Hv9U7uS5sSTpF9t2GyuzsVvhaLnUVDU-1657016648-0-gaNycGzNCKU | still an issue, cloudflare
https://www.webarchive.org.uk/act/targets/131772 | https://cumbriacrack.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220730101655/https://cumbriacrack.com/ | still an issue, cloudflare
https://www.webarchive.org.uk/act/targets/149065 | https://ort.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220517094742/https://ort.org/ | still an issue, cloudflare
https://www.webarchive.org.uk/act/targets/164270 | https://www.vistrygroup.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220716100156/https://www.vistrygroup.co.uk/ | not crawling, -5002
The table above lists examples of targets where Heritrix has been blocked by a firewall or CAPTCHAs; the Comment column gives the latest status of each.
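In case it helps with triage, here is a rough sketch of how the Comment column could be tallied by likely cause. The function names and category labels are illustrative only, and the sample holds just four rows copied from the table. The meanings given for -5000 and -5002 are as I understand them from the Heritrix status code glossary and should be checked against the version in use.

```python
import re
from collections import Counter

# Assumed meanings of the Heritrix status codes seen in the comments;
# verify against the glossary for the Heritrix version being run.
HERITRIX_STATUS = {
    -5000: "out of scope upon re-examination at fetch time",
    -5002: "blocked by a custom processor",
}

# Four rows copied from the table, in the same pipe-separated layout.
SAMPLE = """
Target | Website | Example instance or latest instance | Comment
-- | -- | -- | --
https://www.webarchive.org.uk/act/targets/128627 | https://www.signatureaviation.com/ | https://www.webarchive.org.uk/act/wayback/archive/20210824094106/https://www.signatureaviation.com/ | seems ok now
https://www.webarchive.org.uk/act/targets/149261 | https://teamnnuh.co.uk/ | | no captures, no info in logs
https://www.webarchive.org.uk/act/targets/156865 | https://www.odeon.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220730103903/https://www.odeon.co.uk/ | still an issue, cloudflare
https://www.webarchive.org.uk/act/targets/157211 | https://www.poferries.com/ | | not crawling since March 2022, -5000, -5002
"""


def parse_rows(table):
    """Split a pipe-separated table into dicts, skipping header and separator rows."""
    rows = []
    for line in table.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != 4 or cells[0] in ("Target", "--"):
            continue
        rows.append(dict(zip(("target", "website", "instance", "comment"), cells)))
    return rows


def status_codes(comment):
    """Return any four-digit negative status codes mentioned in a comment."""
    return [int(code) for code in re.findall(r"-\d{4}\b", comment)]


def classify(comment):
    """Map a free-text comment to a coarse, hypothetical category label."""
    text = comment.lower()
    if "ok now" in text:
        return "resolved"
    for keyword in ("cloudflare", "cloudfront", "captcha"):
        if keyword in text:
            return keyword
    if status_codes(text):
        return "heritrix-status"
    if "no captures" in text:
        return "no-captures"
    return "unexplained"


if __name__ == "__main__":
    rows = parse_rows(SAMPLE)
    for category, count in Counter(classify(r["comment"]) for r in rows).most_common():
        print(category, count)
    for row in rows:
        for code in status_codes(row["comment"]):
            print(row["website"], code, HERITRIX_STATUS.get(code, "unknown"))
```

Rows whose comment gives no cause, such as "still not crawling", fall into the "unexplained" bucket, which may be a useful list of targets that still need their crawl logs checked.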