ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Firewall issues when crawling some websites #83

Open crarugal opened 1 year ago

crarugal commented 1 year ago

Here are a few examples of sites where Heritrix has been blocked from crawling by a firewall or a CAPTCHA:


Target | Website | Example instance or latest instance | Comment
-- | -- | -- | --
https://www.webarchive.org.uk/act/targets/128627 | https://www.signatureaviation.com/ | https://www.webarchive.org.uk/act/wayback/archive/20210824094106/https://www.signatureaviation.com/ | seems ok now
https://www.webarchive.org.uk/act/targets/3706 | http://www.crawleyobserver.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20210402101105/http://www.crawleyobserver.co.uk/ | seems ok now
https://www.webarchive.org.uk/act/targets/136007 | https://www.teachwire.net/ | https://www.webarchive.org.uk/act/wayback/archive/20220106111103/https://www.teachwire.net/ | seems ok now
https://www.webarchive.org.uk/act/targets/147300 | https://www.schuh.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220721100444/https://www.schuh.co.uk/ | still not crawling
https://www.webarchive.org.uk/act/targets/155587#crawlpolicy | https://cilexjournal.org.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220220090651/https://cilexjournal.org.uk/ | still not crawling
https://www.webarchive.org.uk/act/targets/149261 | https://teamnnuh.co.uk/ |  | no captures, no info in logs
https://www.webarchive.org.uk/act/targets/156010 | https://hospicefoundation.ie/ | https://www.webarchive.org.uk/act/wayback/archive/20220715091752/https://hospicefoundation.ie/ | still not crawling
https://www.webarchive.org.uk/act/targets/156865 | https://www.odeon.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220730103903/https://www.odeon.co.uk/ | still an issue, Cloudflare
https://www.webarchive.org.uk/act/targets/157334 | https://muslimcharity.org.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220723100028/https://muslimcharity.org.uk/ | still an issue, Cloudflare
https://www.webarchive.org.uk/act/targets/159206 | https://www.greencoat-renewables.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220722094820/https://www.greencoat-renewables.com/ | still an issue
https://www.webarchive.org.uk/act/targets/158590 | https://www.diehardia.com/ |  | no captures, no info in logs
https://www.webarchive.org.uk/act/targets/157211 | https://www.poferries.com/ |  | not crawling since March 2022, -5000, -5002
https://www.webarchive.org.uk/act/targets/3851 | https://www.thetimes.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220801103716/https://www.thetimes.co.uk/ | still an issue, CloudFront
https://www.webarchive.org.uk/act/targets/160154 | https://www.techagainstterrorism.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220728094007/https://www.techagainstterrorism.org/ | still an issue, Cloudflare
https://www.webarchive.org.uk/act/targets/160474 | https://www.riverstonellc.com/ |  | not crawling since May 2022, -5002
https://www.webarchive.org.uk/act/targets/161338 | https://www.missguided.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220621091058/https://www.missguided.co.uk/ | still an issue, CAPTCHA
https://www.webarchive.org.uk/act/targets/10645 | https://www.fortnumandmason.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220627094005/https://www.fortnumandmason.com/ | still an issue, Cloudflare
https://www.webarchive.org.uk/act/targets/161938 | https://www.amnh.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220705102408/https://www.amnh.org/research/darwin-manuscripts/?__cf_chl_rt_tk=AXI6j2vIje19Hv9U7uS5sSTpF9t2GyuzsVvhaLnUVDU-1657016648-0-gaNycGzNCKU | still an issue, Cloudflare
https://www.webarchive.org.uk/act/targets/131772 | https://cumbriacrack.com/ | https://www.webarchive.org.uk/act/wayback/archive/20220730101655/https://cumbriacrack.com/ | still an issue, Cloudflare
https://www.webarchive.org.uk/act/targets/149065 | https://ort.org/ | https://www.webarchive.org.uk/act/wayback/archive/20220517094742/https://ort.org/ | still an issue, Cloudflare
https://www.webarchive.org.uk/act/targets/164270 | https://www.vistrygroup.co.uk/ | https://www.webarchive.org.uk/act/wayback/archive/20220716100156/https://www.vistrygroup.co.uk/ | not crawling, -5002
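
One possible way to spot affected captures automatically: the amnh.org instance listed above carries a `__cf_chl_rt_tk` query parameter, which looks like a Cloudflare challenge token. The sketch below flags any URL with a query parameter whose name starts with `__cf_chl_`. The prefix is an assumption drawn from that single example, and the function name `has_cloudflare_challenge_token` is hypothetical; this is not part of Heritrix or the UKWA modules.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumption: challenge redirects add query parameters whose names start with
# "__cf_chl_", as seen in the amnh.org capture reported in this issue.
CF_CHALLENGE_PREFIX = "__cf_chl_"


def has_cloudflare_challenge_token(url: str) -> bool:
    """Return True if any query parameter name starts with the challenge prefix."""
    query = urlsplit(url).query
    return any(
        name.startswith(CF_CHALLENGE_PREFIX)
        for name, _ in parse_qsl(query, keep_blank_values=True)
    )


if __name__ == "__main__":
    blocked = (
        "https://www.amnh.org/research/darwin-manuscripts/"
        "?__cf_chl_rt_tk=AXI6j2vIje19Hv9U7uS5sSTpF9t2GyuzsVvhaLnUVDU-1657016648-0-gaNycGzNCKU"
    )
    print(has_cloudflare_challenge_token(blocked))
    print(has_cloudflare_challenge_token("https://www.odeon.co.uk/"))
```

A check like this would only catch captures where the challenge token ended up in the crawled URL; sites that serve a challenge page at the original URL (as several rows above appear to) would need a check on the response body or headers instead.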
