ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Domain crawl tripping 'abuse' alerts #84

Open anjackson opened 1 year ago

anjackson commented 1 year ago

The 2022 crawl seems to be hitting a lot of 'abuse' alerts, which get automatically or semi-automatically routed to our hosting provider. Recently this shows up as captcha failures, but from the BitNinja docs this is likely a reaction to earlier crawler activity. In particular, based on other reports generated by fail2ban, it seems likely that the scanning for well-known URIs might be the issue. Because we scan for several of these in quick succession, this generates a short burst of 404s.

Looking at an example site, this seems plausible, as the lockdown appears to start fairly shortly after six requests for 'well-known URIs' that returned 404.
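To illustrate why a short burst of 404s could trip this kind of protection, here is a minimal sketch of a sliding-window counter in the style of a fail2ban rule. The class name `BurstDetector` and the thresholds (more than five 404s within 60 seconds) are assumptions for illustration only; they are not taken from the actual BitNinja or fail2ban configuration on the affected site.

```python
from collections import deque


class BurstDetector:
    """Hypothetical model of a rate-based rule: block a client once it
    produces more than max_errors 404s within window_seconds."""

    def __init__(self, max_errors=5, window_seconds=60.0):
        # Both thresholds are assumed values, not a real site's settings.
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._hits = {}  # client id -> deque of 404 timestamps

    def record(self, client, status, timestamp):
        """Record one response; return True if the client would now be blocked."""
        if status != 404:
            return False
        hits = self._hits.setdefault(client, deque())
        hits.append(timestamp)
        # Discard 404s that have fallen outside the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_errors


# Six 404s one second apart: the sixth one crosses the assumed threshold.
burst = BurstDetector()
burst_results = [burst.record("crawler", 404, float(t)) for t in range(6)]

# The same six 404s spaced 30 seconds apart never cross it.
spaced = BurstDetector()
spaced_results = [spaced.record("crawler", 404, t * 30.0) for t in range(6)]
```

Under these assumed thresholds, `burst_results` ends in `True` while `spaced_results` stays all `False`, which suggests that spacing out or trimming the well-known URI checks could avoid the trigger.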

Note that I think it's also possible that repeated requests for robots.txt (which are expected) are leading to multiple requests for the other well-known URIs (which should only be requested once).
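A minimal sketch of the intended "only once per host" behaviour is below. This is not the crawler's actual code: the class `WellKnownScheduler`, the method `on_robots_fetched`, and the example paths are all hypothetical, and the real logic lives in the Java crawler modules. The idea is just that a robots.txt refetch should not re-enqueue the other well-known URIs.

```python
from urllib.parse import urlsplit

# Illustrative placeholder paths, not the crawler's real list.
EXAMPLE_WELL_KNOWN_PATHS = ["/.well-known/security.txt", "/humans.txt"]


class WellKnownScheduler:
    """Hypothetical helper: enqueue well-known URIs once per origin,
    however many times robots.txt is refetched."""

    def __init__(self, paths):
        self.paths = list(paths)
        self._seen_origins = set()

    def on_robots_fetched(self, robots_url):
        """Return the URLs to enqueue after a robots.txt fetch.

        Returns an empty list if this origin has already been handled."""
        parts = urlsplit(robots_url)
        origin = (parts.scheme, parts.netloc)
        if origin in self._seen_origins:
            return []
        self._seen_origins.add(origin)
        return [f"{parts.scheme}://{parts.netloc}{path}" for path in self.paths]


scheduler = WellKnownScheduler(EXAMPLE_WELL_KNOWN_PATHS)
first = scheduler.on_robots_fetched("https://example.org/robots.txt")
second = scheduler.on_robots_fetched("https://example.org/robots.txt")
```

Here `first` holds the two example URLs and `second` is empty, so a robots.txt refresh adds no further 404s for that host.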

anjackson commented 1 year ago

The changes in c1d4d899cbc7b7698cfb599d5ced7f810637e749 make it possible to override the list of well-known URIs in the crawler beans, so we can easily reduce the list or switch the scanning off if needed.