The 2022 crawl seems to be hitting a lot of 'abuse' alerts, which get automatically or semi-automatically routed to our hosting provider. Recently this shows up as CAPTCHA failures, but from the BitNinja docs this is likely a reaction to earlier crawler activity. In particular, based on other reports generated by fail2ban, it seems likely that the scanning for well-known URIs might be the issue. Because we scan for several of these in quick succession, this generates a short burst of 404s.
Looking at an example site, this seems plausible, as the lockdown appears to start shortly after six requests for 'well-known URIs' that each returned a 404.
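To make the pattern concrete, here is a minimal sketch of the kind of probing involved. The actual list of well-known URIs lives in the crawler beans; the paths below are just illustrative assumptions, not the configured set.

```python
import time
import requests

# Illustrative only: the real list of well-known URIs is configured in the
# crawler beans and may differ from these paths.
WELL_KNOWN_URIS = [
    "/.well-known/security.txt",
    "/.well-known/change-password",
    "/.well-known/host-meta",
    "/.well-known/dnt-policy.txt",
    "/.well-known/keybase.txt",
    "/favicon.ico",
]

def probe_well_known(host: str, delay: float = 0.0) -> None:
    """Request each well-known URI in turn; with delay=0 this produces the
    short back-to-back run of 404s that appears to trip BitNinja/fail2ban."""
    for path in WELL_KNOWN_URIS:
        resp = requests.get(f"https://{host}{path}", timeout=10)
        print(path, resp.status_code)
        time.sleep(delay)  # spacing requests out would soften the burst
```

Spacing these requests out, or trimming the list, should keep a single host below whatever per-minute 404 threshold the provider is applying.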
Note that I think it's also possible that repeated requests for robots.txt (which are expected) are leading to multiple requests for the other well-known URIs, which should only be requested once.
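If that is what's happening, a once-per-host guard would be enough to stop the repeats; something along these lines (a sketch only, not what the crawler currently does):

```python
# Sketch of a once-per-host guard: remember which hosts have already had
# their well-known URIs probed, so that a repeated robots.txt fetch does
# not re-trigger the whole set of requests.
_probed_hosts: set[str] = set()

def should_probe_well_known(host: str) -> bool:
    """Return True only the first time a given host is seen."""
    if host in _probed_hosts:
        return False
    _probed_hosts.add(host)
    return True
```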
The changes in c1d4d899cbc7b7698cfb599d5ced7f810637e749 make it possible to override the list of well-known URIs in the crawler beans, so we can easily reduce the list or switch the scanning off if needed.