dalf opened 5 years ago
From my experience, apart from my project, none of these resources are really effective: when running a Searx instance that uses Google, you have to identify the bots very quickly (within at most two requests), because if you fail to do so the instance will quickly get blocked by Google. Moreover, restricting the number of requests per second gives a bad experience to normal users, because they can't navigate quickly between the categories or browse quickly through the next result pages.
I found some tools that can be used to defend against bots:
Thank you! Some feedback:
https://github.com/fnzv/net-Shield/blob/master/shield.go : it blocks IPs from http://iplists.firehol.org/ (see https://github.com/firehol/blocklist-ipsets )
it sets up some iptables rules:
Not sure it will help.
Way too many dependencies: https://github.com/theraw/The-World-Is-Yours/blob/master/install
Some of them:
ModSecurity can be interesting, not sure:
I think one way is TLS fingerprinting: whatever HTTP headers are sent, the cipher suites are tied to the client software. Look at https://browserleaks.com/ssl : you will see different cipher suites using curl or Firefox, even if you "copy URL as curl command".
Of course it is possible to tweak this, but it is an additional safety net.
This is actually how Caddy detects MITM interception: https://github.com/caddyserver/caddy/blob/master/caddyhttp/httpserver/mitm.go
Basically, there is a problem if the request claims to come from Firefox (per the User-Agent) but the TLS ClientHello says it doesn't:
About Caddy, see:
About Nginx, see:
Still about Nginx, it is possible to execute Lua code: https://github.com/openresty/lua-nginx-module#name Not sure whether it could help one way or another.
Another link: "Protocol for bypassing challenge pages using RSA blind signed tokens" (draft-protocol-challenge-bypass-00)
The issue with TLS fingerprinting is that it would require implementing a verification profile for every browser that supports TLS 1.2 and TLS 1.3, which is a pretty long task. On my searx instance I see a wide variety of browsers, from some obscure Chinese browsers to Google Chrome. Personally, it would be overkill. Moreover, some users change their user agent for privacy reasons, and with TLS fingerprinting in place they would be blocked.
On my antibot-proxy project I deployed, two days ago, a sticky-cookie protection similar to the one Tempesta FW uses, and it reduced the amount of bots reaching my searx instance by 70%! I'm not sure how long it will stay that way, but I was right: the bad bots that do ranking manipulation on the public instances are really badly coded.
I have loads of ideas to block the bots, and I plan to completely refactor my project after my vacation so that it can be used by the vast majority of searx public instance owners.