searx / searx-docker

Create a searx instance using Docker
GNU Affero General Public License v3.0

antibot: how ? #2

Open dalf opened 5 years ago

dalf commented 5 years ago
unixfox commented 5 years ago

From my experience, apart from my project, these resources are not really effective. When running a searx instance that uses Google, you have to identify the bots very quickly (before a bot has made 2 requests at most), because if you fail to do so the instance quickly gets blocked by Google. Moreover, restricting the number of requests per second gives normal users a bad experience, because they can't move quickly between the categories or page quickly through the next results.

unixfox commented 5 years ago

I found some tools that can be used to defend against bots:

dalf commented 5 years ago

Thank you! Some feedback:

Tempesta FW

use cases :

net-Shield

https://github.com/fnzv/net-Shield/blob/master/shield.go : it blocks IP from http://iplists.firehol.org/ (see https://github.com/firehol/blocklist-ipsets )

It sets up some iptables rules:

Not sure it will help.
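To make the blocklist idea concrete, here is a minimal sketch of checking a client IP against an ipset-style list. It is not net-Shield's code (which is Go and works through iptables). It assumes the list is plain text with one IP or CIDR per line and `#` comments, and the function names and the sample entries (documentation-only ranges) are made up for illustration.

```python
import ipaddress


def parse_blocklist(text):
    """Parse a plain-text list: one IP or CIDR per line, '#' starts a comment."""
    networks = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            # A bare IP becomes a single-address network (/32 or /128).
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(ip, networks):
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Hypothetical excerpt; these are documentation-only ranges, not real list data.
SAMPLE = """
# example list
192.0.2.0/24
198.51.100.7
"""
BLOCKLIST = parse_blocklist(SAMPLE)
```

A check like this only catches bots coming from already-listed addresses, which is probably why it is of limited help here.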

Nginx L7 DDoS Protection

Way too many dependencies: https://github.com/theraw/The-World-Is-Yours/blob/master/install

Some of them :

ModSecurity can be interesting, not sure:

Note

I think that one way is TLS fingerprinting: whatever HTTP headers are sent, the cipher suites are tied to the client. Look at https://browserleaks.com/ssl : you will get different cipher suites with Curl and with Firefox, even if you use "copy URL as Curl command".

Of course, it is possible for a client to tweak its cipher suites, so this is only an additional safety net.

This is actually the way Caddy detects MITM : https://github.com/caddyserver/caddy/blob/master/caddyhttp/httpserver/mitm.go

Basically, there is a problem if the request claims to come from Firefox but the TLS handshake does not look like Firefox's.
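A minimal sketch of this kind of mismatch check, to show the shape of the idea rather than Caddy's actual heuristics. It assumes something in front of searx (the TLS terminator) can hand over the list of cipher suite IDs offered in the ClientHello. The table, the function names and the suite orderings are hypothetical placeholders, not measured browser fingerprints.

```python
# Hypothetical table: claimed client family -> cipher suite IDs it is expected
# to offer first in its ClientHello. Placeholder values for illustration only.
EXPECTED_PREFIX = {
    "firefox": (0x1301, 0x1303, 0x1302),
    "curl": (0x1302, 0x1303, 0x1301),
}


def claimed_family(user_agent):
    """Return the client family named in the User-Agent, or None if unknown."""
    ua = user_agent.lower()
    for family in EXPECTED_PREFIX:
        if family in ua:
            return family
    return None


def looks_inconsistent(user_agent, offered_suites):
    """True if the User-Agent claims a known client but the suites disagree."""
    family = claimed_family(user_agent)
    if family is None:
        return False  # unknown client: this check has no opinion
    expected = EXPECTED_PREFIX[family]
    return tuple(offered_suites[:len(expected)]) != expected


FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
# A client that copied Firefox's headers but kept its own TLS stack:
suspicious = looks_inconsistent(FIREFOX_UA, [0x1302, 0x1303, 0x1301])
```

Returning `False` for unknown clients keeps the check a safety net: it only flags requests whose claimed identity contradicts the handshake.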

About Caddy, see :

About Nginx, see :


Still about Nginx, it is possible to execute Lua code: https://github.com/openresty/lua-nginx-module#name . Not sure whether it could help in one way or another.


Another link: Protocol for bypassing challenge pages using RSA blind signed tokens (draft-protocol-challenge-bypass-00)

unixfox commented 5 years ago

The issue with TLS fingerprinting is that it would require implementing a verification for every browser that supports TLS 1.2 and TLS 1.3, which is a pretty long task. My searx instance is used by a wide variety of browsers, from some weird Chinese browsers to Google Chrome. Personally, I think it would be overkill. Moreover, some users change their user agent for privacy reasons, and TLS fingerprinting would block their access.

In my antibot-proxy project, two days ago I deployed a sticky cookie protection similar to the one Tempesta FW uses, and it reduced the number of bots reaching my searx instance by 70%! I'm not sure how long it will stay that way, but I was right: the bad bots that do ranking manipulation on the public instances are really badly coded.
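For readers unfamiliar with the idea, a sticky cookie check can be sketched roughly as follows. This is not the code of antibot-proxy or Tempesta FW. It assumes the cookie is an HMAC of the client IP, and the secret, the cookie name and the function names are all made up for illustration.

```python
import hashlib
import hmac

SECRET = b"change-me"    # hypothetical server-side secret
COOKIE_NAME = "sticky"   # hypothetical cookie name


def make_token(client_ip):
    """Derive the cookie value for a client from its IP address."""
    return hmac.new(SECRET, client_ip.encode(), hashlib.sha256).hexdigest()


def check_request(client_ip, cookies, path):
    """Return (status, headers) for a request.

    A client without a valid cookie gets the cookie plus a redirect to the
    same path; only clients that store and resend it reach the backend.
    """
    expected = make_token(client_ip)
    token = cookies.get(COOKIE_NAME, "")
    if hmac.compare_digest(token.encode(), expected.encode()):
        return 200, {}
    return 302, {
        "Set-Cookie": f"{COOKIE_NAME}={expected}; HttpOnly",
        "Location": path,
    }
```

A badly coded bot that ignores `Set-Cookie` keeps getting the redirect and never reaches searx, while a normal browser only pays for one extra round trip on its first request.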

I have loads of ideas for blocking the bots, and I plan to completely refactor my project after my vacation so that it is usable by the vast majority of searx public instance owners.