dalf opened 5 years ago
From my experience, apart from my project, none of these resources are really effective: when running a Searx instance that uses Google, you have to identify the bots very quickly (within at most two requests), because if you fail to do so the instance will quickly get blocked by Google. Moreover, restricting the number of requests per second gives a bad experience to normal users, because they can't navigate quickly between the categories or browse quickly through the next result pages.
I found some tools that can be used to defend against bots:
Thank you! Some feedback:
https://github.com/fnzv/net-Shield/blob/master/shield.go : it blocks IPs from http://iplists.firehol.org/ (see https://github.com/firehol/blocklist-ipsets )
it sets up some iptables rules:
Not sure it will help.
Way too many dependencies: https://github.com/theraw/The-World-Is-Yours/blob/master/install
Some of them:
ModSecurity can be interesting, not sure:
I think one way is TLS fingerprinting: whatever HTTP headers are sent, the cipher suites are tied to the client software. Look at https://browserleaks.com/ssl : you will see different cipher suites using curl or Firefox, even if you "copy URL as curl command".
Of course it is possible to tweak this, but it is an additional safety net.
This is actually how Caddy detects MITM interception: https://github.com/caddyserver/caddy/blob/master/caddyhttp/httpserver/mitm.go
Basically, there is a problem if the request claims to come from Firefox (per the User-Agent) but the TLS ClientHello says it doesn't:
About Caddy, see:
About Nginx, see:
Still about Nginx, it is possible to execute Lua code: https://github.com/openresty/lua-nginx-module#name Not sure whether it could help one way or another.
Another link: "Protocol for bypassing challenge pages using RSA blind signed tokens" (draft-protocol-challenge-bypass-00)
The issue with TLS fingerprinting is that it would require implementing a verification profile for every browser that supports TLS 1.2 and TLS 1.3, which is a pretty long task. On my searx instance I see a wide variety of browsers, from some obscure Chinese browsers to Google Chrome. Personally, it would be overkill. Moreover, some users change their user agent for privacy reasons, and with TLS fingerprinting in place they would be blocked.
On my antibot-proxy project I deployed, two days ago, a sticky-cookie protection similar to the one Tempesta FW uses, and it reduced the amount of bots reaching my searx instance by 70%! I'm not sure how long it will stay that way, but I was right: the bad bots that do ranking manipulation on the public instances are really badly coded.
I have loads of ideas to block the bots, and I plan to completely refactor my project after my vacation so that it can be used by the vast majority of searx public instance owners.