mitchellkrogza / nginx-ultimate-bad-bot-blocker

Nginx Block Bad Bots, Spam Referrer Blocker, Vulnerability Scanners, User-Agents, Malware, Adware, Ransomware, Malicious Sites, with anti-DDOS, Wordpress Theme Detector Blocking and Fail2Ban Jail for Repeat Offenders

[Question] Would this slow down my NGINX? #438

Closed: issuefiler closed this issue 2 years ago

issuefiler commented 3 years ago

Among those unnecessarily complicated files for a supposedly simple bot blocker, it seems

are enough for me to build the blacklist.


Anyway, the question is: that's a huge number of string comparisons, so wouldn't it degrade NGINX's performance?

mitchellkrogza commented 3 years ago

Zero impact on performance, thousands of users without issue.

issuefiler commented 3 years ago

@mitchellkrogza Great. And thank you for the great blacklist. I’m using it.

By the way, I’m curious, what makes thousands of string comparisons computationally cheap (“zero impact on performance”)?

mitchellkrogza commented 3 years ago

The entire globalblacklist.conf is only 510 KB. It loads into Nginx's memory once and stays resident there all the time, which is what makes it so damn fast. And this is the size the list is at present, despite all the additions over the years since I started it.

issuefiler commented 3 years ago

Its O(n)-matching-per-request nature

But still, it performs thousands of matches against regular expressions loaded in memory, per request. It's not a matter of memory usage, or of whether they are compiled and loaded into memory. I don't know much about the internals of NGINX, but as far as I know, NGINX's map works like an O(1) hash map only for plain strings; for regular expressions, it just does sequential O(n) matching against every expression: no hashes, no trees. If that's the case…
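For concreteness, here is a minimal sketch of that distinction as I understand it (the user agents here are made up): the plain entries share one hash lookup, while each ~* entry is a separate regex tried in turn.

```nginx
# Hypothetical map mixing plain strings and regexes.
map $http_user_agent $is_bad_bot {
    default        0;
    "BadBot/1.0"   1;   # plain string: hashed, one O(1) lookup covers all such entries
    "EvilFetcher"  1;   # plain string: hashed
    "~*scraper"    1;   # regex: tested sequentially, per request
    "~*harvest"    1;   # regex: tested sequentially, per request
}
```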

Igor Vladimirovich Sysoev's explanation of the internals of the map directive

On Sat, Sep 03, 2011 at 05:11:07PM +0300, Calin Don wrote:

Hi,

I understood that if you have only strings in a map directive the access time is O(1).

Yes.

What if you have the hostnames directive enabled and some of the items are in the format *.example.com?

nginx tests the address by its parts: first “com,” then “example.” So the cost of the operation varies from O(1) to O(N), where N is the number of parts of the longest name. For example, if you test example.net only against

  • *.example.com
  • *.sub.domain.com

then this will be O(1). If you test www.sub.domain.com, it will be O(3).

What about if you mix with regular expressions?

The regular expressions are tested sequentially.

nginx tests map in the following order:

  1. exact names,
  2. *.names,
  3. names.*,
  4. regexes.

The first match stops the testing.

-- Igor Sysoev


(nginx mailing list, nginx@nginx.org, http://mailman.nginx.org/mailman/listinfo/nginx)
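To make that ordering concrete, here is a minimal sketch of a map using all four forms (the names are hypothetical):

```nginx
# Sketch of the lookup order described above (hypothetical names).
map $host $group {
    hostnames;              # enables part-by-part wildcard matching
    default           0;
    example.com       1;    # 1. exact name: one hash lookup
    *.example.com     2;    # 2. leftmost wildcard: "com" tested first, then "example"
    www.example.*     3;    # 3. rightmost wildcard
    "~^api-\d+\."     4;    # 4. regex: tried sequentially, as a last resort
}
```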

and my worries about that

Considering the thousands of users without reported performance issues, processors nowadays must be faster than I thought. Still, what worries me is this:

Several milliseconds is too short an instant to notice, but that's not how you measure the performance of software, because it can be run many, many, many times in a moment. Imagine malloc() taking that long.

A few more questions about possible optimization

But my application wouldn't be able to live without this, so I want to talk about optimization. Hopefully I'm not being paranoid.

This blocker performs thousands, if not millions, of matches per second against NGINX regular expressions like "~*(?:\b)000free\.us(?:\b)". That is, case conversion is performed on every match (I'm not sure NGINX is smart enough to group all the ~* expressions and perform it just once), looking for specific substrings, matching in O(n).

Optimization: pre-normalization and plain string mapping, O(1).

If we could take the host of the HTTP referer and trim and lowercase it beforehand, we could perform a plain string mapping in O(1), because NGINX uses hash maps for plain string mapping, and hash maps allow O(1) lookups; it would cost the same amount of processing power however huge the blacklist is. And the map directives would look cleaner without those regular expression things: "~*\b000free\.us\b" → "000free.us". What do you think?
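A minimal sketch of what I mean, assuming the referer host can be extracted with a single regex up front (the variable names and the two-map layout are mine; the fragments belong in the http context):

```nginx
# Step 1: one regex extracts the host part of the referer.
map $http_referer $referer_host {
    default                        "";
    "~^https?://(?<rhost>[^/:]+)"  $rhost;
}

# Step 2: a plain-string map. nginx hashes these entries and matches
# them case-insensitively, so no per-entry regex work is needed.
map $referer_host $bad_referer {
    hostnames;
    default        0;
    .000free.us    1;   # matches 000free.us and any subdomain
}

# Usage: deny the request when the referer host is blacklisted.
server {
    listen 80;
    location / {
        if ($bad_referer) { return 403; }
    }
}
```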

Question: those non-capturing groups.

The regular expressions are for mapping purposes; everything matched will be discarded shortly. What's the point of wrapping the word boundaries in those non-capturing groups, (?:\b)? Why not "~*\b000free\.us\b"?

mitchellkrogza commented 3 years ago

The original regex I had was ~*\b000free.us\b, but if memory serves me correctly (it will be here in the issues somewhere), this approach was to prevent false positives.

issuefiler commented 3 years ago

Yeah, it is issue https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/issues/271, and the change \b → (?:\b) happened two years ago, on June 25, 2019. For this one, let's talk over there.

“Regex on referrers needs to be updated to latest format.”

mitchellkrogza commented 3 years ago

After some time switching the builds from Travis to GHA, yesterday's tests once again revealed random failures in detections using the old boundaries. I will have to test this more when I have time.

mitchellkrogza commented 3 years ago

Here is one example of why I implemented word boundaries: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/issues/441

planetahuevo commented 2 years ago

Why has this been closed? I cannot find any answer to the concerns about performance raised by @issuefiler in https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/issues/438#issuecomment-863484580. I am referring to this part:

Optimization: pre-normalization and plain string mapping, O(1). If we could take the host of the HTTP referer and trim and lowercase it beforehand, we could perform a plain string mapping in O(1), because NGINX uses hash maps for plain string mapping, and hash maps allow O(1) lookups; it would cost the same amount of processing power however huge the blacklist is. And the map directives would look cleaner without those regular expression things: "~*\b000free\.us\b" → "000free.us". What do you think?

Thanks