Closed: issuefiler closed this issue 2 years ago
Zero impact on performance, thousands of users without issue.
@mitchellkrogza Great. And thank you for the great blacklist. I’m using it.
By the way, I’m curious, what makes thousands of string comparisons computationally cheap (“zero impact on performance”)?
The entire globalblacklist.conf is only 510 KB. It is loaded into Nginx's memory once and stays resident there all the time, which is what makes it so damn fast. And that is the size of the list at present, despite all the additions over the years since I started it.
But still, it performs thousands of matches against regular expressions loaded in memory, per request. It's not a matter of memory usage, or of whether the expressions are compiled and resident in memory. I don't know much about NGINX's internals, but as far as I know, NGINX's `map` works like an O(1) hash map only for plain strings; for regular expressions it just does sequential O(n) matching against every expression, with no hashes and no trees. If that's the case…
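A minimal sketch of the difference I mean (the entries here are hypothetical, not from the project's files):

```nginx
map $http_user_agent $blocked {
    default         0;

    # Plain strings go into a hash table: one O(1) lookup,
    # no matter how many such entries the map holds.
    "BadBot/1.0"    1;

    # Regular expressions are tried one by one, in order,
    # so the cost grows linearly with the number of regexes.
    "~*badcrawler"  1;
    "~*evilbot"     1;
}
```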
From the nginx mailing list, on the performance of `map` directives:

> On Sat, Sep 03, 2011 at 05:11:07PM +0300, Calin Don wrote:
>
> > Hi,
> > I understood that if you have only strings in a `map` directive the access time is O(1).
>
> Yes.
>
> > What if you have the `hostnames` directive enabled and some of the items are in the format `*.example.com`?
>
> nginx tests an address by its parts, first "com", then "example". So the cost of the operation varies from O(1) to O(N), where N is the number of parts of the longest name. For example, if you test `example.net` only against
>
> ```
> *.example.com
> *.sub.domain.com
> ```
>
> then this will be O(1). If you test `www.sub.domain.com`, it will be O(3).
>
> > What about if you mix with regular expressions?
>
> The regular expressions are tested sequentially.
>
> nginx tests `map` in the following order:
>
> - exact names,
> - `*.names`,
> - `names.*`,
> - regexes.
>
> The first match stops the testing.
>
> -- Igor Sysoev
>
> nginx mailing list nginx@nginx.org http://mailman.nginx.org/mailman/listinfo/nginx
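To make that concrete, here is a hedged sketch of a `map` using the `hostnames` parameter (the example names are mine, not Igor's):

```nginx
map $host $label {
    hostnames;
    default           other;

    example.com       exact;     # exact name: a single hash lookup
    *.example.com     wildcard;  # tested by parts: "com", then "example"
    *.sub.domain.com  deep;      # "com", "domain", "sub": up to O(3)
}
```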
Considering the thousands of users without reported performance issues, it's a good thing that processors nowadays are faster than I thought. Still, what worries me is this:

Several milliseconds is too brief an interval to notice, but that's not how you measure the performance of software, because code can run many, many, many times within a moment. Imagine `malloc()` taking that long.

But my application wouldn't be able to live without this. So I want to talk about optimization. Hopefully I'm not being paranoid.
This blocker performs thousands, if not millions, of matches per second against NGINX regular expressions like `"~*(?:\b)000free\.us(?:\b)"`; that is, case conversion is performed on every match (I'm not sure NGINX is smart enough to group all the `~*` expressions and convert just once), looking for specific substrings, matching in O(n).
If we could take the host of the HTTP referrer and trim and lowercase it beforehand, we could perform a plain-string `map`ping in O(1), because NGINX uses hash maps for plain-string mapping, and hash maps allow O(1) lookups; meaning it would cost the same amount of processing power however huge the blacklist is. And the `map` directives would look cleaner without those regular-expression trappings: `"~*\b000free\.us\b"` → `"000free.us"`. What do you think?
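A sketch of the idea (the variable names and the host-extraction regex are my assumptions, not the project's actual config): extract the referrer's host once with a single regex, then look it up in a plain-string `map` with `hostnames` enabled.

```nginx
# One regex per request to pull the host out of the referrer URL.
map $http_referer $referer_host {
    default                        "";
    "~^https?://(?<rhost>[^/:]+)"  $rhost;
}

# Then a plain-string lookup: hashed, effectively O(1)
# regardless of how many blacklist entries there are.
map $referer_host $bad_referer {
    hostnames;
    default       0;
    .000free.us   1;  # matches 000free.us and any subdomain of it
}
```

With this layout, the per-request regex cost is constant (one expression) instead of growing with the size of the blacklist.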
The regular expressions are for mapping purposes; everything matched will be discarded shortly. What's the point of those non-capturing groups around empty word boundaries, `(?:\b)`? Why not simply `"~*\b000free\.us\b"`?
The original regex I had was `~*\b000free.us\b`, but if memory serves me correctly (it will be here in the issues somewhere), this approach was taken to prevent any false positives.
Yeah, it is issue https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/issues/271, and the change `\b` → `(?:\b)` happened two years ago, on June 25, 2019. For that one, let's talk over there.
Regex on referrers needs to be updated to latest format.
Some time after switching the builds from Travis to GHA, yesterday's tests once again revealed random failures in detections using the old boundaries. I will have to test this more when I have time.
Here is one example why I implemented word boundaries https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/issues/441
Why has this been closed? I cannot find any answer to @issuefiler's performance concerns in https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/issues/438#issuecomment-863484580. I am referring to this part:

> Optimization: pre-normalization and plain string mapping, O(1). If we could take the host of the HTTP referer, trim and en-lower-case it preliminarily, we could perform a plain string mapping of O(1), because NGINX uses hash maps for plain string mapping and hash maps allow O(1) lookups; meaning it'll cost the same amount of processing power, however huge the blacklist is. And the map directives will look cleaner without those regular expression things: "~*\b000free.us\b" → "000free.us". What do you think?
Thanks
Among those unnecessarily complicated files for a supposedly simple bot blocker, it seems the user agents of dumb bots involved in DDoS attacks (`go-http-client`, `axios`, `guzzlehttp`, `puppeteer`) are enough for me to build my blacklist.
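For what it's worth, a small blacklist along those lines could be a single `map` (my sketch, using the handful of agents named above):

```nginx
map $http_user_agent $deny_bot {
    default             0;
    "~*go-http-client"  1;
    "~*axios"           1;
    "~*guzzlehttp"      1;
    "~*puppeteer"       1;
}

# Then, inside a server or location block:
# if ($deny_bot) { return 444; }
```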
Anyway, the question is: that is a huge number of string comparisons; wouldn't it degrade NGINX's performance?