haraka / haraka-plugin-limit

Enforce many types of limits on a Haraka mail server.
https://www.npmjs.com/package/haraka-plugin-limit
MIT License

For frequent clients, rate_conn data in Redis grows forever #65

Closed jonmz closed 3 months ago

jonmz commented 3 months ago

Describe the bug

When enabling the rate_conn setting, the limit plugin creates/updates a Redis key named rate_conn:<IP> whose value is a hash. The hash uses timestamps as fields, with a per-timestamp counter updated by HINCRBY.

The TTL of the rate_conn:<IP> key is set/reset to twice the configured time window on every connection, e.g. 7200 seconds for a setting of 60m, and every new connection from the same IP resets the TTL.
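
For illustration, the write path presumably looks something like this (a rough sketch, not the plugin's actual code; node-redis v4 and the hIncrBy/expire calls are my assumptions):

    // Presumed write path: one hash per client IP, one field per
    // connection timestamp (assuming a node-redis v4 client `db`).
    const ts = String(Math.floor(Date.now() / 1000))
    await db.hIncrBy(`rate_conn:${ip}`, ts, 1)      // bump this second's counter
    await db.expire(`rate_conn:${ip}`, 2 * 60 * 60) // (re)set TTL to 2x the 60m window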

The hash fields, however, have no TTLs of their own, which means they only expire when the whole rate_conn:<IP> key expires.

This leads to a situation where, for a client that connects regularly within the configured time window, the hash just keeps growing. One very prominent and common case of a client that connects regularly is: monitoring!

Expected behavior

I expected outdated timestamps within the rate_conn:<IP> Redis key to be expired by some plugin logic, since Redis doesn't have a per-field TTL for hashes (AFAIK), keeping the dataset current and small.

Observed behavior

No expiration of hash fields ever happens. The hash grows forever, as long as the same IP address keeps connecting again within 7200 seconds.

In our case, our monitoring does a short SMTP check every minute, each one generating a timestamped hash field under the corresponding rate_conn:<IP> key. That is 60 new hash fields per hour, 1,440 a day, 43,200 a month, roughly 525,000 per year. After about 2.5 years, we discovered about 1,300,000 timestamped hash fields for our monitoring IP in Redis.

This explained why the Haraka processes used immense CPU and memory over time: reading 1.3 million hash fields into memory on every monitoring connection, checking 1.3 million times whether a timestamp lies within the configured 60m, and then ignoring basically all of them except the most recent 60 or so took quite a toll on our servers. The moment we stopped monitoring our Haraka ports, cluster load basically halved and Haraka memory usage dropped by about 2 GB per host.
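
Going by the loop in the patch further down, the read path that pays for all this looks roughly as follows (a sketch; fetching the whole hash with hGetAll is my assumption):

    // Per-connection read path (assuming node-redis v4): every stored field
    // is fetched and parsed, even though only those inside the 60m window
    // contribute to the final count.
    const tstamps = await db.hGetAll(`rate_conn:${ip}`)
    const periodStartTs = Math.floor(Date.now() / 1000) - 60 * 60
    let connections_in_ttl_period = 0
    for (const ts of Object.keys(tstamps)) {
      if (parseInt(ts, 10) < periodStartTs) continue // stale: ignored, but never deleted
      connections_in_ttl_period += parseInt(tstamps[ts], 10)
    }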

Steps To Reproduce

  1. Assuming an already configured Redis, configure a limit in config/limit.ini, in our case:

    [rate_conn]
    enabled=true
    default=200/60m
  2. Regularly make SMTP connections from the same IP, e.g. every minute. In our case we set up a blackbox_exporter on localhost:9115 as we're using Prometheus for monitoring, and do a request every minute:

    $ for REQ in `seq 1 1000` ; do curl "http://localhost:9115/probe?target=our.test.host:465&module=smtp_c2s_tls_v4" ; sleep 60 ; done

    Any other means of making SMTP connections should work as well, e.g. using Swaks.

  3. Check the actual Redis data, replacing 111.222.333.444 with the actual IP you use for testing:

    $ redis-cli -s /path/to/your/redis.socket -n 4 hkeys rate_conn:111.222.333.444 | wc -l

    The expectation would be to find only timestamps no older than the configured 60m, so in this case about 60 of them. In reality, you'll find timestamps of any age (an equivalent check from Node is sketched after this list).
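
The same check can also be done from Node (a sketch, assuming node-redis v4 and the same socket path and DB index as the redis-cli call above):

    import { createClient } from 'redis'

    // Assumptions: node-redis v4, same Unix socket and DB index as above.
    const db = createClient({
      socket: { path: '/path/to/your/redis.socket' },
      database: 4,
    })
    await db.connect()
    const fields = await db.hKeys('rate_conn:111.222.333.444')
    console.log(fields.length) // expected: ~60; observed: ever-growing
    await db.quit()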

System Info

Haraka: Haraka.js, Version 3.0.1
Node: v18.20.3
OS: Linux fairydust.uberspace.de 3.10.0-1160.119.1.el7.x86_64 #1 SMP Tue Jun 4 14:43:51 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
openssl: OpenSSL 1.0.2k-fips 26 Jan 2017

(It's an old CentOS 7 host which is about to be replaced soon, but I don't think the environment is relevant here anyway.)

jonmz commented 3 months ago

I'm not too familiar with JavaScript, but a basic idea could be: while doing the calculation, don't simply ignore timestamps older than the window, but actively remove them from Redis. Roughly like this (again, I'm not too familiar with JavaScript, so YMMV):

--- index.js.orig       2024-07-15 16:37:40.252588366 +0000
+++ index.js    2024-07-15 16:40:40.215693968 +0000
@@ -511,7 +511,10 @@

     let connections_in_ttl_period = 0
     for (const ts of Object.keys(tstamps)) {
-      if (parseInt(ts, 10) < periodStartTs) continue // older than ttl
+      if (parseInt(ts, 10) < periodStartTs) { // older than ttl
+        this.db.hDel(`rate_conn:${key}`, ts)
+        continue
+      }
       connections_in_ttl_period =
         connections_in_ttl_period + parseInt(tstamps[ts], 10)
     }
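
A possible refinement (untested, and assuming node-redis v4, where hDel also accepts an array of fields): collect the stale timestamps while looping and remove them with a single batched HDEL instead of one call per field.

    // Variant of the patch above: gather stale fields, then one HDEL.
    const stale = Object.keys(tstamps).filter((ts) => parseInt(ts, 10) < periodStartTs)
    if (stale.length) await this.db.hDel(`rate_conn:${key}`, stale)
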
msimerson commented 3 months ago

Good sleuthing and excellent description of the problem. Your solution looks just fine. May I look forward to a Pull Request?