rspamd / rspamd

Rapid spam filtering system.

[BUG] Condition where the bayes_expiry loop fails because the memory is full #4921

Open japc opened 5 months ago

japc commented 5 months ago

Describe the bug

We have a kind of deadlock situation, where bayes_expiry fails because Redis memory is full:

2024-04-12 08:46:37 #501(controller) <djphd3>; lua; bayes_expiry.lua:408: cannot perform expiry step: ERR Error running script (call to f_acab6c1e47b0f81a1f2e58a41611fed28f249daf): @user_script:36: @user_script: 36: -OOM command not allowed when used memory > 'maxmemory'.
2024-04-12 08:46:37 #501(controller) <bd8064>; csession; rspamd_redis_learned: cannot learn task: ERR Error running script (call to f_18c1f7e837d77f2d1d51cc491aa89365ed1c5bc6): @user_script:1: @user_script: 1: -OOM command not allowed when used memory > 'maxmemory'.
2024-04-12 08:46:37 #501(controller) <bd8064>; csession; rspamd_task_process: learn error: ERR Error running script (call to f_18c1f7e837d77f2d1d51cc491aa89365ed1c5bc6): @user_script:1: @user_script: 1: -OOM command not allowed when used memory > 'maxmemory'.

As seen in the logs, the learn process also fails (possibly without triggering any key write that would make Redis evict keys according to the policy and free some memory?).

My Redis maxmemory configuration is:

maxmemory 6GB
maxmemory-policy volatile-ttl

Steps to Reproduce

Give it some time until enough statistical tokens fill the database.

Expected behavior

Should work as described in the Bayes expiry module documentation: memory optimally full, but functional.

Versions

$ rspamd --version
Rspamd daemon version 3.9.0

CPU architecture x86_64; features: avx2, avx, sse2, sse3, ssse3, sse4.1, sse4.2, rdrand
Hyperscan enabled: TRUE
Jemalloc enabled: FALSE
LuaJIT enabled: TRUE (LuaJIT version: LuaJIT 2.1.0-beta3)
ASAN enabled: FALSE
BLAS enabled: FALSE
Fasttext enabled: FALSE

(rspamd is compiled from git 5f8ea684d, but this has always happened, for as far as I can remember)

OS: Debian 12.5 64bits

Redis server v=6.2.7 sha=00000000:0 malloc=jemalloc-5.1.0 bits=64 build=8d5678d625cf855c

Additional Information

As a workaround we temporarily increase the Redis memory and let rspamd run with a somewhat more aggressive bayes_expiry configuration for a while.

dragoangel commented 5 months ago

@japc check how many elements you have in learned_ids; that is almost the only key without a TTL set. Previously there was a bug where it wasn't rotated at all. Now it should be split into keys of 10k elements each, which look like learned_ids, learned_idsX, learned_idsXX, learned_idsXXX, etc.

japc commented 5 months ago

@dragoangel I have learned_ids, learned_idsX, ..., learned_idsXXXX with 10k elements each, and learned_idsXXXXX with 7k and growing:

127.0.0.1:6378> hlen learned_ids
(integer) 10000
127.0.0.1:6378> hlen learned_idsX
(integer) 10000
127.0.0.1:6378> hlen learned_idsXX
(integer) 10000
127.0.0.1:6378> hlen learned_idsXXX
(integer) 10000
127.0.0.1:6378> hlen learned_idsXXXX
(integer) 10000
127.0.0.1:6378> hlen learned_idsXXXXX
(integer) 7048
dragoangel commented 5 months ago

Well, that looks fine. I honestly don't understand how you get OOM command not allowed when used memory > 'maxmemory' while you had maxmemory 6GB and maxmemory-policy volatile-ttl, because in that case Redis is responsible for freeing memory and should not get into OOM. What I'm trying to say: this is not Rspamd's fault at all; the error you see is a Redis message that was returned to Rspamd, and what do you expect Rspamd to do with such an error from Redis? Answer: nothing.

I don't know how reliably you can reproduce this issue (it would be nice to know), but you need to report it to Redis instead of Rspamd 😉. As for your Redis version: before reporting any issue, I would advise you to update to the latest one, 7.2.4, and check whether the issue is still there. I assume that after moving to 7.2.4 you will stop getting such OOMs.

japc commented 5 months ago

The ball is probably in Redis's court. I was thinking of a weird interplay of low memory or OOM on Redis with the Lua script eval (following some of the considerations under "Execution under low memory conditions" here: https://redis.io/docs/latest/develop/interact/programmability/eval-intro/).

I set up a small lab, with the same versions and configuration of everything, with 40MB as the Redis memory limit, and COULD NOT replicate the problem. I first tried with 10MB and the error happened, but that could be because it was too small for proper use.

In production the issue occurred 2 or 3 times in the last couple of months. I will try to see whether it happens as soon as the memory fills, or whether it works as it should (Redis evicting and remaining usable) for a while and then enters the OOM state.

dragoangel commented 5 months ago

Well, 2 or 3 times in months is not an issue; it will run again later and succeed.

japc commented 5 months ago

That's 2 or 3 times where it started failing altogether and we had to recover with the workaround I mentioned, not 2 or 3 failed bayes_expiry steps between successes.

dragoangel commented 5 months ago

That's 2 or 3 times where it started failing altogether and we had to recover with the workaround I mentioned, not 2 or 3 failed bayes_expiry steps between successes.

Why not consider updating to 7.2?

japc commented 4 months ago

Why not consider updating to 7.2?

It is planned, but that and trying to understand why it is happening are not mutually exclusive. Especially since I'm not sure the upgrade to 7.2 will change the behaviour.

It's happening right now. It started at 04:18 for the Redis learn, and some minutes later for bayes_expiry:

2024-04-19 04:18:08 #6127(normal) <a9a7a5>; task; rspamd_redis_learned: cannot learn task: ERR Error running script (call to f_18c1f7e837d77f2d1d51cc491aa89365ed1c5bc6): @user_script:1: @user_script: 1: -OOM command not allowed when used memory > 'maxmemory'.

2024-04-19 04:22:57 #4571(controller) <kbbx9s>; lua; bayes_expiry.lua:408: cannot perform expiry step: ERR Error running script (call to f_acab6c1e47b0f81a1f2e58a41611fed28f249daf): @user_script:36: @user_script: 36: -OOM command not allowed when used memory > 'maxmemory'.

Redis seems (from the logs) to keep accepting writes until about 06:31.

Any suggestions on getting some more debug info before I fix it?

moisseev commented 4 months ago

My redis mm configuration is:

maxmemory 6GB
maxmemory-policy volatile-ttl

Are you sure your configuration is actually being loaded? Does # redis-cli -p 6378 info memory | grep maxmemory show the same policy as configured (maxmemory_policy:volatile-ttl) ?

japc commented 4 months ago

My redis mm configuration is:

maxmemory 6GB
maxmemory-policy volatile-ttl

Are you sure your configuration is actually being loaded? Does # redis-cli -p 6378 info memory | grep maxmemory show the same policy as configured (maxmemory_policy:volatile-ttl) ?

Pretty sure:

$ redis-cli -p 6378 info memory | grep memory
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
used_memory:7518842472
used_memory_human:7.00G
...
maxmemory:7516192768
maxmemory_human:7.00G
maxmemory_policy:volatile-ttl

I don't know much about how aggressive the Redis evictions are; the Redis memory does seem to go down by a few MBs from time to time. But both rspamd_redis_learned and bayes_expiry have been failing consistently for hours.

dragoangel commented 4 months ago

@japc are you using a separate Redis for Bayes, or a shared Redis that also holds other modules? If it's shared, try moving Bayes to a dedicated Redis, or use a single DB for all modules if that's not already the case; Redis eviction works strangely on multi-DB instances with volatile-ttl. And I don't understand why you don't try updating to a more recent version; Rspamd works well with 7.2. If you're worried about moving to a new version, back up the RDB and update; in any case you can revert to the older version from the RDB backup.

japc commented 4 months ago

@dragoangel

It's a dedicated instance: we have one Redis for default and history (6379), one for fuzzy (6377), one for classifier_bayes (6378), one for neural (6381) and one for reputation (6380).

As for the redis upgrade, it is scheduled, plus other considerations I've replied before.

japc commented 4 months ago

Upgraded redis to 7.2.4:

127.0.0.1:6378> info server
# Server
redis_version:7.2.4

After the Redis restart the memory went down a bit, but soon maxed out again:

127.0.0.1:6378> info memory
# Memory
used_memory:7516561904
used_memory_human:7.00G
...
maxmemory:7516192768
maxmemory_human:7.00G
maxmemory_policy:volatile-ttl

To the same effect: both learn and expiry are failing with OOM:

2024-04-20 00:05:58 #4571(controller) <kbbx9s>; lua; bayes_expiry.lua:408: cannot perform expiry step: OOM command not allowed when used memory > 'maxmemory'. script: acab6c1e47b0f81a1f2e58a41611fed28f249daf, on @user_script:36.
2024-04-20 00:05:58 #8991(normal) <e4f656>; task; rspamd_redis_learned: cannot learn task: OOM command not allowed when used memory > 'maxmemory'. script: 18c1f7e837d77f2d1d51cc491aa89365ed1c5bc6, on @user_script:1.
japc commented 4 months ago

This explains a bit:

127.0.0.1:6378> info keyspace
# Keyspace
db0:keys=69893012,expires=0,avg_ttl=0

My theory (tell me if it makes any sense) is that, with autolearn enabled, in our scenario the token creation rate exceeds the expiry rate, to the point where we end up with all keys having no TTL configured and Redis unable to free memory.

A simple info keyspace; sleep 60; info keyspace shows a large increase in keys vs. expires, which seems to support that theory:

db0:keys=69933904,expires=40030,avg_ttl=863517322
db0:keys=69946176,expires=42869,avg_ttl=863452830

If that's the case, having some kind of expire on cache_learn would help.
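The keys-vs-expires gap can be watched continuously with a small loop. This is only a sketch: the port 6378 and the db0 line format match the output shown above, and parse_keyspace is a made-up helper name:

```shell
# Print "keys=<n> expires=<n>" for db0 from an INFO keyspace dump.
# A steadily growing gap between the two numbers means tokens are
# being created faster than TTLs are set on them.
parse_keyspace() {
  awk -F'[:,=]' '/^db0/ {printf "keys=%s expires=%s\n", $3, $5}'
}

# Against a live instance (uncomment on a real host):
# while true; do redis-cli -p 6378 info keyspace | parse_keyspace; sleep 60; done

# Demo on a captured INFO line:
printf 'db0:keys=69933904,expires=40030,avg_ttl=863517322\n' | parse_keyspace
```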

moisseev commented 4 months ago

db0:keys=69893012,expires=0,avg_ttl=0

avg_ttl = 0 suggests none of the keys has a TTL set. In this case volatile-ttl behaves like noeviction. Check whether the Bayes expiry module is configured and working.
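This mechanism can be illustrated with a toy model (plain Python, no Redis involved; all names are made up for illustration): volatile-ttl only ever evicts keys that carry a TTL, so once the store is full of TTL-less keys, every further write is refused, which matches the -OOM replies in the logs.

```python
def write(store, maxkeys, key, ttl=None):
    """Toy volatile-ttl model: on a full store, evict the volatile key
    with the lowest TTL; if no key has a TTL, refuse the write
    (Redis would answer -OOM here)."""
    if key not in store and len(store) >= maxkeys:
        volatile = [k for k, t in store.items() if t is not None]
        if not volatile:
            return "OOM command not allowed"   # nothing is evictable
        store.pop(min(volatile, key=lambda k: store[k]))
    store[key] = ttl
    return "OK"

store = {}
for i in range(10):                   # fill with TTL-less tokens
    write(store, maxkeys=10, key=f"tok{i}")
print(write(store, 10, "tok10"))      # -> OOM: no volatile keys to evict
store["tok0"] = 100                   # give one key a TTL
print(write(store, 10, "tok10"))      # -> OK: volatile-ttl can now evict tok0
```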

It should log messages like this about every minute:

2024-04-19 23:08:50 #16243(controller) <b5o8ft>; lua; bayes_expiry.lua:443: finished expiry step 754: 1003 items checked, 161 significant (5 made persistent), 25 insignificant (1 ttls set), 0 common (0 discriminated), 817 infrequent (2 ttls set), 2 mean, 9 std

And statistics when a full cycle is done (depends on the database size; about a week):

2024-04-19 04:15:01 #82966(controller) <b5o8ft>; lua; bayes_expiry.lua:443: finished expiry cycle in 4182 steps: 4183474 items checked, 700674 significant (19775 made persistent), 77716 insignificant (31 ttls set), 670 common (126 discriminated), 3404414 infrequent (23052 ttls set), 3 mean, 20 std
japc commented 4 months ago

@moisseev:

The bayes expiry module is configured and working.

As I said, it just doesn't seem to set TTLs fast enough for the new tokens being added (without a TTL), and the DB ends up full, with no keys eligible for eviction and no further writes possible (either new keys or expires on existing ones).

dragoangel commented 4 months ago

@moisseev:

The bayes expiry module is configured and working.

As I said, it just doesn't seem to set TTLs fast enough for the new tokens being added (without a TTL), and the DB ends up full, with no keys eligible for eviction and no further writes possible (either new keys or expires on existing ones).

What TTL did you set for Bayes records? This doesn't make any sense, as @moisseev said. I have exactly the same environment and can't reproduce your issue. Please show your statistic.conf; I assume you set the TTL in the config to 0, which should not be used 🤣

japc commented 4 months ago

@dragoangel

Let me help you visualize. I've increased maxmemory to a bit over the used memory. This is the output of info keyspace once every 60s:

     max_memory      used_memory        keys     expires
     7090000000       7018502016    70069493        1889
     7090000000       7032416960    70081869        6529
     7090000000       7025780096    70076159        3755
     7090000000       7037167640    70085911        8378
     7090000000       7044163360    70089336       12138
     7090000000       7050636544    70094485       13993
     7090000000       7056466136    70099361       15880
     7090000000       7061918592    70104532       18665
     7090000000       7068456624    70108818       20519
     7090000000       7074640480    70112359       24223
     7090000000       7080812968    70118271       27019
     7090000000       7086820608    70127044       30800
     7090000000       7090074792    70098710           0
     7090000000       7090231216    70099355           0
     7090000000       7090353144    70099590           0
     7090000000       7090703600    70099678           0
     7090000000       7090311968    70099756           0

You can see the used_memory increasing along with the keys. The expiry module kicks in and sets TTLs (1000 per minute, per frontend), but the keys increase too fast, to the point where Redis memory gets full; it frees what it can, more ttl=0 keys are added, and it stops being able to do anything (except reads).

My local.d/classifier-bayes.conf is

backend = "redis";
write_servers = "ip1:6378";
read_servers = "random:ip1:6378,ip2:6378,ip2:6378,ip4:6378";
password = "***********";
timeout = 20s;
new_schema = true;
autolearn = [1, 12 ];
lazy = true;
expire = 864000;
dragoangel commented 4 months ago

Use statistic.conf instead of classifier-bayes.conf; that config was deprecated a long time ago. Delete it and read the how-tos on statistic.conf. New keys should be set with a TTL by default, not without one, AFAIK. Moreover, random is not the best option for selecting read replicas; remove that directive, as well as the 20s timeout. Also, just curious: how can you use a password without a user? And since you are using volatile-ttl, an expire of 10d doesn't make much sense; you could set it to 1y, since keys will be evicted lowest-TTL-first before expiration anyway.

P.s.: the only keys that should exist in your server at any point in time are learned_ids (and the X variants of it), that's all. In short, your configuration is wrong, and that is leading to all these issues.

vstakhov commented 4 months ago

What do you expect if you learn faster than expire can recycle?

dragoangel commented 4 months ago

What do you expect if you learn faster than expire can recycle?

But statistics key TTLs are set right at the moment of their creation, and with the volatile-ttl policy it should work just fine no matter how quickly it writes, no?

japc commented 4 months ago

But statistics key TTLs are set right at the moment of their creation, and with the volatile-ttl policy it should work just fine no matter how quickly it writes, no?

It doesn't seem to, no. Judging from the behaviour, and from lualib/redis_scripts/bayes_learn.lua where, if I'm not mistaken, the token msgpack gets expanded and the keys are written with HSET.

What do you expect if you learn faster than expire can recycle?

So the only solution is to either reduce the learn rate or to increase the redis memory? Can't the keys be set with a (very very) long expire to prevent the problem?

moisseev commented 4 months ago

So the only solution is to either reduce the learn rate or to increase the redis memory? Can't the keys be set with a (very very) long expire to prevent the problem?

It will break the expiry module's logic. You are suggesting allowing Redis to evict not-yet-categorized tokens, which will lead to statistics degradation.

You can decrease interval or increase count to speed up the Bayes expiry module:

interval = 60, -- one iteration step per minute
count = 1000, -- check up to 1000 keys on each iteration
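For example, a sketch of such tuning (the values are illustrative, not recommendations; note that a later comment in this thread points out that loading these settings from local.d/bayes_expiry.conf required a separate pull request):

```
# local.d/bayes_expiry.conf -- illustrative values, not recommendations
interval = 30; # run an expiry step every 30 seconds instead of 60
count = 5000;  # check up to 5000 keys per step instead of 1000
```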

vstakhov commented 4 months ago

I totally agree with @moisseev here: this will likely break the whole learning procedure if we allow Redis to expire random (potentially significant) tokens. If we want some memory-footprint reduction, we might introduce a unigrams mode (instead of the bigrams used currently). That would lead to databases at least 5 times smaller, but would potentially degrade the Bayes mode to a naive one. Another option is to reduce the window size (e.g. use only 2 joint bigrams instead of 5). All the options are possible and not very hard to implement, but they will require a full stat reset.
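A rough illustration of the size difference between the two modes (a toy count model, not Rspamd's actual tokenizer; the words and the window value are made up):

```python
# A toy count model, not Rspamd's actual tokenizer.
words = ["cheap", "meds", "online", "buy", "now", "limited", "offer"]

# Unigram mode: one token per distinct word.
unigrams = set(words)

# Bigram mode with a window of 5: each word is paired with up to the
# next 4 words, so the token count grows roughly (window - 1)-fold.
window = 5
bigrams = {(words[i], words[j])
           for i in range(len(words))
           for j in range(i + 1, min(i + window, len(words)))}

print(len(unigrams), len(bigrams))  # 7 unigrams vs 18 bigrams here
```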

It's actually a discussion point here (@moisseev + @fatalbanana to highlight).

moisseev commented 4 months ago

Use statistic.conf instead of classifier-bayes.conf; that config was deprecated a long time ago.

@dragoangel Are you sure? Maybe I missed something, but I believe for most setups local.d/classifier-bayes.conf is a better place, as it is included inside the classifier "bayes" {} section.

dragoangel commented 4 months ago

Use statistic.conf instead of classifier-bayes.conf; that config was deprecated a long time ago.

@dragoangel Are you sure? Maybe I missed something, but I believe for most setups local.d/classifier-bayes.conf is a better place, as it is included inside the classifier "bayes" {} section.

Well I read that from the comments in config and docs, @moisseev

moisseev commented 4 months ago

Well I read that from the comments in config and docs, @moisseev

Probably we should correct that. @dragoangel Can you point me to the exact places? I can't find it.

japc commented 4 months ago

It will break the expiry module's logic. You are suggesting allowing Redis to evict not-yet-categorized tokens, which will lead to statistics degradation.

I get it. It was just under the assumption that a just-seen token could be categorized as "infrequent". But that could lead to it being evicted too soon, and to the degradation you point out.

You can decrease interval or increase count to speed up Bayes expiry module.

Been doing that. That's what I meant by "a bit more aggressive bayes_expiry" in the first post.

Will check what we can do to control the autolearn rate (we already learn from user reports and spamtraps, so we can live with that).

dragoangel commented 4 months ago

Well I read that from the comments in config and docs, @moisseev

Probably we should correct that. @dragoangel Can you point me to the exact places? I can't find it.

@moisseev https://github.com/rspamd/rspamd/blob/master/conf/statistic.conf#L11-L14 Well, it's phrased a bit differently from how it was stored in my head 😅, but the main point I remembered was not to use it, and that everything that can be set in classifier-bayes.conf can also be set in statistic.conf, e.g. for multiple Bayes classifiers.

dragoangel commented 4 months ago

Btw, @japc, how many replicas of rspamd normal/controller do you have?

japc commented 4 months ago

Btw, @japc, how many replicas of rspamd normal/controller do you have?

2 controller processes, 20 normal processes.

japc commented 4 months ago

Would increasing the controller process count improve the bayes categorization?

dragoangel commented 4 months ago

Btw, @japc, how many replicas of rspamd normal/controller do you have?

2 controller processes, 20 normal processes.

Just to clarify, what do you mean by process? One instance with 2 controller processes and a limit of 20 normal child processes, or is this a count of dedicated servers (containers/pods/etc.)?

japc commented 4 months ago

Would increasing the controller process count improve the bayes categorization?

From the logs it seems to always run on the same controller, so I guess not.

Just to clarify, what do you mean by process? One instance with 2 controller processes and a limit of 20 normal child processes, or is this a count of dedicated servers (containers/pods/etc.)?

5 rspamd servers, each multipurpose, with 2 controller + 20 normal processes each.

dragoangel commented 4 months ago

Would increasing the controller process count improve the bayes categorization?

From the logs it seems to always run on the same controller, so I guess not.

Just to clarify, what do you mean by process? One instance with 2 controller processes and a limit of 20 normal child processes, or is this a count of dedicated servers (containers/pods/etc.)?

5 rspamd servers, each multipurpose, with 2 controller + 20 normal processes each.

Yeah, from the code there is a lock based on hostname, and only the rspamd instance with the matching hostname will run it. Try what @moisseev proposed: play with interval/count. I think to start you could try increasing count 5x, and then see how it goes.

japc commented 4 months ago

Use statistic.conf instead of classifier-bayes.conf; that config was deprecated a long time ago. Delete it and read the how-tos on statistic.conf.

That's not what I read from the documentation, or from the includes in statistic.conf.

Also just curious how you can use password without user?

With requirepass, not ACLs.

Also as you using violate-ttl expire of 10d doesn't have much sense, you can put it to 1y, as anyway keys would be evicted by lowest ttl before expiration.

I understand the rationale for having a big number there, when things function properly, but, even more extreme, I usually shrank it down to a couple of hours to make Redis usable (as part of the "a bit more aggressive bayes_expiry" I mentioned).

P.s. the only key should be at any time frame existing in your server is learned_ids (and X) of them, that's all. In short your configuration is wrong, leading to all this issues.

Don't think that's the case.

dragoangel commented 4 months ago

Don't think that's the case.

Yes, records are first created without a TTL, then processed, and only after that do they get their TTL. I was wrong on that part.

dragoangel commented 4 months ago

@japc do you still have issues?

I updated to Rspamd 3.9.0 (meaning current master, not yet released, built from source) and can say that I hit a situation where bayes expiry wasn't working through the Bayes DB quickly enough, so new keys were added much faster than existing ones got TTLs set. In my case I set the bayes expiry count to 15000, meaning 15k keys are processed per interval, and with that configuration everything works.

Do you have monitoring of your Bayes Redis DBs, e.g. Prometheus with Grafana? It could simplify your understanding of what is going on.

P.s.: To apply the mentioned changes I had to add https://github.com/rspamd/rspamd/pull/4972 to my rspamd, to load the count and interval settings from local.d/bayes_expiry.conf.

japc commented 3 months ago

@japc do you still have issues? ... In my case I set the bayes expiry count to 15000, meaning 15k keys are processed per interval, and with that configuration everything works.

As said before, that has always been my workaround, although I used to restore those values to the defaults after Redis memory was OK, and would then end up with an unusable Redis again after a while. I have since considered them a definitive solution and kept a smaller interval and larger count, and things have been under control ever since.

Having those as configurable settings is great. Thank you.

dragoangel commented 2 months ago

Faced the same thing in the neural module :(

2024-06-27 10:19:07 #102(controller) <jmnk3o>; lua; neural.lua:393: cannot call lock script for ANN rn_LONG_default_ujg44y7t_0 from redis: OOM command not allowed when used memory > 'maxmemory'. script: 2a40f4606d23e3d080b58a7a86d06dc0f6705056, on @user_script:1

But in the neural module all keys are non-expiring, so it's explainable.