molnaromatic opened this issue 1 year ago
Seconded. I got these stats from redis just before it got killed again; the scan of a /21 was interrupted at 80%.
(left column: 18:00, right column: 18:10; oom-kill at 18:11)
db0:keys=1,expires=0,avg_ttl=0 | db0:keys=1,expires=0,avg_ttl=0
db1:keys=47701,expires=0,avg_ttl=0 | db1:keys=47701,expires=0,avg_ttl=0
db2:keys=177456,expires=0,avg_ttl=0 | db2:keys=177456,expires=0,avg_ttl=0
db3:keys=6,expires=0,avg_ttl=0 | db3:keys=5,expires=0,avg_ttl=0
db4:keys=100,expires=0,avg_ttl=0 | db4:keys=99,expires=0,avg_ttl=0
db5:keys=17,expires=0,avg_ttl=0 | db5:keys=17,expires=0,avg_ttl=0
db6:keys=17,expires=0,avg_ttl=0 | db6:keys=17,expires=0,avg_ttl=0
db7:keys=456,expires=0,avg_ttl=0 | db7:keys=99,expires=0,avg_ttl=0
db8:keys=99,expires=0,avg_ttl=0 | db8:keys=17,expires=0,avg_ttl=0
db9:keys=10344,expires=0,avg_ttl=0 | db9:keys=14963,expires=0,avg_ttl=0
db10:keys=5726,expires=0,avg_ttl=0 | --------------------------------------------
db11:keys=17,expires=0,avg_ttl=0 | db11:keys=17,expires=0,avg_ttl=0
db12:keys=99,expires=0,avg_ttl=0 | db12:keys=17,expires=0,avg_ttl=0
db13:keys=10342,expires=0,avg_ttl=0 | db13:keys=15038,expires=0,avg_ttl=0
db14:keys=98,expires=0,avg_ttl=0 | db14:keys=5501,expires=0,avg_ttl=0
db15:keys=75,expires=0,avg_ttl=0 | db15:keys=203,expires=0,avg_ttl=0
db16:keys=497,expires=0,avg_ttl=0 | db16:keys=5682,expires=0,avg_ttl=0
---------------------------------------------------------| db17:keys=99,expires=0,avg_ttl=0
db18:keys=99,expires=0,avg_ttl=0 | db18:keys=99,expires=0,avg_ttl=0
db19:keys=5657,expires=0,avg_ttl=0 | db19:keys=8716,expires=0,avg_ttl=0
db20:keys=479,expires=0,avg_ttl=0 | db20:keys=6856,expires=0,avg_ttl=0
db21:keys=17,expires=0,avg_ttl=0 | db21:keys=17,expires=0,avg_ttl=0
db22:keys=17,expires=0,avg_ttl=0 | db22:keys=17,expires=0,avg_ttl=0
db23:keys=202,expires=0,avg_ttl=0 | db23:keys=99,expires=0,avg_ttl=0
db24:keys=17,expires=0,avg_ttl=0 | db24:keys=17,expires=0,avg_ttl=0
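A minimal sketch of how such before/after snapshots could be collected automatically with redis-py (the unix socket path is the one used by the community containers elsewhere in this thread; untested):

# snapshot_keyspace.py - periodically append keyspace/memory stats to a file,
# so the last snapshot before an OOM kill survives on disk.
import time

import redis

r = redis.Redis(unix_socket_path="/run/redis/redis.sock")

while True:
    keyspace = r.info("keyspace")   # {'db0': {'keys': 1, 'expires': 0, ...}, ...}
    memory = r.info("memory")
    line = time.strftime("%H:%M:%S")
    line += " used_memory=" + memory["used_memory_human"]
    line += " " + " ".join(
        "%s:keys=%d" % (db, v["keys"])
        for db, v in sorted(keyspace.items(), key=lambda kv: int(kv[0][2:]))
    )
    with open("/tmp/redis-keyspace.log", "a") as fp:
        fp.write(line + "\n")
    time.sleep(60)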
Okay. I've been looking at it some more, and I saw a lot of memory usage even though redis-server had been killed (and restarted) only a moment ago.
I dropped the db that seemed to be transient:
-server_time_usec:1709405329080351
-uptime_in_seconds:588
+server_time_usec:1709406156005681
+uptime_in_seconds:1415
...
# Memory
-used_memory:3344099096
-used_memory_human:3.11G
-used_memory_rss:3001507840
-used_memory_rss_human:2.80G
+used_memory:171441592
+used_memory_human:163.50M
+used_memory_rss:194514944
+used_memory_rss_human:185.50M
...
# Keyspace
db0:keys=1,expires=0,avg_ttl=0
db1:keys=177456,expires=0,avg_ttl=0
-db6:keys=3468,expires=0,avg_ttl=0
With just ~3,400 entries, db6 was apparently accounting for nearly all of those ~3 GB.
The entries consisted of:
$ sudo docker exec -it greenbone-community-container_redis-server_1 redis-cli -s /run/redis/redis.sock -n 6
redis /run/redis/redis.sock[6]> keys *
1) "Cache/node.example.com/8200/excluding_404_body/URL_/ui/vcav-bootstrap/rest/vcav-providers/config.neon"
2) "Cache/node.example.com/8200/excluding_404_body/URL_/ui/vcav-bootstrap/rest/WEB-INF/local.properties"
3) "Cache/node.example.com/8200/excluding_404_body/URL_/ui/vropspluginui/rest/services/.env.example"
4) "Cache/node.example.com/8200/excluding_404_body/URL_/ui/vcav-bootstrap/rest/vcav-providers/local.properties"
...
3466) "Cache/node.example.com/8200/excluding_404_body/URL_/ui/psc-ui/resources/core/config/databases.yml"
3467) "Cache/node.example.com/8200/excluding_404_body/URL_/ui/vcav-bootstrap/rest/vcav-providers/example.key"
3468) "Cache/node.example.com/8200/excluding_404_body/URL_/ui/h5-vsan/rest/proxy/service/phinx.yml"
Not sure what was stored in those values, but they were clearly big enough.
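In hindsight, the per-key cost could have been checked with MEMORY USAGE instead of guessing. A quick sketch (socket path and db number taken from the redis-cli call above; untested):

# biggest_keys.py - list the largest keys in db 6 by approximate memory
# footprint, using SCAN + MEMORY USAGE instead of KEYS *.
import redis

r = redis.Redis(unix_socket_path="/run/redis/redis.sock", db=6)

sizes = []
for key in r.scan_iter(count=1000):
    sizes.append((r.memory_usage(key) or 0, key))

for nbytes, key in sorted(sizes, reverse=True)[:20]:
    print("%10d  %s" % (nbytes, key.decode(errors="replace")))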
Moved this issue to the ospd-openvas repository to let the @greenbone/scanner-maintainers have a look at it.
Could be a duplicate of greenbone/openvas-scanner#1488 and might not even be an issue in ospd-openvas.
More reading:
Okay. I increased memory (and CPU) on the box, and now there is no excessive redis key usage anymore.
By increasing mem, I probably implemented fix no. 1 suggested at https://forum.greenbone.net/t/oom-killing-redis-on-large-scan-with-openvas/14251/2 -- "Prevent overloading the system [...]"
I guess there is still something buggy, but I have no time to allocate to look into this deeper at the moment. Thanks for the suggested links!
In the meantime I've gotten a bunch of OOMs again.
Here's a sheet of memory usage / keyspace stats with potential culprits highlighted:
It would be nice if I could see in the logs somewhere what the db<N> was used for at that point in time.
In the meantime I tried adding some limits, and now I'm running into this when doing a single scan:
redis-server:
  image: greenbone/redis-server
  command:
    # https://forum.greenbone.net/t/redis-oom-killed-on-one-host-scan/15722/5
    - /bin/sh
    - -c
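    # maxmemory 12884901888 bytes = 12 GiB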
    - 'rm -f /run/redis/redis.sock && cat /etc/redis/redis.conf >/run/redis/redis.conf && printf "%s\n" "maxmemory 12884901888" "maxmemory-policy volatile-ttl" "maxclients 150" "tcp-keepalive 15" >>/run/redis/redis.conf && redis-server /run/redis/redis.conf'
  logging:
    driver: journald
  restart: on-failure
  volumes:
    - redis_socket_vol:/run/redis/
OSPD[7] 2024-03-18 11:10:23,909: ERROR: (ospd.ospd) 570186d8-8de4-4609-a13b-0ad28f639a11: Exception Command # 1 (LRANGE internal/results 0 -1) of pipeline caused error: command not allowed when used memory > 'maxmemory'. while scanning
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/redis/client.py", line 1348, in _execute_transaction
response = self.parse_response(connection, "_")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/redis/client.py", line 1416, in parse_response
result = Redis.parse_response(self, connection, command_name, **options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/redis/client.py", line 553, in parse_response
response = connection.read_response()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/redis/connection.py", line 524, in read_response
raise response
redis.exceptions.ExecAbortError: Transaction discarded because of previous errors.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/ospd/ospd.py", line 584, in start_scan
self.exec_scan(scan_id)
File "/usr/local/lib/python3.11/dist-packages/ospd_openvas/daemon.py", line 1179, in exec_scan
got_results = self.report_openvas_results(kbdb, scan_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ospd_openvas/daemon.py", line 800, in report_openvas_results
all_results = db.get_result()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ospd_openvas/db.py", line 503, in get_result
return self._pop_list_items("internal/results")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ospd_openvas/db.py", line 487, in _pop_list_items
return OpenvasDB.pop_list_items(self.ctx, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ospd_openvas/db.py", line 201, in pop_list_items
results, redis_return_code = pipe.execute()
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/redis/client.py", line 1478, in execute
return conn.retry.call_with_retry(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/redis/retry.py", line 46, in call_with_retry
return do()
^^^^
File "/usr/local/lib/python3.11/dist-packages/redis/client.py", line 1479, in <lambda>
lambda: execute(conn, stack, raise_on_error),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/redis/client.py", line 1351, in _execute_transaction
raise errors[0][1]
File "/usr/local/lib/python3.11/dist-packages/redis/client.py", line 1341, in _execute_transaction
self.parse_response(connection, "_")
File "/usr/local/lib/python3.11/dist-packages/redis/client.py", line 1416, in parse_response
result = Redis.parse_response(self, connection, command_name, **options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/redis/client.py", line 553, in parse_response
response = connection.read_response()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/redis/connection.py", line 524, in read_response
raise response
redis.exceptions.OutOfMemoryError: Command # 1 (LRANGE internal/results 0 -1) of pipeline caused error: command not allowed when used memory > 'maxmemory'.
OSPD[7] 2024-03-18 11:10:23,913: INFO: (ospd.ospd) 570186d8-8de4-4609-a13b-0ad28f639a11: Host scan got interrupted. Progress: 25, Status: RUNNING
:thinking:
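A note on the OOM error above: judging by the two values unpacked from pipe.execute() in the traceback, pop_list_items() runs the LRANGE together with a second, destructive command in one MULTI/EXEC pipeline. Once used_memory is above maxmemory (and volatile-ttl finds no keys with a TTL to evict, see the expires=0 columns earlier), Redis refuses the transaction as a whole, so not even reading the queued results out works anymore. A sketch of that pattern, not the actual ospd_openvas code (the second command is a guess):

# Sketch of the LRANGE-plus-clear pipeline pattern visible in the traceback.
import redis

def pop_list_items(ctx: redis.Redis, name: str) -> list:
    pipe = ctx.pipeline()        # MULTI/EXEC transaction by default
    pipe.lrange(name, 0, -1)     # read all queued results...
    pipe.delete(name)            # ...and clear the list (a write)
    # When used_memory > maxmemory and nothing is evictable, Redis discards
    # the whole transaction, so even the read half never returns.
    results, _deleted = pipe.execute()
    return results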
Okay. It turned out the problem was indeed the "caching of web pages during CGI scanning" mentioned in oom-killing-redis-on-large-scan-with-openvas.
A HashiCorp Vault server was yielding responses of about 770 kB for lots and lots of scanned URLs, like /ui/vropspluginui/rest/services/getstatus/news.php (basically everything after /ui/). With 28,000+ of those results (28,000 × ~770 kB is on the order of 20 GB), redis would quickly fill up.
I did the following mini-patch:
--- greenbone-community-container_vt_data_vol/_data/http_keepalive.inc.orig 2024-03-18 15:46:31.480951508 +0100
+++ greenbone-community-container_vt_data_vol/_data/http_keepalive.inc 2024-03-18 15:52:51.764904305 +0100
@@ -726,7 +726,8 @@ function http_get_cache( port, item, hos
# Internal Server Errors (5xx)
# Too Many Requests (429)
# Request Timeout (408)
- if( res !~ "^HTTP/1\.[01] (5(0[0-9]|1[01])|4(08|29))" )
+ # Size of response must be less than 1.5*64k
+ if( res !~ "^HTTP/1\.[01] (5(0[0-9]|1[01])|4(08|29))" && strlen( res ) < 98304 )
replace_kb_item( name:"Cache/" + host + "/" + port + "/" + key + "/URL_" + item, value:res );
}
And now I finally got past that server without things breaking.
Ok. That looks good. Of course it would be better if we could make that strlen( res ) < 98304 a bit more dynamic: responses a few hundred kB in size should not be a problem on their own, but they are once we have >1,000 keys cached or they're already taking up several GB.
I did not see an easy way to query redis for the total size in a keyspace. But maybe the kbdb [1] itself could keep track of key count and value size. Thinking out loud...
[1]
It doesn't look like there is anything available for that yet.
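As far as I know there is indeed no single command for "total bytes used by one keyspace", but summing MEMORY USAGE over a SCAN gives a (slow) approximation, for what it's worth (untested sketch, socket path as above):

# keyspace_sizes.py - approximate per-db memory by summing MEMORY USAGE
# over every key in each db that INFO keyspace reports.
import redis

SOCK = "/run/redis/redis.sock"
base = redis.Redis(unix_socket_path=SOCK)
for dbname, stats in base.info("keyspace").items():
    r = redis.Redis(unix_socket_path=SOCK, db=int(dbname[2:]))
    total = sum(r.memory_usage(k) or 0 for k in r.scan_iter(count=1000))
    print("%s: keys=%d approx_bytes=%d" % (dbname, stats["keys"], total))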
Alternatively, set a TTL on these values so that they can be purged with volatile-ttl. Although that probably won't be sufficient, seeing that the LRANGE call failed with ENOMEM even when allkeys-lru was enabled.
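For completeness, the TTL could even be bolted on from outside the VTs, e.g. a periodic job that EXPIREs the Cache/* keys so volatile-ttl at least has eviction candidates; whether eviction then kicks in early enough is exactly the open question above. A rough, untested sketch:

# expire_cache_keys.py - give the CGI-scanning cache entries a TTL so that a
# volatile-* maxmemory policy has something it is allowed to evict.
import redis

SOCK = "/run/redis/redis.sock"
base = redis.Redis(unix_socket_path=SOCK)
for dbname in base.info("keyspace"):                 # e.g. 'db6'
    r = redis.Redis(unix_socket_path=SOCK, db=int(dbname[2:]))
    for key in r.scan_iter(match="Cache/*", count=1000):
        r.expire(key, 600)                           # arbitrary 10-minute TTL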
(edit) If there were access to the INCRBY [2] function, we could have the script keep track of its own usage and stop inserting new items into the cache once it considered that too much. Trying to set/get a counter ourselves without atomicity might work, but could also prove to be off by a lot. Or maybe just implement a hard per-keyspace memory cap in gvm-libs. [2] https://redis.io/commands/incrby/
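To illustrate that accounting idea in Python terms (the real change would have to live in the NASL/gvm-libs KB layer; the budget, counter key and helper name below are made up):

# Sketch of the INCRBY-based accounting idea: keep a shared byte counter per
# keyspace and stop caching once a (hypothetical) budget is exceeded.
import redis

CACHE_BUDGET = 256 * 1024 * 1024    # hypothetical 256 MiB cap per keyspace
COUNTER_KEY = "Cache/__bytes_used"  # hypothetical bookkeeping key

def cache_response(r: redis.Redis, key: str, value: bytes) -> bool:
    used = int(r.get(COUNTER_KEY) or 0)
    if used + len(value) > CACHE_BUDGET:
        return False                # over budget: skip caching this response
    old = r.lrange(key, 0, -1)      # kb items are stored as redis lists
    r.delete(key)
    r.rpush(key, value)             # roughly what replace_kb_item() does
    # INCRBY itself is atomic, but the check above is not, hence the
    # "off by a lot" concern.
    r.incrby(COUNTER_KEY, len(value) - sum(len(v) for v in old))
    return True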
This really looks like a duplicate of greenbone/openvas-scanner#1488 and should IMHO be closed as such (as this is also not a problem in ospd-openvas).
Maybe you could post your (btw. great) analysis over there to have all the info collected in one place?
basically everything after /ui/
If this is the only problematic directory for this specific target system, it should also be possible to exclude that whole directory via the setting "Regex pattern to exclude directories from CGI scanning:" of the VT "Global variable settings" (Family: Settings, OID: 1.3.6.1.4.1.25623.1.0.12288).
I summarized the stuff here: https://github.com/greenbone/openvas-scanner/issues/1488#issuecomment-2009522054
I'm fine with closing.
I have a relatively small installation of the Greenbone community containers running on Docker (one VM): only 85 targets and 8 tasks. The VM has 6 vCPUs @ 2.2 GHz and 16 GB of RAM.
When I start more than one scan in Greenbone, all scans get stopped because openvas crashes. When the problem happens, redis tries to allocate more memory than the total memory plus swap of the VM.
The redis server prints this log:
vm.overcommit_memory is set in sysctl.conf like this:
vm.overcommit_ratio is calculated from the swap size.
After redis is killed, the openvas log shows: