Hello, @lukemonahantnt!
Given your limits, I'm not sure there is enough RAM and CPU to start Redis.
If you have the same problem without RAM limits, then please check how much free RAM your server has (just run top on Linux).
I'm 99% sure that you either don't have enough free RAM on your server or you set very small RAM limits for Redis. Our recommendation is to use a node with 8+ GB RAM.
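For example, checking free memory both on the node itself and as Kubernetes sees it might look like this (the kubectl top command assumes metrics-server is installed):

```sh
# Memory on the node itself, in MB
free -m

# Node memory usage as reported to Kubernetes (requires metrics-server)
kubectl top nodes
```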
Hi @ikheifets-splunk:
My node has 20GB memory and plenty of that is still free, even during this condition.
total used free shared buff/cache available
Mem: 19832 8476 9243 3 2347 11355
Swap: 0 0 0
However, the redis container is still killed by the OOM killer on every startup. I assume this is due to the limits in the container spec (?).
kernel: Memory cgroup out of memory: Killed process 1643372 (redis-server) total-vm:351508kB, anon-rss:191088kB, file-rss:0kB, shmem-rss:0kB, UI>
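For reference, this is roughly how I'm checking which limits are actually applied to the running container (the sc4snmp namespace is an assumption on my side; adjust for your install):

```sh
# Show the resource requests/limits set on the redis container
# (namespace and pod name may differ in your install)
kubectl -n sc4snmp get pod splunk-connect-for-snmp-redis-master-0 \
  -o jsonpath='{.spec.containers[*].resources}'
```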
The container limits I have posted are the defaults that come with installing via the SC4SNMP Helm chart, hence I'm trying to increase them, as they do seem quite small.
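Something like the override below is what I'm experimenting with to raise them. This is only a sketch: I'm assuming the chart passes these values through to the bundled Bitnami Redis subchart under the redis key, so the exact paths (and the release name placeholder) may differ by chart version.

```sh
# Hypothetical override file raising the Redis memory/CPU limits
# (redis.master.resources assumes the Bitnami Redis subchart; check your chart's values)
cat <<'EOF' > redis-values.yaml
redis:
  master:
    resources:
      requests:
        memory: 512Mi
        cpu: 250m
      limits:
        memory: 1Gi
        cpu: 500m
EOF

# Apply to the existing release (release name and namespace are placeholders)
helm upgrade <release> splunk-connect-for-snmp/splunk-connect-for-snmp \
  -n sc4snmp -f values.yaml -f redis-values.yaml
```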
I appear to hit the out-of-memory issue when adding some inventory items that need a large walk. Removing these inventory items (and their walk profile) restores stability.
I appear to hit the out-of-memory issue when adding some inventory items that need a large walk. Removing these inventory items (and their walk profile) restores stability.
@lukemonahantnt How big is your inventory?
In general, we use Redis as the backend for the Celery queue that runs the periodic tasks. If you really have a huge inventory and you don't have enough workers (nodes) to consume the queue, then Redis might run out of memory.
My suggestion is to increase the polling / walk interval; it will help keep the queue in Redis smaller. Let's start with a 1h polling interval and, if that's okay, decrease it. If increasing the polling interval is not okay for you, just use more nodes in your cluster: they will consume your Redis queue much faster, and it will be impossible for the queue to run Redis out of memory.
Thanks @ikheifets-splunk
How big is your inventory?
Quite large (I think): 600+ items total. However, it has been stable until adding just 2 F5 BIGIP devices with a custom walk profile that included F5-BIGIP-SYSTEM-MIB and F5-BIGIP-LOCAL-MIB.
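For reference, the custom walk profile is along these lines in values.yaml. This is a rough sketch from memory following the SC4SNMP profile configuration format, so the exact schema may differ by version; narrowing the varBinds to specific tables instead of whole MIB trees is one thing I'm looking at to shrink each walk.

```sh
# Rough sketch of the custom F5 walk profile as a values override
# (schema per the SC4SNMP profile docs; exact format may differ by version)
cat <<'EOF' > f5-profile-values.yaml
scheduler:
  profiles: |
    f5_walk:
      condition:
        type: "walk"
      varBinds:
        - ['F5-BIGIP-SYSTEM-MIB']
        - ['F5-BIGIP-LOCAL-MIB']
EOF
```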
increase polling / walk interval
The walk interval has been set to 4 hours and the polling interval to 5 minutes for all items. A polling interval of 1 hour would make the metrics not very useful for monitoring.
just use more nodes in your cluster: they will consume your Redis queue much faster, and it will be impossible for the queue to run Redis out of memory
I can more easily add poller worker instances on the single node, so I will increase that for now to help keep the queue down.
Is there any way to monitor the queue length? I assume what you are saying is that the task queue is getting too long because I am adding tasks faster than they are being consumed?
The walk interval has been set to 4 hours and the polling interval to 5 minutes for all items. A polling interval of 1 hour would make the metrics not very useful for monitoring.
If it's possible to increase from 5m to 15m or 25m, just experiment with that. If such an interval is too big for you, then you need to run more workers; you can find the guide here. That part will be the most helpful.
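For example, a rough way to run more poller workers on the same node (I'm assuming worker.poller.replicaCount here; verify the exact key against your chart's values and the linked guide):

```sh
# Run more poller workers so the Celery queue is drained faster
# (worker.poller.replicaCount is an assumption; check your chart version's values)
helm upgrade <release> splunk-connect-for-snmp/splunk-connect-for-snmp \
  -n sc4snmp -f values.yaml \
  --set worker.poller.replicaCount=4
```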
Is there any way to monitor the queue length? I assume what you are saying is that the task queue is getting too long because I am adding tasks faster than they are being consumed?
We have a monitoring dashboard; it doesn't show the queue length, but you can check for each of your SNMP devices that polling and walk are running correctly. If you really need to see the queue, just go into your Redis container and check it there; that's the simplest way.
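For example, something like this from outside the pod (the queue name celery is Celery's default broker list and the namespace is an assumption on my side; SC4SNMP may use additional queue names):

```sh
# Length of the default Celery queue (a Redis list, usually named "celery")
kubectl -n sc4snmp exec splunk-connect-for-snmp-redis-master-0 -- redis-cli LLEN celery

# How much memory Redis is using right now
kubectl -n sc4snmp exec splunk-connect-for-snmp-redis-master-0 -- redis-cli INFO memory

# List keys if you are unsure which queues exist (fine on a small instance)
kubectl -n sc4snmp exec splunk-connect-for-snmp-redis-master-0 -- redis-cli --scan
```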
@lukemonahantnt to summarise:
P.S. If needed, I am available for a 1-hour call tomorrow between 3 and 9 PM CET; if that time is okay, send me an invite by email.
@ikheifets-splunk Thanks: the timezone was completely wrong for me, but it's appreciated anyway.
Will close this issue for now. I am tuning down the number of SNMP poll/walk tasks and OIDs per task as far as possible, and will see if I can get more node resources.
A great future feature would be to add telemetry around the task queue: current length, task wait time, etc.
I started experiencing an out-of-memory error, causing the splunk-connect-for-snmp-redis-master-0 pod to enter a crash loop: it was killed by the OOM killer on every startup.
I increased the limits for this pod in values.yaml, as per: https://splunk.github.io/splunk-connect-for-snmp/main/configuration/deployment-configuration/#shared-values
However, this had no effect on the running pod:
Patching the container after creation was not possible. The only way to resolve my OOM crash was to uninstall, remove the Redis PVC and PV, and then reinstall.
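Roughly the sequence that worked for me, for anyone hitting the same thing. The release name, namespace, and PVC name below are placeholders/assumptions from my environment (the PVC follows the Bitnami redis-data-<pod> pattern), so check kubectl get pvc first.

```sh
# Find the Redis PVC name first
kubectl -n sc4snmp get pvc

# Tear down the release and drop the Redis data volume
helm uninstall <release> -n sc4snmp
kubectl -n sc4snmp delete pvc redis-data-splunk-connect-for-snmp-redis-master-0
# If the reclaim policy is Retain, the released PV also needs removing:
# kubectl get pv; kubectl delete pv <pv-name>

# Reinstall with the updated values.yaml so the new limits apply from the start
helm install <release> splunk-connect-for-snmp/splunk-connect-for-snmp -n sc4snmp -f values.yaml
```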