splunk / splunk-connect-for-snmp

Splunk Connect for SNMP
https://splunk.github.io/splunk-connect-for-snmp/
Apache License 2.0

Redis container limits configuration appears to have no effect #1114

Closed lukemonahantnt closed 4 weeks ago

lukemonahantnt commented 4 weeks ago

I started experiencing an out-of-memory error, causing the splunk-connect-for-snmp-redis-master-0 pod to enter a crash loop: it was killed by the OOM killer every time it restarted.

I increased the limits for this pod in values.yaml, as per: https://splunk.github.io/splunk-connect-for-snmp/main/configuration/deployment-configuration/#shared-values

redis:
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 250m
      memory: 512Mi

However, this has no effect on the running pod:

# kubectl describe pod splunk-connect-for-snmp-redis-master-0 -n splunk-snmp

<snip>

    Limits:
      cpu:                150m
      ephemeral-storage:  2Gi
      memory:             192Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  50Mi
      memory:             128Mi

Patching the container after creation was not possible. The only way to solve my OOM crash was to uninstall, remove the redis PVC and PV, and then reinstall.
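One thing worth checking: if the chart pulls Redis in as the Bitnami subchart (which its dependencies suggest), that subchart normally reads the master's resources from redis.master.resources rather than redis.resources. I have not verified this against the SC4SNMP templates, so treat this values.yaml variant as a sketch:

redis:
  master:
    resources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 250m
        memory: 512Mi

Rendering the chart locally shows whether an override actually lands in the Redis StatefulSet before upgrading (the release name and repo alias below are placeholders, adjust to your install):

# render the chart with the custom values and inspect the resources blocks
helm template my-release splunk-connect-for-snmp/splunk-connect-for-snmp -f values.yaml --namespace splunk-snmp > rendered.yaml
grep -B2 -A8 'resources:' rendered.yaml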

ikheifets-splunk commented 4 weeks ago

Hello, @lukemonahantnt! Based on those limits, I'm not sure there is enough RAM and CPU to start Redis. If you have the same problem without RAM limits, please check how much free RAM your server has (just run top on Linux).

I'm 99% sure that either you don't have enough free RAM on your server or the RAM limits you provided are too small to start Redis. Our recommendation is to use a node with 8+ GB of RAM.

lukemonahantnt commented 4 weeks ago

Hi @ikheifets-splunk :

My node has 20 GB of memory and plenty of it is still free, even while this is happening.

               total        used        free      shared  buff/cache   available
Mem:           19832        8476        9243           3        2347       11355
Swap:              0           0           0

However, the Redis container is still killed by the OOM killer on every startup. I assume this is due to the limits in the container spec(?).

kernel: Memory cgroup out of memory: Killed process 1643372 (redis-server) total-vm:351508kB, anon-rss:191088kB, file-rss:0kB, shmem-rss:0kB, UI>
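As a sanity check, the anon-rss in that kernel message (~191088 kB) is right at the 192Mi ceiling, so the numbers line up with the chart's default limit. The limit the kernel is actually enforcing can be read from inside the container; the path depends on whether the node uses cgroup v1 or v2, so this is only a sketch:

# cgroup v2 nodes
kubectl exec -n splunk-snmp splunk-connect-for-snmp-redis-master-0 -- cat /sys/fs/cgroup/memory.max
# cgroup v1 nodes
kubectl exec -n splunk-snmp splunk-connect-for-snmp-redis-master-0 -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# a 192Mi limit should come back as 201326592 bytes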

The container limits I posted are what the SC4SNMP Helm chart installs by default, hence my attempt to increase them, as they do seem quite small.

I appear to hit the out-of-memory issue when adding inventory items that need a large walk. Removing these inventory items (and their walk profile) restabilises things.

ikheifets-splunk commented 4 weeks ago

I appear to hit the out-of-memory issue when adding inventory items that need a large walk. Removing these inventory items (and their walk profile) restabilises things.

@lukemonahantnt How big is your inventory?

In general, we use Redis as the backend for the Celery queue that runs periodic tasks. If you have a really huge inventory and you don't have enough workers (nodes) to consume the queue, then Redis might run out of memory.

My proposal: increase the polling/walk interval, which will help keep the queue in Redis smaller. Let's start with a 1h polling interval and, if that is okay, decrease it. If increasing the polling interval is not acceptable, just use more nodes in your cluster; they will consume your Redis queue much faster and the queue will not run out of memory.
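For example, the walk interval lives in the inventory row and the polling interval in the profile's frequency, roughly like this in values.yaml (keys and column layout are from the docs as I remember them, so treat it as a sketch and check against your SC4SNMP version):

scheduler:
  profiles: |
    my_profile:
      frequency: 3600   # polling interval in seconds, 1h to start with
      varBinds:
        - ['IF-MIB']
poller:
  inventory: |
    address,port,version,community,secret,security_engine,walk_interval,profiles,smart_profiles,delete
    10.0.0.1,161,2c,public,,,14400,my_profile,f,f

Here walk_interval is in seconds (14400 = 4h) and 10.0.0.1 is a placeholder address.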

lukemonahantnt commented 4 weeks ago

Thanks @ikheifets-splunk

How big is your inventory?

Quite large (I think): 600+ items total. However, it had been stable until I added just 2 F5 BIG-IP devices with a custom walk profile that included F5-BIGIP-SYSTEM-MIB and F5-BIGIP-LOCAL-MIB.
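For reference, the custom walk profile is roughly this (reconstructed from memory, so the exact keys may be slightly off); I suspect trimming the varBinds down to the specific tables we actually need, instead of the whole MIBs, would shrink the walk considerably:

scheduler:
  profiles: |
    f5_bigip_walk:
      condition:
        type: "walk"
      varBinds:
        - ['F5-BIGIP-SYSTEM-MIB']
        - ['F5-BIGIP-LOCAL-MIB']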

increase the polling/walk interval

The walk interval is set to 4 hours and the polling interval to 5 minutes for all items. A polling interval of 1 hour would make the metrics not very useful for monitoring.

just use more nodes in your cluster; they will consume your Redis queue much faster and the queue will not run out of memory

It is easier for me to add poller worker instances on the single node, so I will increase that count for now to help keep the queue down.
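If I am reading the shared values documentation right, that should just be a replica count bump along these lines (key names taken from the docs but not yet verified on my install):

worker:
  poller:
    replicaCount: 4   # raise from the chart default to drain the queue faster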

Is there any way to monitor the queue length? I assume what you are saying is that the task queue is getting too long because tasks are being added faster than they are consumed?

ikheifets-splunk commented 4 weeks ago

The walk interval is set to 4 hours and the polling interval to 5 minutes for all items. A polling interval of 1 hour would make the metrics not very useful for monitoring.

If it's possible, try increasing it from 5m to 15m or 25m and experiment with that. If such an interval is too big for you, then you need to run more workers; the guide you can find here. The most helpful will be that part.

Is there any way to monitor the queue length? I assume what you are saying is that the task queue is getting too long because tasks are being added faster than they are consumed?

We have a monitoring dashboard. It doesn't show queue length, but you can check for each of your SNMP devices that polling and walk are running correctly. If you really need the queue length, the simplest way is to go into your Redis container and check it there.
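For example, something like this; the queue names depend on the Celery routing configuration, so list the keys first (just a sketch, adjust the namespace and queue name to what you see):

# Celery queues show up as Redis lists
kubectl exec -n splunk-snmp splunk-connect-for-snmp-redis-master-0 -- redis-cli KEYS '*'
# queue length; replace 'celery' with whichever list name you see above
kubectl exec -n splunk-snmp splunk-connect-for-snmp-redis-master-0 -- redis-cli LLEN celery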

ikheifets-splunk commented 4 weeks ago

@lukemonahantnt to summarise:

P.S. If needed, I will be available for a 1-hour call tomorrow, 3-9 PM CET; if that time is okay, send me an invite by email.

lukemonahantnt commented 4 weeks ago

@ikheifets-splunk Thanks: the timezone was completely wrong for me, but appreciated anyway.

Will close this issue for now. I am tuning the number of SNMP poll/walk tasks and OIDs per task down as far as possible, and will see if I can get more node resources.

A great future feature would be to add telemetry around the task queue: current length, task wait time etc.