nautobot / helm-charts

Helm Charts for using Nautobot in Kubernetes environments.
https://docs.nautobot.com/projects/helm-charts/en/stable/
Apache License 2.0

Nautobot container gets OOMKilled #433

Closed. AxFrancois closed this issue 2 months ago.

AxFrancois commented 2 months ago

Good morning everyone,

I recently began working on a proof of concept (PoC) for Nautobot and was tasked with installing it on our Kubernetes cluster. Unfortunately, we encountered an issue with the Nautobot container being OOMKilled by Kubernetes when starting the chart. Specifically, the containers for the Nautobot application ("default") were consistently terminated due to memory limits.

To address this, we deployed additional Kubernetes nodes with more memory, but the issue persisted. Eventually, we removed the resource limit of 8704M, and the application finally started. However, it exceeded the previously set memory limit:

kubectl top pod                                        
NAME                                               CPU(cores)   MEMORY(bytes)   
nautobot-***-celery-beat-c57dddd89-8pfh6      1m           178Mi           
nautobot-***-celery-default-bb8697c45-62kgj   2m           1290Mi          
nautobot-***-celery-default-bb8697c45-7259m   2m           1319Mi          
nautobot-***-default-66f6b4675f-zl42k         378m         8520Mi          
nautobot-***-postgresql-0                     13m          63Mi            
nautobot-***-redis-master-0                   21m          8Mi  

We were quite surprised by this outcome. To be fair, Nautobot has now become the most memory-intensive application on our cluster. Additionally, the pods do not seem to release this memory after startup. Is this expected behavior? We were under the impression that Nautobot was a relatively lightweight application, and these results have left us puzzled.

Per the official website, the default Nautobot application requests:

300m CPU (per pod) × 2 pods
1.2G RAM (per pod) × 2 pods

When digging a bit deeper, we determined that the processes responsible for the high memory usage were uWSGI and its forks. Is there any way to control this? Tweaking the processes and threads parameters didn't change anything.
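
For context, these are the relevant uwsgi.ini options we tried tweaking (a sketch; the values shown are illustrative, not the chart's defaults):

; excerpt from /opt/nautobot/uwsgi.ini (illustrative values)
; each worker is a fork of the master and holds its own copy of the application
processes = 3
; threads per worker process
threads = 2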

Any insights or suggestions would be greatly appreciated.

Best regards.

jdrew82 commented 2 months ago

If you're utilizing uWSGI on K8s, it's strongly advised that you review the concurrency and prefetch-multiplier settings that are detailed here: https://docs.nautobot.com/projects/core/en/stable/user-guide/administration/guides/celery-queues/. If your concurrency is set too high for your resource requests and limits, a large amount of memory will be taken by all the worker processes.
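
These settings map directly to flags on the Celery worker command; outside the chart, a worker could be started like this (a sketch, with illustrative values):

nautobot-server celery worker --concurrency 2 --prefetch-multiplier 1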

AxFrancois commented 2 months ago

Hi,

Thank you for your response, @jdrew82. I reviewed the linked documentation and, as suggested, attempted to reduce the concurrency setting for the Celery instances. I applied the following configuration in the Helm chart:

celery:
  # -- Enables the default Celery Worker for Nautobot
  enabled: true

  # -- Concurrency setting (0 = use the default, which is the number of CPUs available on your system) [ref](https://docs.nautobot.com/projects/core/en/stable/administration/celery-queues/#concurrency-setting)
  concurrency: 16

I also tested with concurrency values of 8 and 2. However, these adjustments had no impact on the memory usage of the nautobot-supersecretname-default pod.

Here's a little more detail that might help with troubleshooting.

When deploying Nautobot using the default Helm chart (with the replicaCount reduced to 1 and no other changes to resource requests, limits, or configuration), most of the pods start successfully, except for the Nautobot application pod (nautobot-supersecretname-default). The pod consistently encounters OOMKilled errors. Below are the logs and resource details:

kubectl get pods
NAME                                               READY   STATUS      RESTARTS      AGE
nautobot-supersecretname-celery-beat-c57dddd89-8pfh6      1/1     Running     0             23h
nautobot-supersecretname-celery-default-bb8697c45-62kgj   1/1     Running     0             23h
nautobot-supersecretname-celery-default-bb8697c45-7259m   1/1     Running     0             23h
nautobot-supersecretname-default-67b55dd56c-rjtnk         0/1     OOMKilled   4 (51s ago)   2m40s
nautobot-supersecretname-postgresql-0                     1/1     Running     0             23h
nautobot-supersecretname-redis-master-0                   1/1     Running     0             23h
kubectl logs nautobot-supersecretname-default-67b55dd56c-rjtnk -p
Defaulted container "nautobot" out of: nautobot, nautobot-init (init)
[uWSGI] getting INI configuration from /opt/nautobot/uwsgi.ini
[uwsgi-static] added mapping for /static => /opt/nautobot/static
*** Starting uWSGI 2.0.23 (64bit) on [Tue Aug 27 13:22:21 2024] ***
compiled with version: 12.2.0 on 08 July 2024 22:01:50
os: Linux-5.15.0-106.131.4.el9uek.x86_64 #2 SMP Fri Sep 22 18:04:33 PDT 2023
nodename: nautobot-supersecretname-default-67b55dd56c-rjtnk
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /opt/nautobot
detected binary path: /usr/local/bin/python3.11
!!! no internal routing support, rebuild with pcre support !!!
your memory page size is 4096 bytes
detected max file descriptor number: 1073741816
building mime-types dictionary from file /etc/mime.types...1545 entry found
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uWSGI http bound on 0.0.0.0:8080 fd 11
uWSGI http bound on 0.0.0.0:8443 fd 12
uwsgi socket 0 bound to TCP address 127.0.0.1:44813 (port auto-assigned) fd 10
Python version: 3.11.9 (main, Jul  3 2024, 00:12:48) [GCC 12.2.0]
--- Python VM already initialized ---
Python main interpreter initialized at 0x7f761be78658
python threads support enabled
your server socket listen backlog is limited to 128 connections
your mercy for graceful operations on workers is 60 seconds
mapped 333504 bytes (325 KB) for 6 cores
*** Operational MODE: preforking+threaded ***
13:22:21.825 INFO    nautobot :
  Nautobot initialized!
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x7f761be78658 pid: 1 (default app)
spawned uWSGI master process (pid: 1)
spawned uWSGI worker 1 (pid: 9, cores: 2)
spawned uWSGI worker 2 (pid: 10, cores: 2)
13:22:21.858 INFO    nautobot.core.wsgi :
  Closing existing DB and cache connections on worker 1 after uWSGI forked ...
spawned uWSGI worker 3 (pid: 12, cores: 2)
13:22:21.860 INFO    nautobot.core.wsgi :
  Closing existing DB and cache connections on worker 2 after uWSGI forked ...
13:22:21.861 INFO    nautobot.core.wsgi :
  Closing existing DB and cache connections on worker 3 after uWSGI forked ...
spawned uWSGI http 1 (pid: 14)
kubectl describe pod nautobot-supersecretname-default-67b55dd56c-rjtnk 
Name:             nautobot-supersecretname-default-67b55dd56c-rjtnk
Namespace:        nautobot-v2-1-3
Priority:         0
Service Account:  nautobot-supersecretname
Node:             si-kubenode05/10.255.254.57
Start Time:       Tue, 27 Aug 2024 15:18:13 +0200
Labels:           app.kubernetes.io/component=nautobot-default
                  app.kubernetes.io/instance=nautobot-supersecretname
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=nautobot
                  app.kubernetes.io/version=2.2.7
                  helm.sh/chart=nautobot-2.1.3
                  pod-template-hash=67b55dd56c
Annotations:      <none>
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.244.165.217
IPs:
  IP:           10.244.165.217
Controlled By:  ReplicaSet/nautobot-supersecretname-default-67b55dd56c
Init Containers:
  nautobot-init:
    Container ID:  containerd://3d0968b366e9b627cd1bb1f26e4b05cb1ab462a3e0286948ab5a0106c8a8a7d4
    Image:         ghcr.io/nautobot/nautobot:2.2.7-py3.11
    Image ID:      ghcr.io/nautobot/nautobot@sha256:6b936558dd5e7368b2556eae4725f053346cfa802835f5076da14281d29c5f03
    Port:          <none>
    Host Port:     <none>
    Args:
      echo
      Nautobot Initialization Done
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 27 Aug 2024 15:18:14 +0200
      Finished:     Tue, 27 Aug 2024 15:18:45 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  8704M
    Requests:
      cpu:     300m
      memory:  1Gi
    Environment Variables from:
      nautobot-supersecretname-env  ConfigMap  Optional: false
      nautobot-supersecretname-env  Secret     Optional: false
    Environment:
      NAUTOBOT_DB_PASSWORD:     <set to the key 'password' in secret 'nautobot-supersecretname-postgresql'>      Optional: false
      NAUTOBOT_REDIS_PASSWORD:  <set to the key 'redis-password' in secret 'nautobot-supersecretname-redis'>     Optional: false
      NAUTOBOT_SECRET_KEY:      <set to the key 'NAUTOBOT_SECRET_KEY' in secret 'nautobot-supersecretname-env'>  Optional: false
    Mounts:
      /opt/nautobot/git from git-repos (rw)
      /opt/nautobot/static from nautobot-static (rw)
      /opt/nautobot/uwsgi.ini from nautobot-config (rw,path="uwsgi.ini")
Containers:
  nautobot:
    Container ID:  containerd://f6f18a989142bc54aeafed5f865fb9ccd58a83cd718d8eeb1787d12b8a428b48
    Image:         ghcr.io/nautobot/nautobot:2.2.7-py3.11
    Image ID:      ghcr.io/nautobot/nautobot@sha256:6b936558dd5e7368b2556eae4725f053346cfa802835f5076da14281d29c5f03
    Ports:         8443/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      nautobot-server
      start
      --ini
      /opt/nautobot/uwsgi.ini
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 27 Aug 2024 15:25:04 +0200
      Finished:     Tue, 27 Aug 2024 15:25:13 +0200
    Ready:          False
    Restart Count:  6
    Limits:
      cpu:     1
      memory:  8704M
    Requests:
      cpu:      300m
      memory:   1Gi
    Liveness:   exec [bash -c nautobot-server health_check] delay=15s timeout=10s period=15s #success=1 #failure=3
    Readiness:  http-get http://:http/health/ delay=15s timeout=5s period=10s #success=1 #failure=3
    Environment Variables from:
      nautobot-supersecretname-env  ConfigMap  Optional: false
      nautobot-supersecretname-env  Secret     Optional: false
    Environment:
      NAUTOBOT_K8S_COMPONENT:   nautobot-default
      NAUTOBOT_DB_PASSWORD:     <set to the key 'password' in secret 'nautobot-supersecretname-postgresql'>      Optional: false
      NAUTOBOT_REDIS_PASSWORD:  <set to the key 'redis-password' in secret 'nautobot-supersecretname-redis'>     Optional: false
      NAUTOBOT_SECRET_KEY:      <set to the key 'NAUTOBOT_SECRET_KEY' in secret 'nautobot-supersecretname-env'>  Optional: false
    Mounts:
      /opt/nautobot/git from git-repos (rw)
      /opt/nautobot/media from nautobot-media (rw)
      /opt/nautobot/static from nautobot-static (rw)
      /opt/nautobot/uwsgi.ini from nautobot-config (rw,path="uwsgi.ini")
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  nautobot-static:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  nautobot-media:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  git-repos:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  nautobot-config:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        nautobot-supersecretname-config
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulling  7m26s                   kubelet  Pulling image "ghcr.io/nautobot/nautobot:2.2.7-py3.11"
  Normal   Pulled   7m25s                   kubelet  Successfully pulled image "ghcr.io/nautobot/nautobot:2.2.7-py3.11" in 481ms (481ms including waiting)
  Normal   Created  7m25s                   kubelet  Created container nautobot-init
  Normal   Started  7m25s                   kubelet  Started container nautobot-init
  Normal   Pulled   6m54s                   kubelet  Successfully pulled image "ghcr.io/nautobot/nautobot:2.2.7-py3.11" in 488ms (488ms including waiting)
  Normal   Pulled   6m44s                   kubelet  Successfully pulled image "ghcr.io/nautobot/nautobot:2.2.7-py3.11" in 480ms (480ms including waiting)
  Normal   Pulled   6m26s                   kubelet  Successfully pulled image "ghcr.io/nautobot/nautobot:2.2.7-py3.11" in 456ms (456ms including waiting)
  Normal   Pulling  5m46s (x4 over 6m54s)   kubelet  Pulling image "ghcr.io/nautobot/nautobot:2.2.7-py3.11"
  Normal   Created  5m46s (x4 over 6m53s)   kubelet  Created container nautobot
  Normal   Pulled   5m46s                   kubelet  Successfully pulled image "ghcr.io/nautobot/nautobot:2.2.7-py3.11" in 496ms (496ms including waiting)
  Normal   Started  5m45s (x4 over 6m53s)   kubelet  Started container nautobot
  Warning  BackOff  2m18s (x20 over 6m36s)  kubelet  Back-off restarting failed container nautobot in pod nautobot-supersecretname-default-67b55dd56c-rjtnk_nautobot-v2-1-3(58a5f0a1-0d46-4c0c-8279-03e75e05d49e)

With the default Helm configuration, the nautobot-default pod crashes with a memory limit of 8704M (~8300 MiB). Removing the memory limit allows the pod to start, but it then consumes significantly more memory, exceeding that limit. Here are the current results:

kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
nautobot-supersecretname-celery-beat-c57dddd89-8pfh6      1/1     Running   0          24h
nautobot-supersecretname-celery-default-bb8697c45-62kgj   1/1     Running   0          24h
nautobot-supersecretname-celery-default-bb8697c45-7259m   1/1     Running   0          24h
nautobot-supersecretname-default-66f6b4675f-rtthp         1/1     Running   0          61s
nautobot-supersecretname-postgresql-0                     1/1     Running   0          24h
nautobot-supersecretname-redis-master-0                   1/1     Running   0          24h
kubectl top pod 
NAME                                               CPU(cores)   MEMORY(bytes)   
nautobot-supersecretname-celery-beat-c57dddd89-pmvg8      1m           178Mi           
nautobot-supersecretname-celery-default-d9c866796-flx92   2m           349Mi           
nautobot-supersecretname-celery-default-d9c866796-hhhfv   2m           349Mi           
nautobot-supersecretname-default-66f6b4675f-rtthp         560m         8802Mi          
nautobot-supersecretname-postgresql-0                     18m          60Mi            
nautobot-supersecretname-redis-master-0                   17m          8Mi
kubectl logs nautobot-supersecretname-default-66f6b4675f-rtthp
Defaulted container "nautobot" out of: nautobot, nautobot-init (init)
[uWSGI] getting INI configuration from /opt/nautobot/uwsgi.ini
[uwsgi-static] added mapping for /static => /opt/nautobot/static
*** Starting uWSGI 2.0.23 (64bit) on [Tue Aug 27 14:12:23 2024] ***
compiled with version: 12.2.0 on 08 July 2024 22:01:50
os: Linux-5.15.0-106.131.4.el9uek.x86_64 #2 SMP Fri Sep 22 18:04:33 PDT 2023
nodename: nautobot-supersecretname-default-66f6b4675f-rtthp
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /opt/nautobot
detected binary path: /usr/local/bin/python3.11
!!! no internal routing support, rebuild with pcre support !!!
your memory page size is 4096 bytes
detected max file descriptor number: 1073741816
building mime-types dictionary from file /etc/mime.types...1545 entry found
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uWSGI http bound on 0.0.0.0:8080 fd 11
uWSGI http bound on 0.0.0.0:8443 fd 12
uwsgi socket 0 bound to TCP address 127.0.0.1:41027 (port auto-assigned) fd 10
Python version: 3.11.9 (main, Jul  3 2024, 00:12:48) [GCC 12.2.0]
--- Python VM already initialized ---
Python main interpreter initialized at 0x7fea32876658
python threads support enabled
your server socket listen backlog is limited to 128 connections
your mercy for graceful operations on workers is 60 seconds
mapped 333504 bytes (325 KB) for 6 cores
*** Operational MODE: preforking+threaded ***
14:12:23.688 INFO    nautobot :
  Nautobot initialized!
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x7fea32876658 pid: 1 (default app)
spawned uWSGI master process (pid: 1)
spawned uWSGI worker 1 (pid: 9, cores: 2)
14:12:23.722 INFO    nautobot.core.wsgi :
  Closing existing DB and cache connections on worker 1 after uWSGI forked ...
spawned uWSGI worker 2 (pid: 10, cores: 2)
14:12:23.724 INFO    nautobot.core.wsgi :
  Closing existing DB and cache connections on worker 2 after uWSGI forked ...
spawned uWSGI worker 3 (pid: 12, cores: 2)
14:12:23.726 INFO    nautobot.core.wsgi :
  Closing existing DB and cache connections on worker 3 after uWSGI forked ...
spawned uWSGI http 1 (pid: 14)

This memory usage is constant and the memory is never released (Grafana screenshot).

We logged on to the node and determined that the processes responsible for the very high memory usage are the nautobot-server processes (screenshot).

We are concerned about the high memory usage of the Nautobot pod and would like to know if this is expected behavior. If so, could you please clarify why it differs so significantly from the low resource requirements mentioned in the documentation?

Thanks again for the help,

Best regards.

jdrew82 commented 2 months ago

@AxFrancois Thank you for the information. That helps clear things up a bit. First, I want to start by stating that the limits and requests you see in the Helm chart are just examples and are NOT recommended values. As every situation is unique and your usage patterns will vary from others, it's up to you to determine reasonable values for those. As for why so much memory is used, I'd have to assume it's due to the fact that you have your concurrency set to 16. That'd be the nautobot-server processes you're seeing, as the command to start Celery is a nautobot-server command. Each worker process will reserve a bit of memory when it first starts.

In addition, for our standard deployments with all services residing on a single VM, we suggest 4 vCPU and 16 GB of memory. That is for a basic installation with very little data in it. As you add more data and use the application more heavily, your resources will need to be adjusted. Also bear in mind that, depending on which Apps you have installed, resource usage can be high too. For example, Golden Config and SSoT can both use a lot of memory.
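
If you do want different values, they can be overridden in your values.yaml. A minimal sketch, assuming the resources block sits under the top-level nautobot key as in the chart's default values:

nautobot:
  resources:
    # illustrative numbers only; size these to your own workload
    requests:
      cpu: "300m"
      memory: "1Gi"
    limits:
      cpu: "1"
      memory: "2Gi"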

AxFrancois commented 2 months ago

Hi,

Thanks again for your response. I finally figured out what the issue was!

I kept searching for the cause of the high memory usage with uWSGI and found these two issues:

In fact, right at the start of the logs, the pod reported: "detected max file descriptor number: 1073741816." The issues recommended setting this value to 1048576, since uWSGI apparently sizes some internal per-file-descriptor structures from this limit, so an absurdly high detected limit translates directly into an enormous allocation. Following the suggestions, I updated the Nautobot container command to the following:

  command:
    - "nautobot-server"
    - "start"
    - "--ini"
    - "/opt/nautobot/uwsgi.ini"
    - "--max-fd"
    - "1048576"

After upgrading the chart, the memory usage of the nautobot-supersecretname-default pod dropped significantly right away!

NAME                                                CPU(cores)   MEMORY(bytes)   
nautobot-supersecretname-celery-beat-6cbd7f6d67-rd94b      1m           179Mi           
nautobot-supersecretname-celery-default-74fcddfd55-67btq   2m           1299Mi          
nautobot-supersecretname-celery-default-74fcddfd55-k6cpb   2m           1305Mi          
nautobot-supersecretname-default-7cc779466f-79jbh          387m         635Mi           
nautobot-supersecretname-default-7cc779466f-stqtk          407m         432Mi           
nautobot-supersecretname-postgresql-0                      14m          59Mi            
nautobot-supersecretname-redis-master-0                    15m          8Mi  

The nautobot-supersecretname-default pods now use less than 700Mi each, instead of the previous 8-9 Gi!

Thanks again for the help,

Best regards.