AxFrancois closed this issue 2 months ago.
If you're utilizing uWSGI on K8s it's strongly advised that you review the concurrency and prefetch-multiplier settings that are detailed here: https://docs.nautobot.com/projects/core/en/stable/user-guide/administration/guides/celery-queues/. If your concurrency is set too high for the resource requests and limits then you'll have a large amount of memory taken by all the worker processes.
Hi,
Thank you for your response, @jdrew82. I reviewed the linked documentation and, as suggested, attempted to reduce the concurrency setting for the Celery instances. I applied the following configuration in the Helm chart:
celery:
  # -- Enables the default Celery worker for Nautobot
  enabled: true
  # -- Concurrency setting (0: use the default, i.e. the number of CPUs available on the system)
  # -- ref: https://docs.nautobot.com/projects/core/en/stable/administration/celery-queues/#concurrency-setting
  concurrency: 16
I also tested with concurrency values of 8 and 2. However, these adjustments had no impact on the memory usage of the nautobot-supersecretname-default pod.
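For background on why concurrency affects memory (in the Celery worker pods, at least): each unit of concurrency is a forked worker process holding its own copy of the application, so total worker memory grows roughly linearly with the setting. A rough sketch, where the ~250 MiB per-fork figure is purely an assumption for illustration (real usage depends on installed Apps and data):

```shell
# Total Celery worker memory ~= concurrency x per-fork RSS.
# per_worker_mib=250 is an assumed illustrative figure, not a measured one.
per_worker_mib=250
for c in 2 8 16; do
  echo "concurrency=$c -> ~$((c * per_worker_mib)) MiB"
done
```

This explains the worker pods' footprint, not the default pod's, which is what ultimately mattered here.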
Here's a little more detail that might help with troubleshooting.
When deploying Nautobot using the default Helm chart (with replicaCount reduced to 1 and no other changes to resource requests, limits, or configuration), most of the pods start successfully, except for the Nautobot application pod (nautobot-supersecretname-default). That pod consistently encounters OOMKilled errors. Below are the logs and resource details:
kubectl get pods
NAME READY STATUS RESTARTS AGE
nautobot-supersecretname-celery-beat-c57dddd89-8pfh6 1/1 Running 0 23h
nautobot-supersecretname-celery-default-bb8697c45-62kgj 1/1 Running 0 23h
nautobot-supersecretname-celery-default-bb8697c45-7259m 1/1 Running 0 23h
nautobot-supersecretname-default-67b55dd56c-rjtnk 0/1 OOMKilled 4 (51s ago) 2m40s
nautobot-supersecretname-postgresql-0 1/1 Running 0 23h
nautobot-supersecretname-redis-master-0 1/1 Running 0 23h
kubectl logs nautobot-supersecretname-default-67b55dd56c-rjtnk -p
Defaulted container "nautobot" out of: nautobot, nautobot-init (init)
[uWSGI] getting INI configuration from /opt/nautobot/uwsgi.ini
[uwsgi-static] added mapping for /static => /opt/nautobot/static
*** Starting uWSGI 2.0.23 (64bit) on [Tue Aug 27 13:22:21 2024] ***
compiled with version: 12.2.0 on 08 July 2024 22:01:50
os: Linux-5.15.0-106.131.4.el9uek.x86_64 #2 SMP Fri Sep 22 18:04:33 PDT 2023
nodename: nautobot-supersecretname-default-67b55dd56c-rjtnk
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /opt/nautobot
detected binary path: /usr/local/bin/python3.11
!!! no internal routing support, rebuild with pcre support !!!
your memory page size is 4096 bytes
detected max file descriptor number: 1073741816
building mime-types dictionary from file /etc/mime.types...1545 entry found
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uWSGI http bound on 0.0.0.0:8080 fd 11
uWSGI http bound on 0.0.0.0:8443 fd 12
uwsgi socket 0 bound to TCP address 127.0.0.1:44813 (port auto-assigned) fd 10
Python version: 3.11.9 (main, Jul 3 2024, 00:12:48) [GCC 12.2.0]
--- Python VM already initialized ---
Python main interpreter initialized at 0x7f761be78658
python threads support enabled
your server socket listen backlog is limited to 128 connections
your mercy for graceful operations on workers is 60 seconds
mapped 333504 bytes (325 KB) for 6 cores
*** Operational MODE: preforking+threaded ***
13:22:21.825 INFO nautobot :
Nautobot initialized!
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x7f761be78658 pid: 1 (default app)
spawned uWSGI master process (pid: 1)
spawned uWSGI worker 1 (pid: 9, cores: 2)
spawned uWSGI worker 2 (pid: 10, cores: 2)
13:22:21.858 INFO nautobot.core.wsgi :
Closing existing DB and cache connections on worker 1 after uWSGI forked ...
spawned uWSGI worker 3 (pid: 12, cores: 2)
13:22:21.860 INFO nautobot.core.wsgi :
Closing existing DB and cache connections on worker 2 after uWSGI forked ...
13:22:21.861 INFO nautobot.core.wsgi :
Closing existing DB and cache connections on worker 3 after uWSGI forked ...
spawned uWSGI http 1 (pid: 14)
kubectl describe pod nautobot-supersecretname-default-67b55dd56c-rjtnk
Name: nautobot-supersecretname-default-67b55dd56c-rjtnk
Namespace: nautobot-v2-1-3
Priority: 0
Service Account: nautobot-supersecretname
Node: si-kubenode05/10.255.254.57
Start Time: Tue, 27 Aug 2024 15:18:13 +0200
Labels: app.kubernetes.io/component=nautobot-default
app.kubernetes.io/instance=nautobot-supersecretname
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=nautobot
app.kubernetes.io/version=2.2.7
helm.sh/chart=nautobot-2.1.3
pod-template-hash=67b55dd56c
Annotations: <none>
Status: Running
SeccompProfile: RuntimeDefault
IP: 10.244.165.217
IPs:
IP: 10.244.165.217
Controlled By: ReplicaSet/nautobot-supersecretname-default-67b55dd56c
Init Containers:
nautobot-init:
Container ID: containerd://3d0968b366e9b627cd1bb1f26e4b05cb1ab462a3e0286948ab5a0106c8a8a7d4
Image: ghcr.io/nautobot/nautobot:2.2.7-py3.11
Image ID: ghcr.io/nautobot/nautobot@sha256:6b936558dd5e7368b2556eae4725f053346cfa802835f5076da14281d29c5f03
Port: <none>
Host Port: <none>
Args:
echo
Nautobot Initialization Done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 27 Aug 2024 15:18:14 +0200
Finished: Tue, 27 Aug 2024 15:18:45 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 8704M
Requests:
cpu: 300m
memory: 1Gi
Environment Variables from:
nautobot-supersecretname-env ConfigMap Optional: false
nautobot-supersecretname-env Secret Optional: false
Environment:
NAUTOBOT_DB_PASSWORD: <set to the key 'password' in secret 'nautobot-supersecretname-postgresql'> Optional: false
NAUTOBOT_REDIS_PASSWORD: <set to the key 'redis-password' in secret 'nautobot-supersecretname-redis'> Optional: false
NAUTOBOT_SECRET_KEY: <set to the key 'NAUTOBOT_SECRET_KEY' in secret 'nautobot-supersecretname-env'> Optional: false
Mounts:
/opt/nautobot/git from git-repos (rw)
/opt/nautobot/static from nautobot-static (rw)
/opt/nautobot/uwsgi.ini from nautobot-config (rw,path="uwsgi.ini")
Containers:
nautobot:
Container ID: containerd://f6f18a989142bc54aeafed5f865fb9ccd58a83cd718d8eeb1787d12b8a428b48
Image: ghcr.io/nautobot/nautobot:2.2.7-py3.11
Image ID: ghcr.io/nautobot/nautobot@sha256:6b936558dd5e7368b2556eae4725f053346cfa802835f5076da14281d29c5f03
Ports: 8443/TCP, 8080/TCP
Host Ports: 0/TCP, 0/TCP
Command:
nautobot-server
start
--ini
/opt/nautobot/uwsgi.ini
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 27 Aug 2024 15:25:04 +0200
Finished: Tue, 27 Aug 2024 15:25:13 +0200
Ready: False
Restart Count: 6
Limits:
cpu: 1
memory: 8704M
Requests:
cpu: 300m
memory: 1Gi
Liveness: exec [bash -c nautobot-server health_check] delay=15s timeout=10s period=15s #success=1 #failure=3
Readiness: http-get http://:http/health/ delay=15s timeout=5s period=10s #success=1 #failure=3
Environment Variables from:
nautobot-supersecretname-env ConfigMap Optional: false
nautobot-supersecretname-env Secret Optional: false
Environment:
NAUTOBOT_K8S_COMPONENT: nautobot-default
NAUTOBOT_DB_PASSWORD: <set to the key 'password' in secret 'nautobot-supersecretname-postgresql'> Optional: false
NAUTOBOT_REDIS_PASSWORD: <set to the key 'redis-password' in secret 'nautobot-supersecretname-redis'> Optional: false
NAUTOBOT_SECRET_KEY: <set to the key 'NAUTOBOT_SECRET_KEY' in secret 'nautobot-supersecretname-env'> Optional: false
Mounts:
/opt/nautobot/git from git-repos (rw)
/opt/nautobot/media from nautobot-media (rw)
/opt/nautobot/static from nautobot-static (rw)
/opt/nautobot/uwsgi.ini from nautobot-config (rw,path="uwsgi.ini")
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nautobot-static:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
nautobot-media:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
git-repos:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
nautobot-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nautobot-supersecretname-config
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 7m26s kubelet Pulling image "ghcr.io/nautobot/nautobot:2.2.7-py3.11"
Normal Pulled 7m25s kubelet Successfully pulled image "ghcr.io/nautobot/nautobot:2.2.7-py3.11" in 481ms (481ms including waiting)
Normal Created 7m25s kubelet Created container nautobot-init
Normal Started 7m25s kubelet Started container nautobot-init
Normal Pulled 6m54s kubelet Successfully pulled image "ghcr.io/nautobot/nautobot:2.2.7-py3.11" in 488ms (488ms including waiting)
Normal Pulled 6m44s kubelet Successfully pulled image "ghcr.io/nautobot/nautobot:2.2.7-py3.11" in 480ms (480ms including waiting)
Normal Pulled 6m26s kubelet Successfully pulled image "ghcr.io/nautobot/nautobot:2.2.7-py3.11" in 456ms (456ms including waiting)
Normal Pulling 5m46s (x4 over 6m54s) kubelet Pulling image "ghcr.io/nautobot/nautobot:2.2.7-py3.11"
Normal Created 5m46s (x4 over 6m53s) kubelet Created container nautobot
Normal Pulled 5m46s kubelet Successfully pulled image "ghcr.io/nautobot/nautobot:2.2.7-py3.11" in 496ms (496ms including waiting)
Normal Started 5m45s (x4 over 6m53s) kubelet Started container nautobot
Warning BackOff 2m18s (x20 over 6m36s) kubelet Back-off restarting failed container nautobot in pod nautobot-supersecretname-default-67b55dd56c-rjtnk_nautobot-v2-1-3(58a5f0a1-0d46-4c0c-8279-03e75e05d49e)
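A side note on the Exit Code 137 seen above: it is the usual 128 + signal-number encoding, and signal 9 is SIGKILL, which is what the kernel's OOM killer delivers when a container exceeds its memory limit:

```shell
# 137 = 128 + 9 (SIGKILL): the signature of an OOM kill
echo $((128 + 9))   # -> 137
```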
With the default Helm configuration, the nautobot-default pod crashes against its memory limit of 8704M (~8300 MiB). Removing the memory limit allows the pod to start, but it then consumes even more memory (exceeding 8300 MiB). Here are the current results:
kubectl get pods
NAME READY STATUS RESTARTS AGE
nautobot-supersecretname-celery-beat-c57dddd89-8pfh6 1/1 Running 0 24h
nautobot-supersecretname-celery-default-bb8697c45-62kgj 1/1 Running 0 24h
nautobot-supersecretname-celery-default-bb8697c45-7259m 1/1 Running 0 24h
nautobot-supersecretname-default-66f6b4675f-rtthp 1/1 Running 0 61s
nautobot-supersecretname-postgresql-0 1/1 Running 0 24h
nautobot-supersecretname-redis-master-0 1/1 Running 0 24h
kubectl top pod
NAME CPU(cores) MEMORY(bytes)
nautobot-supersecretname-celery-beat-c57dddd89-pmvg8 1m 178Mi
nautobot-supersecretname-celery-default-d9c866796-flx92 2m 349Mi
nautobot-supersecretname-celery-default-d9c866796-hhhfv 2m 349Mi
nautobot-supersecretname-default-66f6b4675f-rtthp 560m 8802Mi
nautobot-supersecretname-postgresql-0 18m 60Mi
nautobot-supersecretname-redis-master-0 17m 8Mi
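One detail worth noting when comparing these numbers: the Helm limit uses the decimal M suffix (10^6 bytes), while kubectl top reports binary Mi (2^20 bytes). Converting the 8704M limit shows why the pod above, at 8802Mi, is over it:

```shell
# Kubernetes "M" is decimal megabytes; kubectl top reports binary MiB.
limit_mib=$((8704 * 1000 * 1000 / 1024 / 1024))
echo "limit: ${limit_mib} MiB"   # -> limit: 8300 MiB (< the 8802Mi in use)
```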
kubectl logs nautobot-supersecretname-default-66f6b4675f-rtthp
Defaulted container "nautobot" out of: nautobot, nautobot-init (init)
[uWSGI] getting INI configuration from /opt/nautobot/uwsgi.ini
[uwsgi-static] added mapping for /static => /opt/nautobot/static
*** Starting uWSGI 2.0.23 (64bit) on [Tue Aug 27 14:12:23 2024] ***
compiled with version: 12.2.0 on 08 July 2024 22:01:50
os: Linux-5.15.0-106.131.4.el9uek.x86_64 #2 SMP Fri Sep 22 18:04:33 PDT 2023
nodename: nautobot-supersecretname-default-66f6b4675f-rtthp
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /opt/nautobot
detected binary path: /usr/local/bin/python3.11
!!! no internal routing support, rebuild with pcre support !!!
your memory page size is 4096 bytes
detected max file descriptor number: 1073741816
building mime-types dictionary from file /etc/mime.types...1545 entry found
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uWSGI http bound on 0.0.0.0:8080 fd 11
uWSGI http bound on 0.0.0.0:8443 fd 12
uwsgi socket 0 bound to TCP address 127.0.0.1:41027 (port auto-assigned) fd 10
Python version: 3.11.9 (main, Jul 3 2024, 00:12:48) [GCC 12.2.0]
--- Python VM already initialized ---
Python main interpreter initialized at 0x7fea32876658
python threads support enabled
your server socket listen backlog is limited to 128 connections
your mercy for graceful operations on workers is 60 seconds
mapped 333504 bytes (325 KB) for 6 cores
*** Operational MODE: preforking+threaded ***
14:12:23.688 INFO nautobot :
Nautobot initialized!
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x7fea32876658 pid: 1 (default app)
spawned uWSGI master process (pid: 1)
spawned uWSGI worker 1 (pid: 9, cores: 2)
14:12:23.722 INFO nautobot.core.wsgi :
Closing existing DB and cache connections on worker 1 after uWSGI forked ...
spawned uWSGI worker 2 (pid: 10, cores: 2)
14:12:23.724 INFO nautobot.core.wsgi :
Closing existing DB and cache connections on worker 2 after uWSGI forked ...
spawned uWSGI worker 3 (pid: 12, cores: 2)
14:12:23.726 INFO nautobot.core.wsgi :
Closing existing DB and cache connections on worker 3 after uWSGI forked ...
spawned uWSGI http 1 (pid: 14)
This memory usage is constant, and the memory is never released. We inspected the node and determined that the processes responsible for the very high memory usage are the nautobot-server processes.
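For anyone reproducing this, a minimal sketch of the kind of node-side inspection we did (standard procps tooling; the nautobot-server pattern is simply the process name observed on our node):

```shell
# List the ten processes holding the most resident memory on the node.
ps -eo pid,rss,comm --sort=-rss | head -n 10

# Sum the RSS of all nautobot-server processes, reported in MiB.
ps -eo rss,comm | awk '/nautobot-server/ { total += $1 } END { printf "%.0f MiB\n", total / 1024 }'
```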
We are concerned about the high memory usage of the Nautobot pod and would like to know whether this is expected behavior. If so, could you please clarify why it differs so significantly from the modest resource requirements mentioned in the documentation?
Thanks again for the help,
Best regards.
@AxFrancois Thank you for the information. That helps clear things up a bit. First, I want to start by stating that the limits and requests you see in the Helm chart are just examples and are NOT recommended values. Every situation is unique and your usage patterns will vary from others', so it is up to you to determine reasonable values for your deployment. As for why so much memory is used, I'd assume it's because you have your concurrency set to 16. That would be the nautobot-server process you're seeing, since the command that starts Celery is itself a nautobot-server command. Each worker process reserves a bit of memory when it first starts.
In addition, for our standard deployments with all services residing on a single VM, we suggest 4 vCPU and 16 GB of memory. That is for a basic installation with very little data in it. As you add more data and use the application more, your resources will need to be adjusted. Also bear in mind that resource usage depends on which Apps you have installed; for example, Golden Config and SSoT can both use a lot of memory.
Hi,
Thanks again for your response. I finally figured out what the issue was!
I kept searching for the cause of the high memory usage with uWSGI and found these two issues:
In fact, right at the start of the logs, the pod reported: "detected max file descriptor number: 1073741816." The issues recommended setting this value to 1048576. Following the suggestions, I updated the Nautobot container command to the following:
command:
  - "nautobot-server"
  - "start"
  - "--ini"
  - "/opt/nautobot/uwsgi.ini"
  - "--max-fd"
  - "1048576"
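To see why the detected descriptor limit translates into gigabytes of startup memory: uWSGI sizes some per-descriptor bookkeeping from the maximum file descriptor number, so an enormous limit inflates allocations at startup. Assuming on the order of 8 bytes per possible descriptor (an estimate, not a figure from the uWSGI docs, but one that lines up with what we observed):

```shell
# Back-of-the-envelope: per-descriptor bookkeeping scaled by the fd limit.
# The 8 bytes/descriptor figure is an assumption for illustration.
detected=1073741816   # from the log: "detected max file descriptor number"
capped=1048576        # the value passed via --max-fd
echo "detected -> ~$((detected * 8 / 1024 / 1024)) MiB"
echo "capped   -> ~$((capped * 8 / 1024 / 1024)) MiB"
```

With the uncapped limit this works out to roughly 8 GiB, which matches the pod's observed footprint; with --max-fd 1048576 it is only a few MiB.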
After upgrading the chart, the memory usage of the nautobot-supersecretname-default pod dropped significantly right away!
NAME CPU(cores) MEMORY(bytes)
nautobot-supersecretname-celery-beat-6cbd7f6d67-rd94b 1m 179Mi
nautobot-supersecretname-celery-default-74fcddfd55-67btq 2m 1299Mi
nautobot-supersecretname-celery-default-74fcddfd55-k6cpb 2m 1305Mi
nautobot-supersecretname-default-7cc779466f-79jbh 387m 635Mi
nautobot-supersecretname-default-7cc779466f-stqtk 407m 432Mi
nautobot-supersecretname-postgresql-0 14m 59Mi
nautobot-supersecretname-redis-master-0 15m 8Mi
nautobot-supersecretname-default now uses less than 700Mi, instead of the previous 8-9 Gi!
Thanks again for the help,
Best regards.
Good morning everyone,
I recently began working on a proof of concept (PoC) for Nautobot and was tasked with installing it on our Kubernetes cluster. Unfortunately, we encountered an issue with the Nautobot container being OOMKilled by Kubernetes when starting the chart. Specifically, the containers for the Nautobot application ("default") were consistently terminated due to memory limits.
To address this, we deployed additional Kubernetes nodes with more memory, but the issue persisted. Eventually, we removed the 8704M resource limit, and the application finally started; however, it then exceeded the previously set memory limit. We were quite surprised by this outcome. To be fair, Nautobot has now become the most memory-intensive application on our cluster. Additionally, the pods do not seem to release this memory after startup. Is this expected behavior? We were under the impression that Nautobot was a relatively lightweight application, and these results have left us puzzled.
When digging a bit deeper, we determined that the processes responsible for the high memory usage were uWSGI and its forks. Is there any way to control this? Tweaking the processes and threads parameters didn't change anything.
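For reference, the knobs we tried live in uwsgi.ini; a minimal illustrative fragment (option names are uWSGI's own, the values are arbitrary examples, not recommendations). As the rest of this thread shows, these settings were not the culprit:

```ini
; uwsgi.ini fragment -- illustrative values only, not recommendations
; number of preforked worker processes
processes = 3
; threads per worker
threads = 2
```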
Any insights or suggestions would be greatly appreciated.
Best regards.