Open bwmetcalf opened 2 years ago
Hey @bwmetcalf
Apologies for the late response; it seems like this fell through the cracks. Are you still experiencing this? It definitely isn't normal behavior.
I'm facing the same problem. Exec'ing into the container and running consul snapshot save backup.snap
immediately kills the container with OOMKilled and the event below. This is with the latest Helm chart version, 0.45.0:
2m46s Warning Unhealthy pod/consul-consul-server-1 Readiness probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task 4b135353c6edfe027d976c8b1aa1cbf685d2993f9b53d385c53615ef430f8af0 not found: not found
Hi @bobertrublik, I'd like to reproduce this behavior. Could you estimate how frequently services are registered/deregistered in the cluster? Thanks.
@bwmetcalf, you may consider increasing the value of raft_snapshot_interval to reduce how often Raft snapshots are written.
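For reference, a minimal sketch of where this would go in the Consul agent configuration. The values below are illustrative, not tuned recommendations:

```hcl
# Illustrative agent config snippet: raise the snapshot interval and
# threshold so Raft snapshots are taken less often. Pick values based
# on your own write rate and recovery-time requirements.
raft_snapshot_interval  = "120s"
raft_snapshot_threshold = 65536
```

Note that taking snapshots less often trades disk/CPU churn for a longer Raft log replay on restart, so it's worth testing before applying in production.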
Overview of the Issue
We are running Consul 1.12.0.
In one of our environments we are using Argo Workflows to process data. Argo spins up workflow pods to complete a task, and then they exit. These workflows can be spun up rapidly (multiple times per minute), which seems to greatly increase the size of the Register type in the consul snapshot. This is resulting in large snapshots that take several minutes to restore when a consul server pod is restarted. While restoring, the pod uses almost 9GB of memory for the snapshot snippet shown below and gets OOMKilled unless we set a high or null memory limit. Additionally, because of the time it takes to restore, even if we don't set a memory limit on the consul server, the health check probes fail and restart the pod. For example, from an environment using argo workflows:
and we see 1000s of these pods continue to be registered in consul even though they no longer exist, which is causing the Register type to grow with each snapshot.
The primary question is why 9GB of memory is used to restore a snapshot that is only 69MB.
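One way to see what is inflating the snapshot is the built-in inspect subcommand, which reports a per-type count and size breakdown (including the Register type). Using the backup.snap filename from the earlier comment as an example:

```shell
# Inspect a saved snapshot to see which message types dominate it;
# the output includes a count and size per type (e.g. Register).
consul snapshot inspect backup.snap
```

Comparing the Register type's count against the number of services you expect to be live should confirm whether stale workflow pod registrations are the bulk of the snapshot.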
In terms of deregistering the completed pods, we are looking at https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/.
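As a sketch of the TTL-after-finished approach from that link, a Kubernetes Job can be given ttlSecondsAfterFinished so the TTL controller deletes the Job and its pods shortly after completion, rather than leaving them to linger in service discovery. The Job name, image, and TTL value below are hypothetical placeholders:

```yaml
# Illustrative Job manifest: the TTL-after-finished controller deletes
# this Job (and its pods) 300 seconds after it finishes.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-workflow-step   # hypothetical name
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: busybox
          command: ["sh", "-c", "echo done"]
```

Argo Workflows also exposes its own workflow-level TTL settings, which may be the more natural place to configure this if the pods are created by Argo rather than by plain Jobs.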