hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul storing complete Register history in snapshot #13190

Open bwmetcalf opened 2 years ago

bwmetcalf commented 2 years ago

Overview of the Issue

We are running consul 1.12.0.

In one of our environments we are using Argo Workflows to process data. Argo spins up workflow pods to complete a task and then they exit. These workflows can be spun up rapidly (multiple times per minute), which appears to greatly increase the size of the Register type in the Consul snapshot. The result is large snapshots that take several minutes to restore when a Consul server pod is restarted. While restoring the snapshot inspected below, the pod uses almost 9GB of memory and gets OOMKilled unless we set a very high or null memory limit. Additionally, because of the time the restore takes, even when we don't set a memory limit on the Consul server, the health check probes fail and restart the pod.

For example, from an environment using argo workflows:

$ consul snapshot inspect state.bin|grep Register
 Register                    38698      55.2MB
...
 Total                                  69MB

and we see thousands of these pods remain registered in Consul even though they no longer exist, which is causing the Register type to grow with each snapshot.
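
As a rough way to confirm the stale registrations from the catalog (a minimal sketch; the address and service name below are placeholders for our setup, not the real values):

$ # list everything the catalog still thinks is registered
$ consul catalog services
$ # count instances of one of the workflow services (name is a placeholder)
$ curl -s http://localhost:8500/v1/catalog/service/argo-workflow-task | jq length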

The primary question is: why does restoring a 69MB snapshot use almost 9GB of memory?

In terms of deregistering the completed pods, we are looking at https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/.
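
For reference, a minimal sketch of the TTL-after-finished approach for a plain Job (the name, image, and TTL value are placeholders, not our actual workflow spec; Argo Workflows has an equivalent ttlStrategy setting, if I recall correctly):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-workflow-step        # placeholder name
spec:
  # delete the Job and its pods five minutes after it finishes
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: busybox             # placeholder image
          command: ["sh", "-c", "echo processing"]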

Amier3 commented 2 years ago

Hey @bwmetcalf

Apologies for the late response; it seems like this fell through the cracks. Are you still experiencing this? It definitely isn't normal behavior.

bobertrublik commented 2 years ago

I'm facing the same problem. Exec'ing into the container and running consul snapshot save backup.snap immediately gets the container OOMKilled, with the event below. I'm using the latest Helm chart, version 0.45.0.

2m46s Warning Unhealthy pod/consul-consul-server-1 Readiness probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task 4b135353c6edfe027d976c8b1aa1cbf685d2993f9b53d385c53615ef430f8af0 not found: not found
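
For anyone trying to reproduce, this is roughly how I trigger it and check the result (pod name is from my deployment; adjust the namespace and name for yours):

$ kubectl exec -it consul-consul-server-1 -- consul snapshot save backup.snap
$ # confirm the OOMKill and the probe failures
$ kubectl describe pod consul-consul-server-1 | grep -A3 "Last State"
$ kubectl get events --field-selector involvedObject.name=consul-consul-server-1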

huikang commented 2 years ago

Hi, @bobertrublik , I'd like to reproduce this behavior. Could you estimate how frequently services are registered/deregistered in the cluster? Thanks.

huikang commented 1 year ago

@bwmetcalf , you may consider increasing the value of raft_snapshot_interval to reduce how frequently snapshots are written.
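
For example, with the Helm chart this could be passed through server.extraConfig (a sketch; the interval and threshold values here are illustrative, not recommendations):

server:
  extraConfig: |
    {
      "raft_snapshot_interval": "120s",
      "raft_snapshot_threshold": 16384
    }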