hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

High CPU utilization after consul client agent upgrade from 1.10.1 -> 1.14.3 #17001

Open sri4kanne opened 1 year ago

sri4kanne commented 1 year ago


Overview of the Issue

While upgrading our Consul client agents from 1.10.1 to 1.14.3, we noticed a significant increase in CPU consumption by both consul and consul-template (v0.29.0). After some analysis, we traced the cause to the interaction between consul and consul-template when rendering the templates below. (Some of the template calls are cross-DC, and we have 3 clusters WAN-joined.)

templates_reference.txt
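For context on what "interaction" means here: consul-template watches each template dependency with a Consul blocking query against the local agent's HTTP API, and a `dc` parameter makes that query target another datacenter over the WAN. The sketch below illustrates that loop for a single health query. It is not consul-template's actual implementation, and the agent address, service name (`web`), and datacenter (`dc2`) are hypothetical placeholders, not values taken from templates_reference.txt.

```python
import json
import urllib.parse
import urllib.request

AGENT = "http://127.0.0.1:8500"  # assumption: default local agent HTTP address


def build_url(service, dc=None, index=0, wait="5m"):
    """Build a blocking health query for `service`, optionally in another datacenter."""
    params = {"passing": "true", "wait": wait}
    if index > 0:
        params["index"] = str(index)  # block until the data changes past this index
    if dc:
        params["dc"] = dc  # cross-DC query, answered by the remote datacenter
    query = urllib.parse.urlencode(params)
    return f"{AGENT}/v1/health/service/{urllib.parse.quote(service)}?{query}"


def next_index(old, header_value):
    """Return the index for the next request, resetting to 0 if it went backwards."""
    new = int(header_value)
    return 0 if new < old else new


def watch(service, dc=None, rounds=3):
    """Run a few rounds of the blocking-query loop and print how many instances pass."""
    index = 0
    for _ in range(rounds):
        with urllib.request.urlopen(build_url(service, dc, index)) as resp:
            entries = json.load(resp)
            index = next_index(index, resp.headers["X-Consul-Index"])
        print(f"index={index} passing={len(entries)}")


if __name__ == "__main__":
    watch("web", dc="dc2")  # hypothetical service and datacenter names
```

One way to probe the regression, under these assumptions, would be to run this loop against a 1.10.1 agent and a 1.14.3 agent and compare how often each round returns for the same cross-DC dependency.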

Reproduction Steps


  1. Create a cluster with the Consul client agent running 1.10.1 and consul-template v0.29.0.
  2. Ensure the cluster is WAN-joined with another cluster, and use templates similar to the ones above.
  3. Compare CPU utilization between 1.10.1 and 1.14.3.
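Step 3 can be done with any monitoring tool. As a minimal sketch, assuming Linux hosts, the script below samples a process's CPU time from `/proc/<pid>/stat` twice and reports the average CPU percentage over the interval; run it against the consul and consul-template PIDs on a 1.10.1 node and on a 1.14.3 node. The 10-second interval is an arbitrary choice.

```python
import os
import sys
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces, so split after the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state); utime is field 14 and stime is field 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, wall_seconds, hz):
    """CPU usage over an interval; values above 100 mean more than one core was busy."""
    return (ticks_after - ticks_before) / hz / wall_seconds * 100


def sample(pid):
    """Read the current cumulative CPU ticks for `pid`."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_cpu_ticks(f.read())


if __name__ == "__main__":
    pid, interval = int(sys.argv[1]), 10.0
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    t0, w0 = sample(pid), time.monotonic()
    time.sleep(interval)
    t1, w1 = sample(pid), time.monotonic()
    print(f"pid {pid}: {cpu_percent(t0, t1, w1 - w0, hz):.1f}% CPU")
```

Reporting per-process percentages rather than host totals keeps the consul and consul-template contributions separate, which matches how the spike was narrowed down in this thread.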

For reference, an image of the performance impact was attached to the issue. Note that these are physical machines with 20 cores and 2 CPUs, so this is a pretty significant bump.

We are also looking into the templates themselves to see if we can improve them. But it would be great to understand why we are seeing such an increase in resource utilization after the upgrade. Please let me know if you need any more details to reproduce the issue.

huikang commented 1 year ago

@sri4kanne, does this happen only on clusters with consul-template?

sri4kanne commented 1 year ago

@huikang, we use consul-template in all 5 clusters, and most of our agents/nodes run both consul and consul-template. We did notice that once consul-template was stopped on a node, consul's CPU utilization dropped as well. This is how we were able to narrow the CPU spike in both the consul and consul-template processes down to certain templates.

Hillkorn commented 1 week ago

We've seen the same on our systems, but as soon as we upgraded to 1.15 the CPU usage went down again. It seems to be something in 1.14 that got fixed in 1.15.