hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul consumes tens of GB of RAM for no reason #16290

Open · AndreiPashkin opened 1 year ago

AndreiPashkin commented 1 year ago

Overview of the Issue

We use Consul in single-node mode for distributed locks and for service discovery in our app. Service discovery is used to connect our application environments with our monitoring. The Consul instances connect with each other over WAN.
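
Roughly, the usage pattern looks like this (simplified sketch; the lock prefix, service definition, and WAN address below are placeholders, not our real configuration):

```sh
# Single-node server, joined to the other environments over WAN
consul agent -server -bootstrap-expect=1 -data-dir=/var/lib/consul &
consul join -wan other-env.example.com

# Distributed locks: run a job while holding a lock stored under a KV prefix
consul lock locks/my-job /usr/local/bin/my-job

# Service discovery: register a service so monitoring can discover it
cat > web.json <<'EOF'
{"service": {"name": "web", "port": 8080}}
EOF
consul services register web.json
```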

After startup, Consul starts consuming memory very quickly; memory usage goes over 30GB and quickly overwhelms our server. What I've found is that repeated calls to consul info show that the number of goroutines increases rapidly along with memory usage.
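
Roughly, I've been sampling it like this (default agent HTTP address assumed; the interval is arbitrary):

```sh
# Sample the agent's goroutine count once a minute
while true; do
  printf '%s ' "$(date -u +%FT%TZ)"
  consul info | grep goroutines
  sleep 60
done

# The same figure is also exposed by the agent's metrics endpoint
# (gauge consul.runtime.num_goroutines)
curl -s http://127.0.0.1:8500/v1/agent/metrics
```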

I'm also attaching logs.

I can provide consul debug output at the maintainers' request.
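
If it helps, I can capture it along these lines (duration, interval, and output path are arbitrary):

```sh
# Capture a debug bundle (metrics, logs, host info, pprof profiles) from the local agent
# pprof data may require enable_debug = true on the agent
consul debug -duration=5m -interval=30s -output=consul-debug-capture
```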

Possibly related issues: #12564, #9076, #12288, #3111

Reproduction Steps

So far I haven't figured out how to reproduce it in an isolated environment.

Consul info for both Client and Server

Server info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 1
build:
    prerelease =
    revision = 0e046bbb
    version = 1.13.2
    version_metadata =
consul:
    acl = disabled
    bootstrap = true
    known_datacenters = 4
    leader = true
    leader_addr = 172.22.0.2:8300
    server = true
raft:
    applied_index = 14
    commit_index = 14
    fsm_pending = 0
    last_contact = 0
    last_log_index = 14
    last_log_term = 2
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:6a5bc7e2-7c9b-bcdd-23a8-a4f0d35dc3d2 Address:172.22.0.2:8300}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 2
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 642723
    max_procs = 8
    os = linux
    version = go1.18.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 1
    health_score = 4
    intent_queue = 0
    left = 0
    member_time = 182
    members = 5
    query_queue = 0
    query_time = 1
/ # consul ^C
/ # consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 1
build:
    prerelease =
    revision = 0e046bbb
    version = 1.13.2
    version_metadata =
consul:
    acl = disabled
    bootstrap = true
    known_datacenters = 4
    leader = true
    leader_addr = 172.22.0.2:8300
    server = true
raft:
    applied_index = 15
    commit_index = 15
    fsm_pending = 0
    last_contact = 0
    last_log_index = 15
    last_log_term = 2
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:6a5bc7e2-7c9b-bcdd-23a8-a4f0d35dc3d2 Address:172.22.0.2:8300}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 2
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 818620
    max_procs = 8
    os = linux
    version = go1.18.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 1
    health_score = 5
    intent_queue = 0
    left = 0
    member_time = 182
    members = 5
    query_queue = 0
    query_time = 1
/ # consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 1
build:
    prerelease =
    revision = 0e046bbb
    version = 1.13.2
    version_metadata =
consul:
    acl = disabled
    bootstrap = true
    known_datacenters = 4
    leader = true
    leader_addr = 172.22.0.2:8300
    server = true
raft:
    applied_index = 16
    commit_index = 16
    fsm_pending = 0
    last_contact = 0
    last_log_index = 16
    last_log_term = 2
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:6a5bc7e2-7c9b-bcdd-23a8-a4f0d35dc3d2 Address:172.22.0.2:8300}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 2
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 1147773
    max_procs = 8
    os = linux
    version = go1.18.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 2
    health_score = 4
    intent_queue = 0
    left = 0
    member_time = 182
    members = 5
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

Ubuntu 20.04

Log Fragments

https://gist.github.com/AndreiPashkin/0a95cdcb5e349c881ff4ee94af5f7b15

Version

```
# consul version
Consul v1.13.2
Revision 0e046bbb
Build Date 2022-09-20T20:30:07Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
```

huikang commented 1 year ago

@AndreiPashkin, could you provide more info about how the distributed locks and service-discovery queries are used in the cluster, to help reproduce the issue? Thanks.

PavelYadrov commented 1 year ago

Hello, we've faced a similar problem: Consul servers gradually consume all of their dedicated RAM. After a reboot they work normally for 3-4 days. We have raised the resources for Consul 3 times over the last month.

Consul has been working fine for the last six days. The total load decreased and Consul stopped consuming all of the dedicated RAM.

I've tried to analyze it with consul-snapshot-tool, but there was nothing special, just as in the related issue: https://github.com/hashicorp/consul/issues/5327#issuecomment-469546415
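
For anyone who wants to repeat the check, the built-in snapshot commands give a similar summary (output path is arbitrary):

```sh
# Save a raft snapshot from the server and summarize its contents
consul snapshot save backup.snap
consul snapshot inspect backup.snap
```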

PavelYadrov commented 1 year ago

Hello, we've gathered some metrics; hope they'll help with the analysis: consul-agent-metrics.txt
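
For anyone collecting the same data, the agent's telemetry endpoint can be scraped like this (default HTTP address assumed; the Prometheus format only works if prometheus_retention_time is set in the telemetry config):

```sh
# JSON-formatted runtime metrics from the local agent
curl -s http://127.0.0.1:8500/v1/agent/metrics

# Prometheus text format (requires prometheus_retention_time in the telemetry config)
curl -s 'http://127.0.0.1:8500/v1/agent/metrics?format=prometheus'
```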

AndreiPashkin commented 1 year ago

> @AndreiPashkin, could you provide more info about how the distributed locks and service-discovery queries are used in the cluster, to help reproduce the issue? Thanks.

@huikang, I've captured logs using consul debug while the issue was happening, and I can post them. I also think the issue is still reproducible, so I can collect additional info, but I need to know what specifically you need.