Open AndreiPashkin opened 1 year ago
@AndreiPashkin, could you provide more information about how the distributed locks and service-discovery queries run in the cluster, to help reproduce the issue? Thanks.
Hello, we've faced a similar problem: the Consul servers gradually consume all of their dedicated RAM. After that they get rebooted and work normally for 3-4 days. Resources for Consul have been raised 3 times in the last month.
Consul has been working fine for the last six days. The total load decreased and Consul stopped consuming all of its dedicated RAM.
I've tried to analyze it with consul-snapshot-tool, but there was nothing special as in related issue - https://github.com/hashicorp/consul/issues/5327#issuecomment-469546415
Hello, we've gathered some metrics; we hope they help with the analysis: consul-agent-metrics.txt
@huikang, I've captured logs using `consul debug` while the issue was happening, and I can post them. I also think the issue is still reproducible and I can collect additional info, but I need to know what specifically you need.
Overview of the Issue
We use Consul in single-node mode for distributed locks and for service discovery in our app. Service discovery is used to connect our application environments with our monitoring. The Consul instances connect to each other over WAN.
After startup it starts consuming memory very quickly: memory usage goes over 30GB and quickly overwhelms our server. What I've found is that repeated calls to `consul info` show that the `goroutines` number increases rapidly along with the memory usage. I'm also attaching logs.
I can provide `consul debug` output at the maintainers' request.

Possibly related issues: #12564, #9076, #12288, #3111
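The goroutine growth described above can be tracked by parsing successive `consul info` outputs. Below is a minimal sketch, assuming the `section:` / `key = value` layout shown in the server info in this report; `parse_consul_info`, `goroutine_count`, and `goroutine_deltas` are hypothetical helper names and not part of Consul.

```python
from typing import Dict, List


def parse_consul_info(text: str) -> Dict[str, Dict[str, str]]:
    """Parse `consul info`-style text into {section: {key: value}}.

    Assumes section headers look like "runtime:" and entries like
    "goroutines = 123", as in the output pasted in this issue.
    """
    sections: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "=" in line:
            # Entry line; split on the first "=" only, because values
            # such as latest_configuration may contain ":" characters.
            key, _, value = line.partition("=")
            if current is not None:
                sections[current][key.strip()] = value.strip()
        elif line.endswith(":"):
            # Section header, e.g. "runtime:".
            current = line[:-1]
            sections[current] = {}
    return sections


def goroutine_count(text: str) -> int:
    """Return runtime.goroutines from one `consul info` output."""
    return int(parse_consul_info(text)["runtime"]["goroutines"])


def goroutine_deltas(counts: List[int]) -> List[int]:
    """Return the difference between consecutive goroutine samples."""
    return [b - a for a, b in zip(counts, counts[1:])]


# Shortened sample in the assumed layout (values taken from this report).
SAMPLE = "runtime:\n\tarch = amd64\n\tcpu_count = 8\n\tgoroutines = 642723\n"

if __name__ == "__main__":
    print(goroutine_count(SAMPLE))
    # The three goroutine counts reported in the server info of this issue.
    print(goroutine_deltas([642723, 818620, 1147773]))
```

Feeding each captured output through `goroutine_count` and then `goroutine_deltas` gives the growth between samples; the sampling interval is not recorded in this report, so no rate per second can be derived from these numbers alone.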
Reproduction Steps
So far I haven't figured out how to reproduce it in an isolated environment.
Consul info for both Client and Server
Server info
```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease =
	revision = 0e046bbb
	version = 1.13.2
	version_metadata =
consul:
	acl = disabled
	bootstrap = true
	known_datacenters = 4
	leader = true
	leader_addr = 172.22.0.2:8300
	server = true
raft:
	applied_index = 14
	commit_index = 14
	fsm_pending = 0
	last_contact = 0
	last_log_index = 14
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:6a5bc7e2-7c9b-bcdd-23a8-a4f0d35dc3d2 Address:172.22.0.2:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 642723
	max_procs = 8
	os = linux
	version = go1.18.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 1
	health_score = 4
	intent_queue = 0
	left = 0
	member_time = 182
	members = 5
	query_queue = 0
	query_time = 1
/ # consul ^C
/ # consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease =
	revision = 0e046bbb
	version = 1.13.2
	version_metadata =
consul:
	acl = disabled
	bootstrap = true
	known_datacenters = 4
	leader = true
	leader_addr = 172.22.0.2:8300
	server = true
raft:
	applied_index = 15
	commit_index = 15
	fsm_pending = 0
	last_contact = 0
	last_log_index = 15
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:6a5bc7e2-7c9b-bcdd-23a8-a4f0d35dc3d2 Address:172.22.0.2:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 818620
	max_procs = 8
	os = linux
	version = go1.18.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 1
	health_score = 5
	intent_queue = 0
	left = 0
	member_time = 182
	members = 5
	query_queue = 0
	query_time = 1
/ # consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease =
	revision = 0e046bbb
	version = 1.13.2
	version_metadata =
consul:
	acl = disabled
	bootstrap = true
	known_datacenters = 4
	leader = true
	leader_addr = 172.22.0.2:8300
	server = true
raft:
	applied_index = 16
	commit_index = 16
	fsm_pending = 0
	last_contact = 0
	last_log_index = 16
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:6a5bc7e2-7c9b-bcdd-23a8-a4f0d35dc3d2 Address:172.22.0.2:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 1147773
	max_procs = 8
	os = linux
	version = go1.18.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 2
	health_score = 4
	intent_queue = 0
	left = 0
	member_time = 182
	members = 5
	query_queue = 0
	query_time = 1
```
Operating system and Environment details
Ubuntu 20.04
Log Fragments
https://gist.github.com/AndreiPashkin/0a95cdcb5e349c881ff4ee94af5f7b15
Version