nvx opened this issue 2 years ago
Slightly tangential to the main issue: the reason Consul was restarted is that the OpenRC health check failed. I'm not sure why it failed at the time, but re-running the health check while Consul was in the broken state gave an HTTP 429 error:
```
# consul config list -kind proxy-defaults
Error listing config entries for kind "proxy-defaults": Unexpected response code: 429 (Your IP is issuing too many concurrent connections, please rate limit your calls)
```
Note that netstat reported only around 200 open sockets held by Consul, and the agent configuration has the following:
```hcl
limits {
  http_max_conns_per_client = 2000
}
```
I'm not sure whether the health check originally failed because the 2000-connection limit was exhausted at the time (and whether that contributed to the failure to restart), or whether the 429 is a red herring caused by whatever state the agent got into after receiving the SIGTERM.
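(For what it's worth, since `http_max_conns_per_client` is enforced per client IP, a rough way to sanity-check this next time, assuming the agent listens on the default HTTP port 8500, would be something like `netstat -tn | grep ':8500' | wc -l` to count established connections; the ~200 sockets above were well under the configured limit.)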
From a quick look at the shutdown code, comparing it against what got logged, it looks like the hang happened somewhere within this block of code:
https://github.com/hashicorp/consul/blob/v1.10.4/agent/agent.go#L1351-L1413
Unfortunately I didn't think to send a SIGABRT to get a stack trace before restarting the process, so I'm not sure exactly where within that block it hung.
Hey @nvx
Thanks for raising this issue, and for diving deep to get us some more info on the condition. That'll help a ton in figuring out what might have caused this. We'll investigate a bit more and let you know if we need any more information; in the meantime, do let us know if this happens again!
Will do. I've only noticed it once so far, so I'm guessing it's a fairly rare edge case. I'll be sure to send a SIGABRT next time to try to get the exact stack trace too.
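(For reference, a Go binary exits with a full dump of every goroutine's stack when it receives SIGABRT or SIGQUIT by default, so something along the lines of `kill -ABRT $(pidof consul)` against the hung process, with the dump landing wherever the service's stderr goes, should pinpoint exactly where shutdown is blocked.)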
Thanks for the detailed report! We had a quick look, and here's what we found.
We did not look at every method called in that block, but one we found that could block is `ServiceManager.Stop`. Our colleague @jfreeland recently ran GCatch on Consul over in https://github.com/hashicorp/consul/issues/9913#issuecomment-1026291022. It found a few things, one of which is this:
```
Call Chain:
AddService (at /home/joey/go/src/github.com/hashicorp/consul/agent/service_manager.go: 95) ->
register (at /home/joey/go/src/github.com/hashicorp/consul/agent/service_manager.go: 158) ->
addServiceInternal (at /home/joey/go/src/github.com/hashicorp/consul/agent/agent.go: 2217) ->
cleanupRegistration (at /home/joey/go/src/github.com/hashicorp/consul/agent/agent.go: 2351) ->
removeServiceSidecars (at /home/joey/go/src/github.com/hashicorp/consul/agent/agent.go: 2452) ->
removeServiceLocked (at /home/joey/go/src/github.com/hashicorp/consul/agent/agent.go: 2395) ->
RemoveService
```
Given that this deadlock can happen in exactly this shutdown scenario, it is possible it is the cause. `cleanupRegistration` will be called when shutdown happens, so if SIGTERM is sent while a service is being registered, the agent can deadlock (because it tries to double-acquire a lock). The goroutine blocked on that lock would prevent `ServiceManager.Stop` from exiting, which could cause the hang during shutdown.
It's still possible the problem is somewhere else, but this seems like a good lead to investigate further.
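To make the failure mode concrete, here is a minimal sketch of the pattern GCatch is flagging; the names are made up for illustration and this is not Consul's actual code. Go's `sync.Mutex` is not reentrant, so a cleanup path that re-acquires a lock its caller already holds blocks forever:

```go
package main

import "sync"

// agent is a stand-in for the real agent; names are illustrative only.
type agent struct {
	mu       sync.Mutex
	services map[string]struct{}
}

// addService takes the lock, and on its failure path calls a cleanup
// helper that tries to take the same lock again.
func (a *agent) addService(name string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.services[name] = struct{}{}
	// Suppose registration is interrupted here (e.g. shutdown began),
	// so we roll back:
	a.removeService(name) // deadlock: mu is already held by this goroutine
}

// removeService also takes the lock, mirroring a cleanup path that
// re-enters the locked removal code.
func (a *agent) removeService(name string) {
	a.mu.Lock() // blocks forever
	defer a.mu.Unlock()
	delete(a.services, name)
}

func main() {
	a := &agent{services: map[string]struct{}{}}
	a.addService("web")
}
```

In this toy program the runtime notices that every goroutine is blocked and aborts with `fatal error: all goroutines are asleep - deadlock!`; in a long-running agent, other goroutines keep the process alive, so the blocked goroutine simply hangs, which matches a shutdown that never completes.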
That could certainly explain it. The Consul agent in question was running on a Nomad agent, so a high churn of service registrations would be expected (along with some other Consul functionality, including KV reads/watches and health checking).
Overview of the Issue
I noticed a node had disappeared from Consul. Having a closer look at the box, it looks like something attempted to restart Consul on the host by sending it a SIGTERM, but Consul did not quit.
Reproduction Steps
Send SIGTERM to a running Consul agent. Note that attempts to reproduce this on the same host, after forcibly killing Consul with SIGKILL, did not reproduce the issue.
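(Concretely, this amounts to something like `kill -TERM $(pidof consul)` and then checking whether the process actually exits; the hung instance stayed alive indefinitely after the signal.)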
Consul info for both Client and Server
Client info
```
agent:
	check_monitors = 0
	check_ttls = 1
	checks = 61
	services = 62
build:
	prerelease =
	revision = 7bbad6fe
	version = 1.10.4
consul:
	acl = enabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 497
	max_procs = 4
	os = linux
	version = go1.16.10
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 962
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 106534
	members = 168
	query_queue = 0
	query_time = 12
```
Operating system and Environment details
Alpine Linux 3.15
Log Fragments
Of interest: after doing a SIGKILL and attempting to reproduce the issue by sending a SIGTERM to a newly restarted Consul client, the shutdown logs from the successful run had an additional log line after "agent: Requesting shutdown" which was not present in the hung Consul's log:
I imagine whatever triggers that line is the source of the hang. Perhaps a race condition, since it doesn't seem easily reproducible.