hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul agent SIGTERM sometimes hangs #12247

Open · nvx opened 2 years ago

nvx commented 2 years ago


Overview of the Issue

I noticed a node had disappeared from Consul. Taking a closer look at the box, it appears an attempt was made to restart Consul on the host by sending it a SIGTERM, but Consul did not quit.

Reproduction Steps

Send a SIGTERM to a running Consul agent. Note that attempts to reproduce this on the same host after forcibly killing Consul with SIGKILL (and restarting it) did not reproduce the issue.

Consul info for both Client and Server

Client info:

```
agent:
	check_monitors = 0
	check_ttls = 1
	checks = 61
	services = 62
build:
	prerelease =
	revision = 7bbad6fe
	version = 1.10.4
consul:
	acl = enabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 497
	max_procs = 4
	os = linux
	version = go1.16.10
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 962
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 106534
	members = 168
	query_queue = 0
	query_time = 12
```

Operating system and Environment details

Alpine Linux 3.15

Log Fragments

```
2022-02-01T15:45:16.619Z [INFO]  agent.client.memberlist.lan: memberlist: Suspect fintstoradb has failed, no acks received
2022-02-01T15:51:25.153Z [INFO]  agent: Caught: signal=terminated
2022-02-01T15:51:25.153Z [INFO]  agent: Gracefully shutting down agent...
2022-02-01T15:51:25.153Z [INFO]  agent.client: client starting leave
2022-02-01T15:51:25.858Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: nomad-worker-3 10.x.x.x
2022-02-01T15:51:29.619Z [INFO]  agent: Graceful exit completed
2022-02-01T15:51:29.619Z [INFO]  agent: Requesting shutdown
2022-02-01T15:51:45.030Z [ERROR] agent.client: RPC failed to server: method=Session.Apply server=10.x.x.243:8300 error="rpc error making call: rpc error making call: apply failed: Missing node registration"
2022-02-01T15:51:45.038Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/create from=172.18.0.5:34908 error="rpc error making call: rpc error making call: apply failed: Missing node registration"
2022-02-01T15:51:45.302Z [ERROR] agent.client: RPC failed to server: method=Session.Apply server=10.x.x.241:8300 error="rpc error making call: rpc error making call: apply failed: Missing node registration"
2022-02-01T15:51:45.302Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/create from=172.18.0.13:37236 error="rpc error making call: rpc error making call: apply failed: Missing node registration"
2022-02-01T15:51:46.404Z [ERROR] agent.client: RPC failed to server: method=Session.Apply server=10.x.x.242:8300 error="rpc error making call: apply failed: Missing node registration"
2022-02-01T15:51:46.404Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/create from=172.18.0.26:35110 error="rpc error making call: apply failed: Missing node registration"
2022-02-01T15:52:03.435Z [ERROR] agent.client: RPC failed to server: method=Session.Apply server=10.x.x.243:8300 error="rpc error making call: rpc error making call: apply failed: Missing node registration"
2022-02-01T15:52:03.435Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/create from=172.18.0.5:34910 error="rpc error making call: rpc error making call: apply failed: Missing node registration"
2022-02-01T15:52:03.904Z [ERROR] agent.client: RPC failed to server: method=Session.Apply server=10.x.x.241:8300 error="rpc error making call: rpc error making call: apply failed: Missing node registration"
2022-02-01T15:52:03.904Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/create from=172.18.0.13:37238 error="rpc error making call: rpc error making call: apply failed: Missing node registration"
2022-02-01T15:52:05.192Z [ERROR] agent.client: RPC failed to server: method=Session.Apply server=10.x.x.242:8300 error="rpc error making call: apply failed: Missing node registration"
2022-02-01T15:52:05.192Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/create from=172.18.0.5:34912 error="rpc error making call: apply failed: Missing node registration"
```

Of interest: after sending a SIGKILL and then attempting to reproduce the issue by sending a SIGTERM to the newly restarted Consul client, the shutdown logs from the successful attempt contained an additional line after "agent: Requesting shutdown" that was not present in the hung Consul's log:

```
2022-02-02T00:20:32.840Z [INFO]  agent.client: shutting down client
```

I imagine whatever triggers that line is the source of the hang. Perhaps a race condition, since it doesn't seem easily reproducible.

nvx commented 2 years ago

Slightly tangential to the main issue: the reason Consul was restarted is that the openrc health check failed. I'm not sure why it failed at the time, but re-running the health check while Consul was in the broken state gave an HTTP 429 error:

```
# consul config list -kind proxy-defaults
Error listing config entries for kind "proxy-defaults": Unexpected response code: 429 (Your IP is issuing too many concurrent connections, please rate limit your calls
)
```

Note that netstat only reported 200-odd sockets open by Consul, and the agent configuration has the following:

```
limits {
  http_max_conns_per_client = 2000
}
```

I'm not sure whether the health check originally failed because the limit of 2000 was exhausted at the time and that contributed to the failure to restart, or whether the 429 error is a red herring caused by whatever state the agent got into after receiving the SIGTERM.
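
As I understand it, that 429 comes from a per-client-IP cap on concurrent connections. Below is a rough sketch of that kind of limit written as a hypothetical HTTP middleware, purely to illustrate the mechanism; it is not how Consul actually enforces `http_max_conns_per_client` (which is applied at the connection level), and every name in it is made up.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"sync"
)

// perClientLimiter is a hypothetical illustration of a per-client-IP
// concurrency cap: once an IP has maxPerClient requests in flight, new
// requests from that IP get HTTP 429 until earlier ones finish.
type perClientLimiter struct {
	mu           sync.Mutex
	inFlight     map[string]int
	maxPerClient int
	next         http.Handler
}

func (l *perClientLimiter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	ip, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		ip = r.RemoteAddr
	}

	l.mu.Lock()
	if l.inFlight[ip] >= l.maxPerClient {
		l.mu.Unlock()
		// Same condition the agent reported: too many concurrent
		// connections from a single client IP.
		http.Error(w, "too many concurrent connections from your IP", http.StatusTooManyRequests)
		return
	}
	l.inFlight[ip]++
	l.mu.Unlock()

	// Release this request's slot when the handler returns.
	defer func() {
		l.mu.Lock()
		l.inFlight[ip]--
		l.mu.Unlock()
	}()
	l.next.ServeHTTP(w, r)
}

func main() {
	limited := &perClientLimiter{
		inFlight:     map[string]int{},
		maxPerClient: 2000, // mirrors the http_max_conns_per_client value above
		next: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "ok")
		}),
	}
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", limited))
}
```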

nvx commented 2 years ago

From a quick look at the shutdown code, compared against what was logged, it looks like the hang happened somewhere within this block of code:

https://github.com/hashicorp/consul/blob/v1.10.4/agent/agent.go#L1351-L1413

Unfortunately I didn't think of sending a SIGABRT to get a stack trace before restarting the process, so I'm not sure exactly where within that block it hung.
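
For next time: a Go binary normally exits with a dump of goroutine stacks when it receives SIGQUIT or SIGABRT, which is exactly what would show where the shutdown is blocked. For completeness, a process can also be made to dump stacks without exiting; here is a minimal generic sketch of that pattern (the SIGUSR1 handler is hypothetical and not a Consul feature, it just illustrates the idea):

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	// Hypothetical pattern: dump every goroutine's stack to stderr when
	// SIGUSR1 arrives, without terminating the process. Handy when a
	// shutdown path hangs and you want to see where it is blocked.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			// debug=2 prints stacks in the same format as an unrecovered panic.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	// Stand-in for the process's real work.
	for {
		time.Sleep(time.Second)
	}
}
```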

Amier3 commented 2 years ago

Hey @nvx

Thanks for raising this issue, and for diving deep to get us more info on the condition. That'll help a ton in figuring out what might have caused this. We'll investigate this a bit more and let you know if we need any further information; in the meantime, do let us know if this happens again!

nvx commented 2 years ago

Will do. I've only noticed it once so far, so I'm guessing it's a fairly rare edge case. I'll be sure to send a SIGABRT next time to try and get the exact stack trace too.

dnephin commented 2 years ago

Thanks for the detailed report! We had a quick look, and here's what we found.

We did not look at every method called in that block, but one we found that could block is ServiceManager.Stop. Our colleague @jfreeland recently ran GCatch on Consul over in https://github.com/hashicorp/consul/issues/9913#issuecomment-1026291022. It found a few things, one of which is this:

```
Call Chain:
AddService (at /home/joey/go/src/github.com/hashicorp/consul/agent/service_manager.go: 95) ->
register (at /home/joey/go/src/github.com/hashicorp/consul/agent/service_manager.go: 158) ->
addServiceInternal (at /home/joey/go/src/github.com/hashicorp/consul/agent/agent.go: 2217) ->
cleanupRegistration (at /home/joey/go/src/github.com/hashicorp/consul/agent/agent.go: 2351) ->
removeServiceSidecars (at /home/joey/go/src/github.com/hashicorp/consul/agent/agent.go: 2452) ->
removeServiceLocked (at /home/joey/go/src/github.com/hashicorp/consul/agent/agent.go: 2395) ->
RemoveService
```

Given that this deadlock can happen in exactly this shutdown scenario, it is possible that it is the cause. cleanupRegistration is called when shutdown happens, so if a SIGTERM arrives while a service is being registered, the agent can deadlock (it tries to acquire the same lock twice). The goroutine blocked on that lock would prevent ServiceManager.Stop from returning, which could cause the hang during shutdown.
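
To make that concrete, here is a minimal sketch of the shape of the bug: a Go sync.Mutex is not reentrant, so re-acquiring it on the same goroutine blocks forever. The names below are illustrative stand-ins for the agent methods in the call chain, not the real Consul code.

```go
package main

import "sync"

// fakeAgent stands in for the real Agent; stateLock plays the role of
// the lock acquired twice in the reported call chain.
type fakeAgent struct {
	stateLock sync.Mutex
}

// addService holds stateLock while registering; if registration fails
// (for example because the agent is shutting down), it runs cleanup.
func (a *fakeAgent) addService() {
	a.stateLock.Lock()
	defer a.stateLock.Unlock()

	// ... registration fails ...
	a.cleanupRegistration()
}

// cleanupRegistration calls back into a method that takes the same lock.
func (a *fakeAgent) cleanupRegistration() {
	a.removeService()
}

func (a *fakeAgent) removeService() {
	// Second acquisition on the same goroutine: sync.Mutex is not
	// reentrant, so this blocks forever.
	a.stateLock.Lock()
	defer a.stateLock.Unlock()
}

func main() {
	// In this toy program every goroutine ends up blocked, so the runtime
	// aborts with "fatal error: all goroutines are asleep - deadlock!".
	(&fakeAgent{}).addService()
}
```

In a real agent other goroutines keep running, so the runtime's deadlock detector never fires and the blocked goroutine simply hangs, which would look exactly like the missing "shutting down client" line described above.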

It's still possible the problem is somewhere else, but this seems like a good lead to investigate further.

nvx commented 2 years ago

That could certainly explain it. The Consul agent in question was running on a Nomad agent, so a high churn of service registrations would be expected (along with other Consul functionality, including KV reads/watches and health checking).