tantra35 closed this issue 4 years ago
Thanks for reporting this, @tantra35! I'm going to take a look through the logs you've provided and see if I can replicate this.
@tgross It's very hard to reproduce; for example, we can't simulate this on a test stand, only on the production system :-(
Understood... but we'll figure it out!
Just so that I understand the scenario you have here: you're hitting the /client/allocation/:alloc_id/stats HTTP API on the server?
@tgross yes
@tantra35 I took a first pass through the logs you provided. The yamux goroutines that are in OpenStream and AcceptStream are blocking as a backpressure mechanism, so that bit is working as intended. But they're blocking on a large number of goroutines waiting for the yamux Read calls to complete, which are in turn blocked waiting to receive. So that's definitely wrong, but I haven't yet figured out why they're not receiving. It also looks like this could be closely related to a long-standing open issue that you opened here: https://github.com/hashicorp/nomad/issues/4604.
I'm going to assign myself to that issue as well (and I apologize that it was left to languish after you provided so much good info!). But in the meantime do you also see lots of open sockets when this goroutine leak is happening?
@tgross yes the situation is the same
In #4604 you reported that these were mostly sockets in CLOSE_WAIT, and that the problem was mostly occurring when a server was talking cross-region to the client. Our clients initiate the connection to the servers, so if the server has sockets in CLOSE_WAIT, that suggests the client is sending FIN back to the server but the server isn't wrapping up the connection from its end.
I'm trying to see if I can reproduce the behavior but haven't yet done so successfully. Right now that environment looks like the following: I'm watching http://$NOMAD_ADDR/debug/pprof/ and http://$NOMAD_ADDR/debug/pprof/goroutine?debug=1 for goroutine counts, and running this polling loop against the stats endpoint:
#!/bin/bash
while true
do
curl -so /dev/null "$NOMAD_ADDR/v1/client/allocation/$ALLOC_ID/stats"
echo -n "."
sleep 1
done
Then on the client I'm using tc to disrupt the network traffic by inducing packet loss or delay. So far I haven't come up with a scenario that's causing it, but I've got a few more I want to try out tomorrow before I give up on this avenue.
I also think that the suggestion you had in the earlier issue to add the RPC timeout is probably right (but maybe not enough, as you noted), but I want to have a reproduction before I go changing things in Nomad. There's a deadline configuration in yamux we're not using.
One of the complicating factors for my tests yesterday was that if the Nomad client loses connectivity with a server, it'll establish a connection with another server. I've corrected this with a new test setup where I'm running the following script on the client:
#!/bin/bash
# must be run as root
SERVER_IPS=($(curl -s localhost:4646/v1/status/peers | jq -r '.[] | split(":") | .[0]'))
echo "dropping traffic from ${SERVER_IPS[0]} ${SERVER_IPS[1]}"
echo "shaping traffic from ${SERVER_IPS[2]}"
iptables -I INPUT -s ${SERVER_IPS[0]} -j DROP
iptables -I INPUT -s ${SERVER_IPS[1]} -j DROP
# create 3 queue-discipline classes
tc qdisc add dev eth0 root handle 1: prio
# configure these classes, with the low-priority band being set to drop 25% of
# packets with 25% correlation. this should be enough to cause TCP connections
# that we send thru that queue to break and force the Nomad client to try to
# reconnect
tc qdisc add dev eth0 parent 1:1 handle 10: sfq
tc qdisc add dev eth0 parent 1:2 handle 20: sfq
tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 25% 25%
# filter traffic from server 3
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
match ip dst ${SERVER_IPS[2]}/32 flowid 1:3
After verifying that we're seeing roughly 25% packet loss from the last server and that this is the only server the client can connect to, I set my NOMAD_ADDR to the same server (and in a second attempt, a different server so that we exercise the forwarding code paths) and tried my curl loop again. The HTTP calls to the stats endpoint do slow down and sometimes time out in that situation. But the number of goroutines remains steady and there are only 2 stream Read calls happening at a time.
I've even ramped up to 50% and 75% packet loss to see if that could recreate the situation, but at higher packet loss levels the Nomad servers just mark the node as down and we get "no path to node". I eventually start getting lingering FIN_WAIT1 and FIN_WAIT2 sockets at that level of degradation, but not very many.
I'm going to leave this to soak for a couple hours to see if anything changes, but at this point I think my hypothesis that this has to do with simple network degradation is incorrect -- or at least more complex than that.
@tantra35 it might help if you can answer a few questions: for the sockets you see in CLOSE_WAIT, are they sockets that were opened with the clients?

@tgross We're running inside the AWS cloud and don't see any network problems most of the time, and we use Ubuntu 16.04 LTS as our main distro, with minimal tuning of network parameters.
Also, I think that the real problem is that Nomad doesn't use any timeouts in yamux (we have a patched version where we implement them; it's not ideal because Nomad has blocking APIs, so sometimes the timeouts have to be quite large: 5 minutes, the maximum blocking time). But yamux doesn't allow setting a timeout on the connection phase, only on read/write (on the patched version we observe that goroutines begin to leak in the yamux connection phase, OpenStream). I can demonstrate this with the following goroutine lists and some stack traces:
OpenStreamHung.txt 1.txt 1.1.txt 1.2.txt 1.2.trace1.txt 1.2.trace2.txt 1.2.trace3.txt 1.2.trace4.txt 2.txt
I got this on Nomad version 0.8.6 (but didn't find time to post a bug report; the patched version was also a stopping factor).
I agree that the lack of timeouts is one issue. The blocked goroutines should all continue once we have some timeouts in place, but I want to have a clear mechanism for the bug so that we're not just pushing the problem to some other place in the stack.
When you're talking about a patched version, are you referring to the patch from #4604, or is there some other patch you're using here?
sc.stream.SetDeadline(time.Now().Add(10 * time.Second))
For the file 2.txt you've sent, is this from a particular client node that's been demonstrating this problem? Does this happen with any other client node? That EC2 instance has impressive uptime!
In #4604 you mentioned there was another region involved and that you weren't seeing the problem with same-region metrics collection. So the problem may be part of how we're handling RPC forwarding. If the diagram below describes this topology correctly, are you seeing the goroutine increase on Server A or Server B?
+-------------------------+     +------------------------+
| region A                |     | region B               |
|                         |     |                        |
| metrics --> server A ---+-----+--> server B --> client |
| scraper                 |     |                        |
+-------------------------+     +------------------------+
@tgross Oh, it's been a long time since then, and we abandoned that scheme. For now we don't do any cross-region queries for statistics (metrics); we have a collecting daemon in each region. That's also why we removed the patch with timeouts: we hoped that the cross-region queries were the real reason for the hung connections, but as practice shows, we were wrong.

As for the patch with timeouts in #4604, you're absolutely right, but we extended that timeout to 5 minutes because of the blocking queries that Nomad calls can make (https://www.nomadproject.io/api/index.html#blocking-queries), whose maximum timeout is 5 minutes.
As for the 2.txt file in my previous post: I posted it for the company (it's my mistake), and it is irrelevant; it demonstrates other strange behavior in Nomad 0.8.6 which is no longer reproduced on 0.9.6.
@tantra35 I'd still like to see if I can reproduce this. One of the avenues I'm investigating is whether the number of allocations involved is very high and that's exposing some kind of deadlocking condition we'd normally never hit in development environments. You mentioned this only happens in production -- roughly how many allocations per node is this? Is the HTTP client you're using to hit the API querying all the allocations at once, or sequentially?
@tgross Hmm, I don't think that this is the key to the problem, because in our case a node launches 5-10 allocations on average. Also I must say that this happens at very strange times: there can be a month where we don't see any problems, but sometimes it can happen for a few days continuously.
@tgross Hello, we think that the real problem in our case is exactly a leader switch, and the goroutine increase is a consequence. Look at the following screenshots.

As you can see, at 15:44 Nomad switched leader, and at this point something went wrong: ip-172-27-5-61, which became leader for a short time, quickly lost leadership but didn't return to the raft follower state. So at this point the cluster had 2 leaders, one of which was wrong; in this case that was ip-172-27-5-61. Due to our dynamic environment we don't have a constant number of worker nodes (they start up and shut down with autoscaling). And since ip-172-27-5-61 thinks it is leader, it doesn't update its live-node list, and as a result it connects to dead nodes; at this moment we see constantly increasing TCP socket usage on ip-172-27-5-61. This increase can only be explained by connections to dead nodes (another bug in Nomad, as far as I can conclude), given the yamux streaming model (only a few TCP connections). I haven't yet found the place in the Nomad code base where these TCP connections are made; anyway, it looks very strange that TCP sockets keep increasing.
Also I attached the eval broker stats at the buggy moment. There you can also see that at the moment of the switch, ip-172-27-5-61 stopped functioning properly until it restarted.
So there may be some race condition at the moment when allocation stats are requested from a server node and a leadership switch happens, or this may simply be a fast leadership loss which isn't correctly handled by Nomad.
Also, I provide some statistical info from the logs at the buggy moment. There we can see that the new leader (ip-172-27-3-86) complains about the health of the buggy node (ip-172-27-5-61), with the following messages:
2019-12-16T12:59:59.229Z [WARN ] nomad.stats_fetcher: failed retrieving server health; last request still outstanding: server=ip-172-27-5-61.aav01
and the following log on the buggy node:
"@timestamp",MESSAGE
"December 16th 2019, 12:59:17.484"," 2019-12-16T12:59:17.484Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/b251fbc6-62ca-3587-82df-790ef166832a/stats?node_id=657a4dfb-77be-b08e-f933-6eff4a2c49ef&namespace=default error=""Unknown allocation ""b251fbc6-62ca-3587-82df-790ef166832a"""" code=404"
"December 16th 2019, 12:57:41.418"," 2019-12-16T12:57:41.417Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509781"" index=1509781"
"December 16th 2019, 12:57:23.161"," 2019-12-16T12:57:23.161Z [ERROR] http: request failed: method=GET path=/v1/allocation/b251fbc6-62ca-3587-82df-790ef166832a?region=aav01&stale= error=""alloc not found"" code=404"
"December 16th 2019, 12:57:16.415"," 2019-12-16T12:57:16.415Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509781"" index=1509781"
"December 16th 2019, 12:57:10.413"," 2019-12-16T12:57:10.413Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509781"" index=1509781"
"December 16th 2019, 12:56:21.816"," 2019-12-16T12:56:21.816Z [ERROR] http: request failed: method=GET path=/v1/allocation/b251fbc6-62ca-3587-82df-790ef166832a?region=aav01&stale= error=""alloc not found"" code=404"
"December 16th 2019, 12:56:11.720"," 2019-12-16T12:56:11.719Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509779"" index=1509779"
"December 16th 2019, 12:56:05.717"," 2019-12-16T12:56:05.717Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509779"" index=1509779"
"December 16th 2019, 12:55:40.710"," 2019-12-16T12:55:40.710Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509778"" index=1509778"
"December 16th 2019, 12:55:34.708"," 2019-12-16T12:55:34.708Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509778"" index=1509778"
"December 16th 2019, 12:51:43.184"," 2019-12-16T12:51:43.184Z [ERROR] http: request failed: method=GET path=/v1/allocation/b251fbc6-62ca-3587-82df-790ef166832a?region=aav01&stale= error=""alloc not found"" code=404"
"December 16th 2019, 12:50:45.991"," 2019-12-16T12:50:45.990Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509756"" index=1509756"
"December 16th 2019, 12:50:30.466"," 2019-12-16T12:50:30.465Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509756"" index=1509756"
"December 16th 2019, 12:50:20.988"," 2019-12-16T12:50:20.988Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509756"" index=1509756"
"December 16th 2019, 12:50:14.986"," 2019-12-16T12:50:14.985Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509756"" index=1509756"
"December 16th 2019, 12:48:18.133"," 2019-12-16T12:48:18.133Z [ERROR] nomad.vault: token revocation errored: error=""timed out enqueuing operation"""
"December 16th 2019, 12:46:58.554"," 2019-12-16T12:46:58.554Z [ERROR] worker: error invoking scheduler: error=""failed to process evaluation: rpc error: Permission denied"""
"December 16th 2019, 12:46:58.554"," 2019-12-16T12:46:58.554Z [ERROR] core.sched: batch job reap failed: error=""rpc error: Permission denied"""
"December 16th 2019, 12:46:57.551"," 2019-12-16T12:46:57.550Z [ERROR] worker: error invoking scheduler: error=""failed to process evaluation: rpc error: Permission denied"""
"December 16th 2019, 12:46:57.550"," 2019-12-16T12:46:57.550Z [ERROR] core.sched: batch job reap failed: error=""rpc error: Permission denied"""
"December 16th 2019, 12:45:59.478"," 2019-12-16T12:45:59.477Z [ERROR] nomad: failed to establish leadership: error=""force run of periodic job ""<ns: \""default\"", id: \""impala-copy\"">"" failed: timed out enqueuing operation"""
"December 16th 2019, 12:45:59.478"," 2019-12-16T12:45:59.477Z [ERROR] nomad.periodic: force run of periodic job failed: job=""<ns: ""default"", id: ""impala-copy"">"" error=""timed out enqueuing operation"""
"December 16th 2019, 12:45:59.477"," 2019-12-16T12:45:59.477Z [ERROR] nomad.periodic: failed to dispatch job: job=""<ns: ""default"", id: ""impala-copy"">"" error=""timed out enqueuing operation"""
"December 16th 2019, 12:45:29.477","Code: 400. Errors:"
"December 16th 2019, 12:45:29.477","* 1 error occurred:"
"December 16th 2019, 12:45:29.477","* invalid accessor"""
"December 16th 2019, 12:45:29.476"," 2019-12-16T12:45:29.476Z [WARN ] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error=""failed to revoke token (alloc: ""f95b0d03-af4e-099d-85ee-34463cd1c2cf"", node: ""2ffb6c3c-48e4-086c-070d-fe549dcdc087"", task: ""mongo-restore-stage""): Error making API request."
"December 16th 2019, 12:45:29.476","URL: POST https://vault.service.consul:8200/v1/auth/token/revoke-accessor"
"December 16th 2019, 12:45:27.480"," 2019-12-16T12:45:27.479Z [ERROR] http: request failed: method=GET path=/v1/job/impala-copy/periodic-1576500325?region=aav01&stale= error=""job not found"" code=404"
"December 16th 2019, 12:45:27.479"," 2019-12-16T12:45:27.479Z [ERROR] http: request failed: method=GET path=/v1/allocation/b251fbc6-62ca-3587-82df-790ef166832a?region=aav01&stale= error=""alloc not found"" code=404"
"December 16th 2019, 12:45:22.116"," 2019-12-16T12:45:22.115Z [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 172.27.3.86:4647}"
"December 16th 2019, 12:45:22.116"," 2019-12-16T12:45:22.115Z [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 7062e197-58e6-d20e-6122-846f2b198254 172.27.0.212:4647}"
"December 16th 2019, 12:45:22.115"," 2019-12-16T12:45:22.115Z [WARN ] nomad: raft: Failed to contact 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 in 500.288751ms"
"December 16th 2019, 12:45:22.115"," 2019-12-16T12:45:22.115Z [WARN ] nomad: raft: Failed to contact 7062e197-58e6-d20e-6122-846f2b198254 in 531.995222ms"
"December 16th 2019, 12:45:22.115"," 2019-12-16T12:45:22.115Z [WARN ] nomad: raft: Failed to contact quorum of nodes, stepping down"
"December 16th 2019, 12:45:22.083"," 2019-12-16T12:45:22.083Z [WARN ] nomad: raft: Failed to contact 7062e197-58e6-d20e-6122-846f2b198254 in 500.264924ms"
"December 16th 2019, 12:45:20.930"," 2019-12-16T12:45:20.930Z [INFO ] nomad: raft: pipelining replication to peer {Voter 7062e197-58e6-d20e-6122-846f2b198254 172.27.0.212:4647}"
"December 16th 2019, 12:45:20.929"," 2019-12-16T12:45:20.929Z [INFO ] nomad: raft: pipelining replication to peer {Voter 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 172.27.3.86:4647}"
"December 16th 2019, 12:45:20.927"," 2019-12-16T12:45:20.926Z [INFO ] nomad: raft: Election won. Tally: 2"
"December 16th 2019, 12:45:20.927"," 2019-12-16T12:45:20.926Z [INFO ] nomad: raft: Added peer 7062e197-58e6-d20e-6122-846f2b198254, starting replication"
"December 16th 2019, 12:45:20.927"," 2019-12-16T12:45:20.926Z [INFO ] nomad: raft: Added peer 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9, starting replication"
"December 16th 2019, 12:45:20.927"," 2019-12-16T12:45:20.926Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Leader] entering Leader state"
"December 16th 2019, 12:45:20.884"," 2019-12-16T12:45:20.883Z [INFO ] nomad: raft: Duplicate RequestVote for same term: 10116"
"December 16th 2019, 12:45:20.186"," 2019-12-16T12:45:20.185Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Candidate] entering Candidate state in term 10116"
"December 16th 2019, 12:45:20.186"," 2019-12-16T12:45:20.185Z [WARN ] nomad: raft: Heartbeat timeout from """" reached, starting election"
"December 16th 2019, 12:45:19.175"," 2019-12-16T12:45:19.175Z [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 7062e197-58e6-d20e-6122-846f2b198254 172.27.0.212:4647}"
"December 16th 2019, 12:45:19.175"," 2019-12-16T12:45:19.175Z [WARN ] nomad: raft: Failed to contact 7062e197-58e6-d20e-6122-846f2b198254 in 500.071135ms"
"December 16th 2019, 12:45:19.175"," 2019-12-16T12:45:19.175Z [WARN ] nomad: raft: Failed to contact quorum of nodes, stepping down"
"December 16th 2019, 12:45:19.175"," 2019-12-16T12:45:19.175Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Follower] entering Follower state (Leader: """")"
"December 16th 2019, 12:45:19.175"," 2019-12-16T12:45:19.175Z [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 172.27.3.86:4647}"
"December 16th 2019, 12:45:19.174"," 2019-12-16T12:45:19.174Z [WARN ] nomad: raft: Failed to contact 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 in 500.252525ms"
"December 16th 2019, 12:45:18.658"," 2019-12-16T12:45:18.658Z [ERROR] worker: failed to dequeue evaluation: error=""eval broker disabled"""
"December 16th 2019, 12:45:18.658"," 2019-12-16T12:45:18.658Z [ERROR] worker: failed to dequeue evaluation: error=""eval broker disabled"""
"December 16th 2019, 12:45:18.653"," 2019-12-16T12:45:18.652Z [INFO ] nomad: raft: pipelining replication to peer {Voter 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 172.27.3.86:4647}"
"December 16th 2019, 12:45:18.651"," 2019-12-16T12:45:18.650Z [INFO ] nomad: raft: pipelining replication to peer {Voter 7062e197-58e6-d20e-6122-846f2b198254 172.27.0.212:4647}"
"December 16th 2019, 12:45:18.649"," 2019-12-16T12:45:18.649Z [INFO ] nomad: cluster leadership acquired"
"December 16th 2019, 12:45:18.649"," 2019-12-16T12:45:18.648Z [INFO ] nomad: raft: Added peer 7062e197-58e6-d20e-6122-846f2b198254, starting replication"
"December 16th 2019, 12:45:18.649"," 2019-12-16T12:45:18.648Z [INFO ] nomad: raft: Election won. Tally: 2"
"December 16th 2019, 12:45:18.649"," 2019-12-16T12:45:18.648Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Leader] entering Leader state"
"December 16th 2019, 12:45:18.649"," 2019-12-16T12:45:18.648Z [INFO ] nomad: raft: Added peer 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9, starting replication"
"December 16th 2019, 12:45:18.638"," 2019-12-16T12:45:18.637Z [ERROR] worker: failed to dequeue evaluation: error=""rpc error: eval broker disabled"""
"December 16th 2019, 12:45:18.638"," 2019-12-16T12:45:18.637Z [ERROR] worker: failed to dequeue evaluation: error=""rpc error: eval broker disabled"""
"December 16th 2019, 12:45:17.480"," 2019-12-16T12:45:17.479Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Candidate] entering Candidate state in term 10115"
"December 16th 2019, 12:45:17.479"," 2019-12-16T12:45:17.479Z [WARN ] nomad: raft: Election timeout reached, restarting election"
"December 16th 2019, 12:45:16.282"," 2019-12-16T12:45:16.281Z [WARN ] nomad: raft: Election timeout reached, restarting election"
"December 16th 2019, 12:45:16.282"," 2019-12-16T12:45:16.282Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Candidate] entering Candidate state in term 10114"
"December 16th 2019, 12:45:14.932"," 2019-12-16T12:45:14.931Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Candidate] entering Candidate state in term 10113"
"December 16th 2019, 12:45:14.932"," 2019-12-16T12:45:14.931Z [WARN ] nomad: raft: Heartbeat timeout from ""172.27.0.212:4647"" reached, starting election"
"December 16th 2019, 12:45:09.082"," 2019-12-16T12:45:09.082Z [ERROR] core.sched: batch job reap failed: error=""rpc error: Permission denied"""
"December 16th 2019, 12:45:09.082"," 2019-12-16T12:45:09.082Z [ERROR] worker: error invoking scheduler: error=""failed to process evaluation: rpc error: Permission denied"""
"December 16th 2019, 12:45:08.079"," 2019-12-16T12:45:08.079Z [ERROR] worker: error invoking scheduler: error=""failed to process evaluation: rpc error: Permission denied"""
"December 16th 2019, 12:45:08.079"," 2019-12-16T12:45:08.079Z [ERROR] core.sched: batch job reap failed: error=""rpc error: Permission denied"""
So here the approach that I suggested in https://github.com/hashicorp/nomad/issues/4749 could be applied.
Just reproduced this again; we got socket stats (ss -nt) and found:
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:38246
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:51068
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:36990
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:43642
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:33904
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:41554
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:51460
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:55616
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:55814
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:37494
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:35766
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:47904
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:39322
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:34362
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:36104
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:49902
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53078
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:43852
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:59284
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:47666
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:54228
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:41550
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:60240
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:47136
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:40756
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:51652
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:49424
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:49698
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:50358
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:41432
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:39566
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:58876
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53524
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:38688
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:50508
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:48870
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:43610
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:35488
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:52040
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:52048
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:59254
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:36682
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:34342
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:60586
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:48054
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:59092
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:55362
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53350
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:49828
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:50510
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:42262
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:46204
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:49416
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53282
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53546
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:58590
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:54466
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53510
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53962
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:56566
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:51554
ESTAB 0 0 ::ffff:172.27.5.61:8300 ::ffff:172.27.0.113:38434
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:35716
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:56240
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:38084
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:46098
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:60176
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:39866
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53924
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:39558
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:34628
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:36428
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:56468
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:55304
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53460
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:37754
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:40960
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:51934
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:44364
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:41006
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:47722
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:59832
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:55414
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:49160
ESTAB 0 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:38386
ESTAB 0 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.0.212:38798
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:50972
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53728
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:38264
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:52760
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:34454
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:52726
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:50460
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:41218
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:43300
ESTAB 0 0 ::ffff:172.27.5.61:8300 ::ffff:172.27.0.6:30932
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:45744
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:53640
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:35020
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:51466
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:55266
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:49838
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:43630
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:51916
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:58798
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:45956
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:55448
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:41762
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:36604
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:57096
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:55940
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:34530
CLOSE-WAIT 1 0 ::ffff:172.27.5.61:4647 ::ffff:172.27.3.86:47538
So this is not a communication error with worker nodes, but a problem in the raft interoperation itself.
So the problem described here is more complex than I thought before:

- presence of 2 leaders at the same time (we fought this in 0.8.6 with the patch described in #4749, and it appeared again for us since in 0.9.x we didn't apply those modifications)
That's a pretty serious correctness issue; I'm going to re-open #4749 so we can follow up on that one. Looks like stalebot got overzealous.
- hung yamux connections (we can't yet find the reasons for this)
I don't see an obvious way this would be related to multiple leaders, ~but if we can get into that state there's a whole mess of problems that'll fall out of that because having exactly one leader is one of our core assertions.~ (Edit: turns out that isn't the case, see my comments in https://github.com/hashicorp/nomad/issues/4749#issuecomment-571253655) If there's a way we could replicate the leader issue it might give us more insight.
- the hashicorp team absolutely ignoring reports from the community. The only reason I can imagine for why this happens is that Nomad is not a very high-priority product
I definitely understand your frustration with this. For what it's worth, the Nomad engineering team has been staffed up quite a bit in the last year (especially in the last 6 months). And my particular focus has been around reliability and stability issues. Thanks so much for how much you've been willing to contribute back in extensive bug reports! We'll get there!
Unfortunately it looks like we're missing the original election failover for node 172.27.5.61 in the logs you've provided -- we can see that the node is already in a flapping state and trying to execute leader operations (and failing to do so because it's not the raft leader). It probably doesn't matter how we got into the flapping state in the first place; there's plenty of networking or other reasons why a node might flap and we need to account for that. Like in #4749 we can also see that the node has lost connectivity or has intermittent connectivity to the other two server nodes. It manages to win an election but then times out on heartbeats ~500ms later.
From the output of ss we can see that the buggy node 172.27.5.61 has a bunch of connections in CLOSE_WAIT, and all of them are from 172.27.3.86, the new leader. So the pile-up of goroutines isn't because of connections w/ dead nodes (clients make inbound connections to servers anyways, not the other way around, so even if the server somehow had stale data that wouldn't cause a bunch of outbound connections).
That narrows down the places to look quite a bit in terms of trying to reproduce. While I work on that, if you can reproduce while running with debug mode on, could you dump the output of `curl`ing `/debug/pprof/goroutine?debug=2` to a file?
I've been able to get similar log outputs from test clusters of 3 servers by running the script below on one of the machines. I've also tried adding extra load (by running CPU stress tests on AWS `t2`-class machines, to trigger CPU throttling). I'm then polling `/debug/pprof/goroutine?debug=1` to get the count of goroutines. Leadership flaps and forward progress is slow, but unfortunately this hasn't resulted in a runaway number of goroutines and network connections in `CLOSE_WAIT`.
@tantra35 you've mentioned a couple times both 0.8.x and 0.9.x -- is this happening similarly on both clusters? If not, what's the exact version of the cluster involved? And is there any chance I can get debug-level logs? The raft system does some logging there that might be helpful to us.
#!/bin/bash
set -e
trap cleanup err exit
cleanup() {
echo
echo "removing queue disciplines!"
sudo tc qdisc del dev eth0 root
}
MY_IP=$(ip addr | grep 'state UP' -A2 | tail -n1 | awk '{print $2}' | cut -f1 -d'/')
SERVER_IPS=($(curl -s localhost:4646/v1/status/peers | jq -r '.[] | split(":") | .[0]'))
# clear any old queue disciplines, if any
sudo tc qdisc del dev eth0 root > /dev/null 2>&1 || true
echo "Hit Ctrl-C to exit..."
while :
do
# configure queue discipline classes
sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:1 handle 10: sfq
sudo tc qdisc add dev eth0 parent 1:2 handle 20: sfq
sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 75% 75%
# attach the filter to the peers
for SERVER_IP in ${SERVER_IPS[@]}; do
if [[ "$SERVER_IP" != "$MY_IP" ]]; then
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
match ip dst "${SERVER_IP}/32" flowid 1:3
fi
done
echo -n '.'
sleep 2
sudo tc qdisc del dev eth0 root
sleep 5
done
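For the polling side mentioned above, the first line of the `debug=1` profile ("goroutine profile: total N", a format that comes from Go's `net/http/pprof`) can be parsed to track the count over time; a minimal sketch:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// goroutineTotal extracts N from the header line of a
// /debug/pprof/goroutine?debug=1 response, which looks like:
//   goroutine profile: total 123
func goroutineTotal(profile string) (int, error) {
	line, _, _ := strings.Cut(profile, "\n")
	fields := strings.Fields(line)
	if len(fields) < 3 || fields[0] != "goroutine" {
		return 0, fmt.Errorf("unexpected header: %q", line)
	}
	return strconv.Atoi(fields[len(fields)-1])
}

func main() {
	// In a real polling loop the string would come from an HTTP GET
	// against the agent; here we use a canned sample.
	n, err := goroutineTotal("goroutine profile: total 4242\n\n1 @ 0x43...\n")
	fmt.Println(n, err) // prints "4242 <nil>"
}
```

A leak shows up as this number climbing monotonically between polls instead of settling.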
@tgross
Our t2 instances were never overloaded, so they were never throttled (and besides, we set them to Unlimited mode: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances-unlimited-mode.html).
If you have similar logs, then you must be seeing the pair of messages:
nomad: cluster leadership lost
nomad: cluster leadership acquired
Since you don't see any leaked sockets, I am 100% sure that your logs contain this pair, which means it is not our case. At some point we stop seeing `cluster leadership lost` even though the logs show that raft leadership was lost.
I think you're missing the fact that Nomad detects raft leadership changes by reading the buffered channel `leaderCh` (https://github.com/hashicorp/nomad/blob/master/nomad/server.go#L1237), which has a buffer size of 1. When this buffer is overrun, the raft code blocks, so leadership has to flap multiple times to overrun it. So maybe the key to reproducing is to disable this buffering (I haven't checked this idea), and a possible solution also follows from this: increase the buffer to something bigger.
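The failure mode described above can be sketched with a toy program: a buffered channel of size one stalls its producer as soon as a second unconsumed notification arrives. This is a simplified stand-in, not Nomad's actual leadership loop.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A buffer of 1, like leaderCh in nomad/server.go.
	leaderCh := make(chan bool, 1)
	blockedFor := make(chan time.Duration, 1)

	// Producer stands in for raft flapping between leader and
	// follower while the consumer is busy.
	go func() {
		leaderCh <- true // fills the single buffer slot immediately
		start := time.Now()
		leaderCh <- false // blocks here until the consumer reads
		blockedFor <- time.Since(start)
		leaderCh <- true
	}()

	// Consumer (the leadership monitor) is busy for a while.
	time.Sleep(50 * time.Millisecond)
	fmt.Println("observed leader =", <-leaderCh) // true
	fmt.Println("observed leader =", <-leaderCh) // false
	fmt.Println("observed leader =", <-leaderCh) // true
	// The second send stalled for roughly the 50ms the consumer slept.
	fmt.Println("producer blocked for:", <-blockedFor)
}
```

If the producer instead used a non-blocking send, the notification would be dropped rather than the producer blocked; either way, a flap against a full size-1 buffer means the consumer's view of leadership can lag reality, which matches the stuck-leader symptoms in this issue.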
We applied the patch described in #4749 and haven't seen any problems for about 19 days (since 22.12.2019; you can see on the graph that before that we hit the issue more often; of course this doesn't mean the issue is fully solved, we need more time to check). So for now it's not possible to provide profiling or debug logs.
From the old leadership problems we have some errors in the logs like this:
"December 16th 2019, 12:45:08.079"," 2019-12-16T12:45:08.079Z [ERROR] core.sched: batch job reap failed: error=""rpc error: Permission denied"""
This is because the leader generates a new (management) token every time leadership changes, and as I understand it this token is remembered in the eval, so we end up with some of these buggy evals that will never finish. How can we solve this?
I just want to make sure I understand very clearly: you see the `nomad: cluster leadership acquired` log, but you never see the `nomad: cluster leadership lost` log on any of the servers?
We don't see `nomad: cluster leadership lost` on the Nomad server with leaked sockets; the other servers behave as expected.
If you see the message pair, it means that in your test case Nomad works perfectly and doesn't have any problems. In our case Nomad doesn't fully leave the leader state, because `cluster leadership lost` never appears in the logs, and we have 2 leaders until the buggy server is restarted.
For example, the buggy server doesn't update its cluster state, and if we request an allocation that appeared after the bug happened, this server responds that the allocation is not found. Other strange things happen when we communicate with the cluster (requesting allocations, collecting allocation stats, launching jobs, etc.), so this looks like a split brain.
Just wanted to add that we are experiencing the same issue. We're making calls to `/v1/client/allocation/<alloc-id>/stats` and once in a while the call just hangs (it does not time out or return any other error). Besides the fact that the Java SDK does not have any timeout implemented (a problem on its own), the only solution is to do a rolling restart of the Nomad servers. For us it happens about once every few days, but thanks to @tantra35's debugging info, I now understand a little bit more why. Not sure if I can provide more info. For now, I was unable to reproduce it.
@filip-vt if you are a brave enough Jedi, you can try our binary https://yadi.sk/d/QnKOvsBaY4fSog (Linux only)
@tantra35 I just wanted to follow up with you on this issue. On Feb 19 we shipped 0.10.4. My colleague @notnoop landed a fix to leadership flapping (https://github.com/hashicorp/nomad/pull/6977), which, combined with the RPC fix we shipped in 0.10.3, we believe should fix the problems in this issue and in #4749.
@tgross We saw all the changes made by the HashiCorp team. Good work, but for now we're still using 0.9.7 (with our custom patch) because we don't yet believe 0.10.4 is ready for production (we'd have to backport https://github.com/hashicorp/nomad/pull/6324 to 0.10.4). Also, the leaked-goroutine problem is only partially solved: the raft problems are eliminated, but https://github.com/hashicorp/nomad/issues/4604 is not (I've provided more of what we found in that issue).
Thanks @tantra35. I'm going to close this issue and we can follow up on any remaining issues in #4604.
I'm going to lock this issue because it has been closed for 120 days β³. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Nomad version
0.8.x - latest nomad version
Operating system and Environment details
Issue
We make Nomad API calls to collect telemetry for jobs, and from time to time we see goroutine leaks appear on the control plane (we remove them by restarting the Nomad servers with leaked goroutines).
This happens in the yamux code; see the goroutine dump we captured with the Delve debugger: glist.txt
We also captured backtraces of some goroutines with the most typical behavior: bt-1119376.txt bt-821359.txt bt-886686.txt
The backtraces were collected on nomad-0.9.6.