hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

goroutine leaks when making nomad api calls #6620

Closed tantra35 closed 4 years ago

tantra35 commented 4 years ago

Nomad version

0.8.x through the latest Nomad version

Operating system and Environment details

Issue

We make Nomad API calls to collect telemetry for jobs, and from time to time we see goroutine leaks appear on the control plane (we clear them by restarting the Nomad server that has the leaked goroutines).

(screenshot: img-2019-11-05-16-32-43)

This happens in the yamux code; see our goroutine list, captured with the Delve debugger: glist.txt

We also collected backtraces of some goroutines showing the most typical behavior: bt-1119376.txt bt-821359.txt bt-886686.txt

The backtraces were collected on Nomad 0.9.6.

tgross commented 4 years ago

Thanks for reporting this, @tantra35! I'm going to take a look through the logs you've provided and see if I can replicate this.

tantra35 commented 4 years ago

@tgross It's very hard to reproduce; for example, we can't simulate this on a test stand, only on the production system :-(

tgross commented 4 years ago

Understood... but we'll figure it out! πŸ˜€

Just so that I understand the scenario you have here: you're hitting the /client/allocation/:alloc_id/stats HTTP API on the server?

tantra35 commented 4 years ago

@tgross yes

tgross commented 4 years ago

@tantra35 I took a first pass through the logs you provided. The yamux goroutines that are in OpenStream and AcceptStream are blocking as a backpressure mechanism, so that bit is working as intended. But they're blocking on a large number of goroutines waiting for the yamux Read calls to complete, which are in turn blocked waiting to receive. So that's definitely wrong but I haven't yet figured out why they're not receiving. But it also looks like this could be closely related to a long-standing open issue that you opened here https://github.com/hashicorp/nomad/issues/4604.

I'm going to assign myself to that issue as well (and I apologize that it was left to languish after you provided so much good info!). But in the meantime do you also see lots of open sockets when this goroutine leak is happening?

tantra35 commented 4 years ago

@tgross yes the situation is the same

(screenshot)

tgross commented 4 years ago

In #4604 you reported that these were mostly sockets in CLOSE_WAIT, and that the problem was mostly occurring when a server was talking cross-region to the client. Our clients initiate the connection to the servers so if the server has sockets in CLOSE_WAIT that suggests the client is sending FIN back to the server but the server isn't wrapping up the connection from its end.

I'm trying to see if I can reproduce the behavior but haven't yet done so successfully. Right now that environment looks like the following:

#!/bin/bash
while true
do 
  curl -so /dev/null "$NOMAD_ADDR/v1/client/allocation/$ALLOC_ID/stats"
  echo -n "."
  sleep 1
done

Then on the client I'm using tc to disrupt the network traffic by inducing packet loss or delay. So far I haven't come up with a scenario that's causing it, but I've got a few more I want to try out tomorrow before I give up on this avenue.


I also think that the suggestion you had in the earlier issue to add the RPC timeout is probably right (but maybe not enough, as you noted), but I want to have a reproduction before I go changing things in Nomad. There's a deadline configuration in yamux we're not using.
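
To make the deadline gap concrete, here is a minimal sketch (not Nomad's actual RPC code) of the hooks yamux exposes: streams satisfy net.Conn, so read/write deadlines can bound a hung Read, while the stream-open phase had no per-call timeout option in the yamux versions of that era. The address and the 10-second deadline are assumptions for illustration only.

package main

import (
	"fmt"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	// Hypothetical target; 4647 is Nomad's server RPC port, but this address
	// is an assumption for the example, not taken from this issue.
	conn, err := net.Dial("tcp", "127.0.0.1:4647")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	cfg := yamux.DefaultConfig()
	cfg.KeepAliveInterval = 30 * time.Second
	cfg.ConnectionWriteTimeout = 10 * time.Second // bounds writes, not stream opens

	session, err := yamux.Client(conn, cfg)
	if err != nil {
		panic(err)
	}
	defer session.Close()

	// OpenStream has no per-call timeout parameter, which is the gap discussed
	// above: if the session is wedged, this call can block indefinitely.
	stream, err := session.OpenStream()
	if err != nil {
		panic(err)
	}
	defer stream.Close()

	// Streams implement net.Conn, so a deadline turns a hung Read into an
	// error instead of a goroutine that waits forever.
	if err := stream.SetDeadline(time.Now().Add(10 * time.Second)); err != nil {
		panic(err)
	}

	buf := make([]byte, 4096)
	n, err := stream.Read(buf)
	fmt.Println(n, err)
}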

tgross commented 4 years ago

One of the complicating factors for my tests yesterday was that if the Nomad client loses connectivity with a server, it'll establish a connection with another server. I've corrected this with a new test setup where I'm running the following script on the client:

#!/bin/bash
# must be run as root

SERVER_IPS=($(curl -s localhost:4646/v1/status/peers | jq -r '.[] | split(":") | .[0]'))

echo "dropping traffic from ${SERVER_IPS[0]} ${SERVER_IPS[1]}"
echo "shaping traffic from ${SERVER_IPS[2]}"

iptables -I INPUT -s ${SERVER_IPS[0]} -j DROP
iptables -I INPUT -s ${SERVER_IPS[1]} -j DROP

# create 3 queue-discipline classes
tc qdisc add dev eth0 root handle 1: prio

# configure these classes, with the low-priority band being set to drop 25% of
# packets with 25% correlation. this should be enough to cause TCP connections
# that we send thru that queue to break and force the Nomad client to try to 
# reconnect
tc qdisc add dev eth0 parent 1:1 handle 10: sfq
tc qdisc add dev eth0 parent 1:2 handle 20: sfq
tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 25% 25%

# filter traffic from server 3
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
     match ip dst ${SERVER_IPS[2]}/32 flowid 1:3

After verifying that we're seeing roughly 25% packet loss from the last server and that this is the only server the client can connect to, I set my NOMAD_ADDR to the same server (and in a second attempt, a different server so that we exercise the forwarding code paths) and tried my curl loop again. The HTTP calls to the stats endpoint do slow down and sometimes time out in that situation. But the number of goroutines remains steady and there are only 2 stream Read calls happening at a time.

I've even ramped up to 50% and 75% packet loss to see if that could recreate the situation, but at higher packet loss levels the Nomad servers just mark the node as down and we get "no path to node". I eventually start getting lingering FIN_WAIT1 and FIN_WAIT2 at that level of degradation but not very many.

I'm going to leave this to soak for a couple hours to see if anything changes, but at this point I think my hypothesis that this has to do with simple network degradation is incorrect -- or at least more complex than that.

@tantra35 it might help if you can answer a few questions:

tantra35 commented 4 years ago

@tgross We are hosted in the AWS cloud and don't see any network problems most of the time. We use Ubuntu 16.04 LTS as our main distro, with minimal tuning of network parameters.

Also, I think the real problem is that Nomad doesn't use any timeouts in yamux. We have a patched version that implements them; it's not ideal, because Nomad has blocking API calls, so sometimes the timeouts have to be very large: 5 minutes (the maximum blocking time). But yamux doesn't allow setting a timeout on the connection phase, only on read/write, and on the patched version we observe that goroutines begin to leak in the yamux connection phase (OpenStream). I can demonstrate this with the following goroutine list and some stack traces:

OpenStreamHung.txt 1.txt 1.1.txt 1.2.txt 1.2.trace1.txt 1.2.trace2.txt 1.2.trace3.txt 1.2.trace4.txt 2.txt

I got this on Nomad 0.8.6 (but I didn't find time to post a bug report, and the fact that we were running a patched version also held us back).

tgross commented 4 years ago

I agree that the lack of timeouts is one issue. The blocked goroutines should all continue once we have some timeouts in place but I want to have a clear mechanism of the bug so that we're not just pushing the problem to some other place in the stack.

When you're talking about a patched version, you're referring to the patch from #4604, or is there some other patch you're using here?:

sc.stream.SetDeadline(time.Now().Add(10 * time.Second))

For the file 2.txt you've sent, is this a particular client node that's been demonstrating this problem? Does this happen with any other client node? That EC2 instance has impressive uptime!

In #4604 you mentioned there was another region involved and that you weren't seeing the problem with same-region metrics collection. So the problem may be part of how we're handling RPC forwarding. If the diagram below describes this topology correctly, are you seeing the goroutine increase on Server A or Server B?

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” 
β”‚        region A        β”‚     β”‚       region B        β”‚
β”‚                        β”‚     β”‚                       β”‚
β”‚  metrics ──> server A ─┼─────┼─> server B ──> client β”‚
β”‚  scraper               β”‚     β”‚                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ 
tantra35 commented 4 years ago

@tgross Oh, it's been a long time since then, and we abandoned that scheme. For now we don't do any cross-region queries for statistics (metrics) and instead have a collecting daemon in each region. That's also why we removed the patch with timeouts: we had hoped that the cross-region queries were the real reason for the hung connections, but as practice shows, we were wrong.

As for the patch with timeouts in #4604, you are absolutely right, but we extended that timeout to 5 minutes because of the blocking queries that Nomad calls can make (https://www.nomadproject.io/api/index.html#blocking-queries), whose maximum timeout is 5 minutes.

As for the 2.txt file in my previous post: I posted it along with the others by mistake. It is irrelevant here and demonstrates other strange behavior in Nomad 0.8.6 that is no longer reproducible on 0.9.6.

tgross commented 4 years ago

@tantra35 I'd still like to see if I can reproduce this. One of the avenues I'm investigating is whether the number of allocations involved is very high and that's exposing some kind of deadlocking condition we'd normally never hit in development environments. You mentioned this only happens in production -- roughly how many allocations per node is this? Is the HTTP client you're using to hit the API querying all the allocations at once, or sequentially?

tantra35 commented 4 years ago

@tgross Hm, I don't think that is the key to the problem, because in our case a node launches 5-10 allocations on average. I must also say that the timing is very strange: there can be a month where we don't see any problems, but sometimes it happens for a few days continuously.

tantra35 commented 4 years ago

@tgross Hello, we now think that the real problem in our case is precisely the leader switch, and the goroutine increase is a consequence. Look at the following screenshots:

(screenshot)

As you can see, at 15:44 Nomad switched leader, and at this point something went wrong: ip-172-27-5-61, which became leader for a short time, quickly lost leadership

(screenshot)

but it did not return to the raft follower state, so at this point the cluster had 2 leaders, one of which was wrong (in this case ip-172-27-5-61). Because our environment is dynamic, we don't have a constant number of worker nodes (they start up and shut down with autoscaling). And since ip-172-27-5-61 still thought it was the leader, it did not update its list of live nodes, and as a result it kept connecting to dead nodes; at this moment we see constantly increasing TCP socket usage on ip-172-27-5-61

(screenshot)

This increase can only be explained by connections to dead nodes (another bug in Nomad, as far as I can conclude), given the yamux streaming model (which should need only a few TCP connections). I haven't yet found the place in the Nomad code base where these TCP connections are made; in any case, it looks very strange that the TCP socket count starts to grow.

I also attached the eval broker stats at the buggy moment:

(screenshot)

There you can also see that at the moment of the switch, ip-172-27-5-61 stopped functioning properly until it was restarted.

So there may be some race condition at the moment when allocation stats are requested from a server node and a leadership switch happens, or this may simply be a fast leadership loss that Nomad does not handle correctly.

I also provide some statistical info from the logs at the buggy moment:

(screenshot)

There we can see that the new leader (ip-172-27-3-86) complains about the health of the buggy node (ip-172-27-5-61), with the following messages:

2019-12-16T12:59:59.229Z [WARN ] nomad.stats_fetcher: failed retrieving server health; last request still outstanding: server=ip-172-27-5-61.aav01

and the following log on the buggy node:

"@timestamp",MESSAGE
"December 16th 2019, 12:59:17.484","    2019-12-16T12:59:17.484Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/b251fbc6-62ca-3587-82df-790ef166832a/stats?node_id=657a4dfb-77be-b08e-f933-6eff4a2c49ef&namespace=default error=""Unknown allocation ""b251fbc6-62ca-3587-82df-790ef166832a"""" code=404"
"December 16th 2019, 12:57:41.418","    2019-12-16T12:57:41.417Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509781"" index=1509781"
"December 16th 2019, 12:57:23.161","    2019-12-16T12:57:23.161Z [ERROR] http: request failed: method=GET path=/v1/allocation/b251fbc6-62ca-3587-82df-790ef166832a?region=aav01&stale= error=""alloc not found"" code=404"
"December 16th 2019, 12:57:16.415","    2019-12-16T12:57:16.415Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509781"" index=1509781"
"December 16th 2019, 12:57:10.413","    2019-12-16T12:57:10.413Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509781"" index=1509781"
"December 16th 2019, 12:56:21.816","    2019-12-16T12:56:21.816Z [ERROR] http: request failed: method=GET path=/v1/allocation/b251fbc6-62ca-3587-82df-790ef166832a?region=aav01&stale= error=""alloc not found"" code=404"
"December 16th 2019, 12:56:11.720","    2019-12-16T12:56:11.719Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509779"" index=1509779"
"December 16th 2019, 12:56:05.717","    2019-12-16T12:56:05.717Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509779"" index=1509779"
"December 16th 2019, 12:55:40.710","    2019-12-16T12:55:40.710Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509778"" index=1509778"
"December 16th 2019, 12:55:34.708","    2019-12-16T12:55:34.708Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509778"" index=1509778"
"December 16th 2019, 12:51:43.184","    2019-12-16T12:51:43.184Z [ERROR] http: request failed: method=GET path=/v1/allocation/b251fbc6-62ca-3587-82df-790ef166832a?region=aav01&stale= error=""alloc not found"" code=404"
"December 16th 2019, 12:50:45.991","    2019-12-16T12:50:45.990Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509756"" index=1509756"
"December 16th 2019, 12:50:30.466","    2019-12-16T12:50:30.465Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509756"" index=1509756"
"December 16th 2019, 12:50:20.988","    2019-12-16T12:50:20.988Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509756"" index=1509756"
"December 16th 2019, 12:50:14.986","    2019-12-16T12:50:14.985Z [ERROR] worker: error waiting for Raft index: error=""timed out after 5s waiting for index=1509756"" index=1509756"
"December 16th 2019, 12:48:18.133","    2019-12-16T12:48:18.133Z [ERROR] nomad.vault: token revocation errored: error=""timed out enqueuing operation"""
"December 16th 2019, 12:46:58.554","    2019-12-16T12:46:58.554Z [ERROR] worker: error invoking scheduler: error=""failed to process evaluation: rpc error: Permission denied"""
"December 16th 2019, 12:46:58.554","    2019-12-16T12:46:58.554Z [ERROR] core.sched: batch job reap failed: error=""rpc error: Permission denied"""
"December 16th 2019, 12:46:57.551","    2019-12-16T12:46:57.550Z [ERROR] worker: error invoking scheduler: error=""failed to process evaluation: rpc error: Permission denied"""
"December 16th 2019, 12:46:57.550","    2019-12-16T12:46:57.550Z [ERROR] core.sched: batch job reap failed: error=""rpc error: Permission denied"""
"December 16th 2019, 12:45:59.478","    2019-12-16T12:45:59.477Z [ERROR] nomad: failed to establish leadership: error=""force run of periodic job ""<ns: \""default\"", id: \""impala-copy\"">"" failed: timed out enqueuing operation"""
"December 16th 2019, 12:45:59.478","    2019-12-16T12:45:59.477Z [ERROR] nomad.periodic: force run of periodic job failed: job=""<ns: ""default"", id: ""impala-copy"">"" error=""timed out enqueuing operation"""
"December 16th 2019, 12:45:59.477","    2019-12-16T12:45:59.477Z [ERROR] nomad.periodic: failed to dispatch job: job=""<ns: ""default"", id: ""impala-copy"">"" error=""timed out enqueuing operation"""
"December 16th 2019, 12:45:29.477","Code: 400. Errors:"
"December 16th 2019, 12:45:29.477","* 1 error occurred:"
"December 16th 2019, 12:45:29.477","* invalid accessor"""
"December 16th 2019, 12:45:29.476","    2019-12-16T12:45:29.476Z [WARN ] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error=""failed to revoke token (alloc: ""f95b0d03-af4e-099d-85ee-34463cd1c2cf"", node: ""2ffb6c3c-48e4-086c-070d-fe549dcdc087"", task: ""mongo-restore-stage""): Error making API request."
"December 16th 2019, 12:45:29.476","URL: POST https://vault.service.consul:8200/v1/auth/token/revoke-accessor"
"December 16th 2019, 12:45:27.480","    2019-12-16T12:45:27.479Z [ERROR] http: request failed: method=GET path=/v1/job/impala-copy/periodic-1576500325?region=aav01&stale= error=""job not found"" code=404"
"December 16th 2019, 12:45:27.479","    2019-12-16T12:45:27.479Z [ERROR] http: request failed: method=GET path=/v1/allocation/b251fbc6-62ca-3587-82df-790ef166832a?region=aav01&stale= error=""alloc not found"" code=404"
"December 16th 2019, 12:45:22.116","    2019-12-16T12:45:22.115Z [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 172.27.3.86:4647}"
"December 16th 2019, 12:45:22.116","    2019-12-16T12:45:22.115Z [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 7062e197-58e6-d20e-6122-846f2b198254 172.27.0.212:4647}"
"December 16th 2019, 12:45:22.115","    2019-12-16T12:45:22.115Z [WARN ] nomad: raft: Failed to contact 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 in 500.288751ms"
"December 16th 2019, 12:45:22.115","    2019-12-16T12:45:22.115Z [WARN ] nomad: raft: Failed to contact 7062e197-58e6-d20e-6122-846f2b198254 in 531.995222ms"
"December 16th 2019, 12:45:22.115","    2019-12-16T12:45:22.115Z [WARN ] nomad: raft: Failed to contact quorum of nodes, stepping down"
"December 16th 2019, 12:45:22.083","    2019-12-16T12:45:22.083Z [WARN ] nomad: raft: Failed to contact 7062e197-58e6-d20e-6122-846f2b198254 in 500.264924ms"
"December 16th 2019, 12:45:20.930","    2019-12-16T12:45:20.930Z [INFO ] nomad: raft: pipelining replication to peer {Voter 7062e197-58e6-d20e-6122-846f2b198254 172.27.0.212:4647}"
"December 16th 2019, 12:45:20.929","    2019-12-16T12:45:20.929Z [INFO ] nomad: raft: pipelining replication to peer {Voter 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 172.27.3.86:4647}"
"December 16th 2019, 12:45:20.927","    2019-12-16T12:45:20.926Z [INFO ] nomad: raft: Election won. Tally: 2"
"December 16th 2019, 12:45:20.927","    2019-12-16T12:45:20.926Z [INFO ] nomad: raft: Added peer 7062e197-58e6-d20e-6122-846f2b198254, starting replication"
"December 16th 2019, 12:45:20.927","    2019-12-16T12:45:20.926Z [INFO ] nomad: raft: Added peer 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9, starting replication"
"December 16th 2019, 12:45:20.927","    2019-12-16T12:45:20.926Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Leader] entering Leader state"
"December 16th 2019, 12:45:20.884","    2019-12-16T12:45:20.883Z [INFO ] nomad: raft: Duplicate RequestVote for same term: 10116"
"December 16th 2019, 12:45:20.186","    2019-12-16T12:45:20.185Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Candidate] entering Candidate state in term 10116"
"December 16th 2019, 12:45:20.186","    2019-12-16T12:45:20.185Z [WARN ] nomad: raft: Heartbeat timeout from """" reached, starting election"
"December 16th 2019, 12:45:19.175","    2019-12-16T12:45:19.175Z [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 7062e197-58e6-d20e-6122-846f2b198254 172.27.0.212:4647}"
"December 16th 2019, 12:45:19.175","    2019-12-16T12:45:19.175Z [WARN ] nomad: raft: Failed to contact 7062e197-58e6-d20e-6122-846f2b198254 in 500.071135ms"
"December 16th 2019, 12:45:19.175","    2019-12-16T12:45:19.175Z [WARN ] nomad: raft: Failed to contact quorum of nodes, stepping down"
"December 16th 2019, 12:45:19.175","    2019-12-16T12:45:19.175Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Follower] entering Follower state (Leader: """")"
"December 16th 2019, 12:45:19.175","    2019-12-16T12:45:19.175Z [INFO ] nomad: raft: aborting pipeline replication to peer {Voter 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 172.27.3.86:4647}"
"December 16th 2019, 12:45:19.174","    2019-12-16T12:45:19.174Z [WARN ] nomad: raft: Failed to contact 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 in 500.252525ms"
"December 16th 2019, 12:45:18.658","    2019-12-16T12:45:18.658Z [ERROR] worker: failed to dequeue evaluation: error=""eval broker disabled"""
"December 16th 2019, 12:45:18.658","    2019-12-16T12:45:18.658Z [ERROR] worker: failed to dequeue evaluation: error=""eval broker disabled"""
"December 16th 2019, 12:45:18.653","    2019-12-16T12:45:18.652Z [INFO ] nomad: raft: pipelining replication to peer {Voter 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9 172.27.3.86:4647}"
"December 16th 2019, 12:45:18.651","    2019-12-16T12:45:18.650Z [INFO ] nomad: raft: pipelining replication to peer {Voter 7062e197-58e6-d20e-6122-846f2b198254 172.27.0.212:4647}"
"December 16th 2019, 12:45:18.649","    2019-12-16T12:45:18.649Z [INFO ] nomad: cluster leadership acquired"
"December 16th 2019, 12:45:18.649","    2019-12-16T12:45:18.648Z [INFO ] nomad: raft: Added peer 7062e197-58e6-d20e-6122-846f2b198254, starting replication"
"December 16th 2019, 12:45:18.649","    2019-12-16T12:45:18.648Z [INFO ] nomad: raft: Election won. Tally: 2"
"December 16th 2019, 12:45:18.649","    2019-12-16T12:45:18.648Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Leader] entering Leader state"
"December 16th 2019, 12:45:18.649","    2019-12-16T12:45:18.648Z [INFO ] nomad: raft: Added peer 2d339ebc-1bf5-858b-f088-a3cf0eeafcb9, starting replication"
"December 16th 2019, 12:45:18.638","    2019-12-16T12:45:18.637Z [ERROR] worker: failed to dequeue evaluation: error=""rpc error: eval broker disabled"""
"December 16th 2019, 12:45:18.638","    2019-12-16T12:45:18.637Z [ERROR] worker: failed to dequeue evaluation: error=""rpc error: eval broker disabled"""
"December 16th 2019, 12:45:17.480","    2019-12-16T12:45:17.479Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Candidate] entering Candidate state in term 10115"
"December 16th 2019, 12:45:17.479","    2019-12-16T12:45:17.479Z [WARN ] nomad: raft: Election timeout reached, restarting election"
"December 16th 2019, 12:45:16.282","    2019-12-16T12:45:16.281Z [WARN ] nomad: raft: Election timeout reached, restarting election"
"December 16th 2019, 12:45:16.282","    2019-12-16T12:45:16.282Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Candidate] entering Candidate state in term 10114"
"December 16th 2019, 12:45:14.932","    2019-12-16T12:45:14.931Z [INFO ] nomad: raft: Node at 172.27.5.61:4647 [Candidate] entering Candidate state in term 10113"
"December 16th 2019, 12:45:14.932","    2019-12-16T12:45:14.931Z [WARN ] nomad: raft: Heartbeat timeout from ""172.27.0.212:4647"" reached, starting election"
"December 16th 2019, 12:45:09.082","    2019-12-16T12:45:09.082Z [ERROR] core.sched: batch job reap failed: error=""rpc error: Permission denied"""
"December 16th 2019, 12:45:09.082","    2019-12-16T12:45:09.082Z [ERROR] worker: error invoking scheduler: error=""failed to process evaluation: rpc error: Permission denied"""
"December 16th 2019, 12:45:08.079","    2019-12-16T12:45:08.079Z [ERROR] worker: error invoking scheduler: error=""failed to process evaluation: rpc error: Permission denied"""
"December 16th 2019, 12:45:08.079","    2019-12-16T12:45:08.079Z [ERROR] core.sched: batch job reap failed: error=""rpc error: Permission denied""" 

So the approach I suggested in https://github.com/hashicorp/nomad/issues/4749 could be applied here.

tantra35 commented 4 years ago

I just reproduced this again; we collected socket stats (ss -nt) and found:

CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:38246
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:51068
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:36990
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:43642
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:33904
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:41554
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:51460
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:55616
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:55814
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:37494
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:35766
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:47904
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:39322
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:34362
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:36104
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:49902
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53078
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:43852
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:59284
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:47666
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:54228
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:41550
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:60240
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:47136
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:40756
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:51652
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:49424
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:49698
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:50358
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:41432
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:39566
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:58876
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53524
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:38688
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:50508
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:48870
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:43610
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:35488
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:52040
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:52048
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:59254
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:36682
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:34342
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:60586
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:48054
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:59092
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:55362
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53350
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:49828
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:50510
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:42262
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:46204
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:49416
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53282
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53546
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:58590
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:54466
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53510
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53962
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:56566
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:51554
ESTAB      0      0                    ::ffff:172.27.5.61:8300                            ::ffff:172.27.0.113:38434
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:35716
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:56240
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:38084
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:46098
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:60176
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:39866
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53924
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:39558
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:34628
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:36428
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:56468
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:55304
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53460
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:37754
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:40960
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:51934
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:44364
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:41006
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:47722
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:59832
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:55414
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:49160
ESTAB      0      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:38386
ESTAB      0      0                    ::ffff:172.27.5.61:4647                            ::ffff:172.27.0.212:38798
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:50972
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53728
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:38264
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:52760
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:34454
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:52726
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:50460
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:41218
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:43300
ESTAB      0      0                    ::ffff:172.27.5.61:8300                              ::ffff:172.27.0.6:30932
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:45744
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:53640
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:35020
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:51466
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:55266
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:49838
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:43630
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:51916
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:58798
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:45956
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:55448
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:41762
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:36604
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:57096
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:55940
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:34530
CLOSE-WAIT 1      0                    ::ffff:172.27.5.61:4647                             ::ffff:172.27.3.86:47538

So this is not a communication error with worker nodes, but a problem in the raft interoperation itself.

So the problem described here is more complex than I thought before:

  1. two leaders present at the same time (we fought this in 0.8.6 with the patch described in #4749, and it has appeared again for us because we did not apply those modifications to 0.9.x)
  2. hung yamux connections (we have not yet found the reasons for this)
  3. the HashiCorp team completely ignoring reports from the community. The only reason I can imagine for this is that Nomad is not a very high-priority product

tgross commented 4 years ago
  1. two leaders present at the same time (we fought this in 0.8.6 with the patch described in #4749, and it has appeared again for us because we did not apply those modifications to 0.9.x)

That's a pretty serious correctness issue; I'm going to re-open #4749 so we can follow up on that one. Looks like stalebot got overzealous.

  2. hung yamux connections (we have not yet found the reasons for this)

I don't see an obvious way this would be related to multiple leaders, ~but if we can get into that state there's a whole mess of problems that'll fall out of that because having exactly one leader is one of our core assertions.~ (Edit: turns out that isn't the case, see my comments in https://github.com/hashicorp/nomad/issues/4749#issuecomment-571253655) If there's a way we could replicate the leader issue it might give us more insight.

  3. the HashiCorp team completely ignoring reports from the community. The only reason I can imagine for this is that Nomad is not a very high-priority product

I definitely understand your frustration with this. For what it's worth, the Nomad engineering team has been staffed up quite a bit in the last year (especially in the last 6 months). And my particular focus has been around reliability and stability issues. Thanks so much for how much you've been willing to contribute back in extensive bug reports! We'll get there!

tgross commented 4 years ago

Unfortunately it looks like we're missing the original election failover for node 172.27.5.61 in the logs you've provided -- we can see that the node is already in a flapping state and trying to execute leader operations (and failing to do so because it's not the raft leader). It probably doesn't matter how we got into the flapping state in the first place; there's plenty of networking or other reasons why a node might flap and we need to account for that. Like in #4749 we can also see that the node has lost connectivity or has intermittent connectivity to the other two server nodes. It manages to win an election but then times out on heartbeats ~500ms later.

From the output of ss we can see that the buggy node 172.27.5.61 has a bunch of connections in CLOSE_WAIT and all of them are from 172.27.3.86, the new leader. So the pile-up of goroutines isn't because of connections w/ dead nodes (clients make inbound connections with servers anyways, not the other way around, so even if the server somehow had stale data that wouldn't cause a bunch of outbound connections).

That narrows down the places to look quite a bit in terms of trying to reproduce. While I work on that, if you can reproduce while running with debug mode on, could you dump the output of curling /debug/pprof/goroutine?debug=2 to a file?
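
For reference, a minimal sketch of capturing that dump programmatically (equivalent to curling /debug/pprof/goroutine?debug=2 into a file). It assumes the agent is running locally on the default HTTP port with enable_debug set; the output filename is arbitrary.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Fetch the full-text goroutine dump from the agent's debug endpoint.
	resp, err := http.Get("http://127.0.0.1:4646/debug/pprof/goroutine?debug=2")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// Write the dump to a file that can be attached to the issue.
	out, err := os.Create("goroutines.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create failed:", err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
}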

tgross commented 4 years ago

I've been able to get similar log outputs from test clusters of 3 servers by running the script below on one of the machines. I've also tried adding extra load (by running CPU stress tests on AWS t2 class machines, to trigger CPU throttling). I'm then polling /debug/pprof/goroutine?debug=1 to get the count of goroutines. Leadership flaps and forward progress is slow, but unfortunately this hasn't resulted in a runaway number of goroutines or network connections in CLOSE_WAIT.

@tantra35 you've mentioned a couple times both 0.8.x and 0.9.x -- is this happening similarly on both clusters? If not, what's the exact version of the cluster involved? And is there any chance I can get debug-level logs? The raft system does some logging there that might be helpful to us.

#!/bin/bash
set -e

trap cleanup err exit

cleanup() {
    echo
    echo "removing queue disciplines!"
    sudo tc qdisc del dev eth0 root
}

MY_IP=$(ip addr | grep 'state UP' -A2 | tail -n1 | awk '{print $2}' | cut -f1  -d'/')
SERVER_IPS=($(curl -s localhost:4646/v1/status/peers | jq -r '.[] | split(":") | .[0]'))

# clear any old queue disciplines, if any
sudo tc qdisc del dev eth0 root > /dev/null 2>&1 || true

echo "Hit Ctrl-C to exit..."

while :
do
    # configure queue discipline classes
    sudo tc qdisc add dev eth0 root handle 1: prio
    sudo tc qdisc add dev eth0 parent 1:1 handle 10: sfq
    sudo tc qdisc add dev eth0 parent 1:2 handle 20: sfq
    sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem loss 75% 75%

    # attach the filter to the peers
    for SERVER_IP in ${SERVER_IPS[@]}; do
        if [[ "$SERVER_IP" != "$MY_IP" ]]; then
            sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
               match ip dst "${SERVER_IP}/32" flowid 1:3
        fi
    done
    echo -n '.'
    sleep 2

    sudo tc qdisc del dev eth0 root
    sleep 5

done
tantra35 commented 4 years ago

@tgross

  1. Our t2 instances were never overloaded, so they were never throttled (despite this, we set them to Unlimited: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances-unlimited-mode.html)

  2. If your logs were really similar to ours, you would not see this pair of messages:

    nomad: cluster leadership lost
    nomad: cluster leadership acquired

    Since you don't see any leaked sockets, I am 100% sure that your logs do contain this pair, which means your case is not the same as ours. At some moment we do not see cluster leadership lost even though we can see in the logs that raft leadership was lost

  3. I think you are missing the fact that Nomad detects raft leadership by reading from the buffered channel leaderCh (https://github.com/hashicorp/nomad/blob/master/nomad/server.go#L1237), which has a buffer size of 1. When that buffer is overrun, all the raft-related code will block, so multiple leadership flaps must happen to overflow it. Maybe the key to reproducing this is to disable the buffering (I haven't checked this idea); a possible solution that follows from this is to increase the buffer to something bigger (see the sketch after this list)

  4. We applied the patch described in #4749 and have not seen any problems for about 19 days (since 22.12.2019; on the graph you can see that before this we had issues more often; of course this does not mean the issue is fully solved, we need more time to check): (screenshot) So for now it is not possible to provide profiling or debug logs

  5. From the old leadership problems we have some errors in the logs like this:

"December 16th 2019, 12:45:08.079","    2019-12-16T12:45:08.079Z [ERROR] core.sched: batch job reap failed: error=""rpc error: Permission denied""" 

This is because the leader generates a new (management) token every time leadership changes, and as I understand it this token is remembered in the eval, so we have some of these buggy evals that will never finish. How can we solve this?
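
Returning to point 3 above, here is a minimal sketch (not Nomad's or raft's actual code) of the channel mechanics that point describes: with a buffer of 1, a producer flapping faster than the consumer drains must either drop a transition or block.

package main

import (
	"fmt"
	"time"
)

func main() {
	leaderCh := make(chan bool, 1) // same buffer size as the channel referenced above

	// Producer: simulates raft flapping leadership on and off quickly.
	go func() {
		for i := 0; i < 6; i++ {
			isLeader := i%2 == 0
			select {
			case leaderCh <- isLeader:
				fmt.Println("notified:", isLeader)
			default:
				// The buffer is full. A notifier must now either drop the
				// transition (so a "leadership lost" is never observed) or
				// block here, stalling the code that drives raft.
				fmt.Println("dropped :", isLeader)
			}
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Consumer: simulates the server-side loop that reacts to leadership
	// changes, which takes time to set up or tear down leader state.
	for i := 0; i < 3; i++ {
		v := <-leaderCh
		fmt.Println("observed:", v)
		time.Sleep(50 * time.Millisecond)
	}
}

Whether the notification is dropped or the sender blocks, a rapid flap could leave the server-side leadership loop out of sync with raft's actual state, which would be consistent with the missing "cluster leadership lost" message described above.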

tgross commented 4 years ago

If your logs were really similar to ours, you would not see this pair of messages:

nomad: cluster leadership lost
nomad: cluster leadership acquired

Since you don't see any leaked sockets, I am 100% sure that your logs do contain this pair, which means your case is not the same as ours. At some moment we do not see cluster leadership lost even though we can see in the logs that raft leadership was lost

I just want to make sure I understand very clearly:

tantra35 commented 4 years ago

We do not see nomad: cluster leadership lost on the Nomad server with the leaked sockets; the other servers behave as expected.

If you see the message pair, this means that in your test case Nomad works perfectly and doesn't have any problems. In our case Nomad does not fully leave the leader state: cluster leadership lost never appears in the logs, and we have 2 leaders until the buggy server is restarted.

For example, the buggy server does not update the cluster state, and if we request an allocation that appeared after the bug happened, this server responds that the allocation is not found. Other strange things also happen when we communicate with the cluster (requesting allocations, collecting allocation stats, launching jobs, etc.), so this looks like a split brain.

filip-vt commented 4 years ago

Just wanted to add that we are experiencing the same issue. We're making calls to /v1/client/allocation/<alloc-id>/stats and once in a while the call just hangs (it does not time out or return any other error). Besides the fact that the Java SDK does not have any timeout implemented (a problem on its own), the only solution is to do a rolling restart of the nomad servers. For us it happens about once every few days, but thanks to @tantra35's debugging info, I now understand a little bit more why. Not sure if I can provide more info. For now, I was unable to reproduce it.
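
As a caller-side stopgap while the server bug exists, bounding the request with a client timeout at least turns a hung stats call into an error the collector can handle. A minimal sketch, assuming a local agent on the default address; the allocation ID is a placeholder.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// A hard client-side timeout: a wedged server returns an error instead of
	// hanging the caller forever.
	client := &http.Client{Timeout: 10 * time.Second}

	allocID := "00000000-0000-0000-0000-000000000000" // placeholder alloc ID
	url := "http://127.0.0.1:4646/v1/client/allocation/" + allocID + "/stats"

	resp, err := client.Get(url)
	if err != nil {
		fmt.Println("stats call failed or timed out:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(resp.Status, len(body), "bytes")
}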

tantra35 commented 4 years ago

@filip-vt If you are a brave enough Jedi, you can try our binary: https://yadi.sk/d/QnKOvsBaY4fSog (Linux only)

tgross commented 4 years ago

@tantra35 I just wanted to follow up with you on this issue. On Feb 19 we shipped 0.10.4. My colleague @notnoop landed a fix for leadership flapping (https://github.com/hashicorp/nomad/pull/6977), which, combined with the RPC fix we shipped in 0.10.3, we believe should fix the problems in this issue and in #4749.

tantra35 commented 4 years ago

@tgross We saw all the changes made by the HashiCorp team. Good work, but for now we still use version 0.9.7 (with our custom patch) because we don't yet believe that 0.10.4 is ready for production (we would have to backport https://github.com/hashicorp/nomad/pull/6324 to 0.10.4). Also, the problem with leaked goroutines is only partially solved: the raft problems are eliminated, but https://github.com/hashicorp/nomad/issues/4604 is not; in that issue I provided more info about what we found.

tgross commented 4 years ago

Thanks @tantra35. I'm going to close this issue and we can follow up on any remaining issues in #4606.

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.