Hey @Miserlou, sorry you're experiencing issues again.
Please share the logs for this node as well as the other server nodes spanning this incident; debug level would be especially helpful. Could we also get your Nomad server configuration, if possible, and the version of Nomad you're running? As for connections, can you provide the open file handles and what your file handle limit is? Further, can you enable debug = true in your config and provide a profile of Nomad:
go tool pprof -proto -output=prof.proto http://127.0.0.1:4646/debug/pprof/profile?seconds=30
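For the file handle question, something like this should work on Linux (a sketch; it assumes a single nomad process on the box so pidof returns one pid):

# open file descriptors held by the nomad process
ls /proc/$(pidof nomad)/fd | wc -l
# per-process limit for that pid
grep "open files" /proc/$(pidof nomad)/limits
# system-wide limit
cat /proc/sys/fs/file-max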
What is the state of the system during the screenshot? It appears Nomad only has a few seconds of process time. Is something restarting Nomad? Also, is one process a child of another, or are there actually two Nomad agents running on the node? I know some details are spread across a few of your other tickets, but a summary of your deployment architecture here would be helpful: how the servers are deployed, what the machine sizes are, the rate of incoming dispatch requests, etc. Additionally, could you elaborate on your use case for Nomad?
I understand the frustration of coming in first thing to a crashed system, but there's really nothing we can do with the information you've provided in the initial ticket. Please be as detailed as possible so we can help as best we can. In the future, please follow the issue template when opening issues.
$ go tool pprof -proto -output=prof.proto http://127.0.0.1:4646/debug/pprof/profile?seconds=30
Fetching profile from http://127.0.0.1:4646/debug/pprof/profile?seconds=30
Please wait... (30s)
parsing profile: unrecognized profile format
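For reference, here's one way to inspect what the endpoint actually returned before pprof parses it, since an error page (e.g. if enable_debug isn't set) would explain the unrecognized format (a sketch, not output from this incident):

curl -s -o profile.out "http://127.0.0.1:4646/debug/pprof/profile?seconds=30"
file profile.out
head -c 200 profile.out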
Viewing the UI directly gives this:
This looks like it was the crash:
I can't enable debug now since this is a prod system and it's already happened. File max is at 3203524. The SS is after a manual restart; before that it was up about 3 days, and I don't know how long it had been frozen. By the throughput, I'm guessing it was frozen for most of that time. The server machines are m5.2xlarge (4x), the workers are x1.32xlarge (1-10x elastic).
This time it looks like logging stopped, as I can find no actionable logs. In the working directory there is a 226M raft.db and a 3.0GB state.bin. The server config is trivial:
log_level = "WARN"

# Setup data dir
data_dir = "/tmp/nomad_server1"

# Enable the server
server {
  enabled = true

  # Only 1 for the lead server, 3 for the others.
  bootstrap_expect = 1
}

consul {
  auto_advertise = false
  server_auto_join = false
  client_auto_join = false
}
Obviously, this isn't the first time that Nomad server failure has caused total system collapse (https://github.com/hashicorp/nomad/issues/4323), although this time the outage cost us about $4,000, as it happened undetected over a weekend. I'm responsible for maintaining a system which is completely incapable of doing what it claims. The decision to use this technology was made before I arrived. It is too late to turn around. Every day is hell for me. I feel like the victim of fraud who has no choice but to wake up every day and be defrauded over and over again.
As a reminder, when this problem occurs, it takes down all servers in the cluster.
Restarting the server manually appears to cause a "No cluster leader" problem, so it doesn't look like this will even be solvable by a monit-based solution, as the cluster is completely unrecoverable once this occurs. I'd like to try trashing the working directories and starting over, but I'll keep them for now in case I can extract information from them which would be useful for you.
root@ip-10-0-233-82:/home/ubuntu# nomad agent -config server.hcl
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from server.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:
Client: false
Log Level: WARN
Region: global (DC: dc1)
Server: true
Version: 0.8.3
==> Nomad agent started! Log data will stream in below:
2018/11/13 17:42:54 [WARN] snapshot: Found temporary snapshot: 112-7593582-1541858894814.tmp
2018/11/13 17:45:02 [WARN] serf: Failed to re-join any previously known node
2018/11/13 17:45:03 [WARN] raft: Heartbeat timeout from "" reached, starting election
2018/11/13 17:45:03 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:03 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:05 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:05 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:05 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:07 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:07 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:07 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:08 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:10 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:10 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:10 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:11 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:11 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:11 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:12.427247 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:12.505918 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:12.572498 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:12.607766 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:12.628045 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:12.655025 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:12.817723 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:12.857453 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:13 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:13 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:13 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:14 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:14 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:14 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:16 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:16 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:16 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:17.532251 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:17.751311 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:17.770038 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:17.803579 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:17.836089 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:17.872673 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:17 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:17 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:17 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:17.938675 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:18.027099 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:19 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:19 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:19 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:20 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:20 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:20 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:21 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:21 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:21 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:22.972693 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:23 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:23 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:23 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:23.095370 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:23.142658 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:23.143297 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:23.194273 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:23.321416 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:23.372765 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:23.415345 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:24 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:24 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:24 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:26 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:26 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:26 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:27 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:27 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:27 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:28 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:28 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:28 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:29.402827 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:29.424376 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:29.454753 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:29.558519 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:29.561747 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:29.637734 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:29.661212 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:29.700500 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/11/13 17:45:30 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:30 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:30 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
2018/11/13 17:45:31 [WARN] raft: Election timeout reached, restarting election
2018/11/13 17:45:31 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 17:45:31 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
etc
Restarting all of the other servers gives just:
2018/11/13 17:51:10.824468 [ERR] worker: failed to dequeue evaluation: eval broker disabled
2018/11/13 17:51:11.568314 [ERR] worker: failed to dequeue evaluation: eval broker disabled
etc
I have asked this so many times and nobody has ever answered me: how do you load test Nomad? It seems like this is the sort of thing that would have been discovered if you ran any serious workload on Nomad at all.
After about 15 minutes it actually did start an election again, but one of the agents simply died in the process for no reason and so everything failed:
2018/11/13 17:52:54.243597 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
2018/11/13 17:53:04.001945 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
2018/11/13 17:53:04.020823 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
2018/11/13 17:53:04.053997 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
2018/11/13 17:53:04.174553 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
2018/11/13 17:53:04.174625 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
2018/11/13 17:53:04.182779 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
2018/11/13 17:53:04.190254 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
2018/11/13 17:53:04.244400 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
2018/11/13 17:54:00 [WARN] raft: Heartbeat timeout from "10.0.233.82:4647" reached, starting election
2018/11/13 17:54:27 [WARN] raft: Heartbeat timeout from "10.0.233.82:4647" reached, starting election
2018/11/13 17:54:43 [WARN] memberlist: Refuting a suspect message (from: ip-10-0-56-134.global)
2018/11/13 17:54:43 [ERR] raft-net: Failed to flush response: write tcp 10.0.56.134:4647->10.0.233.82:49250: write: broken pipe
2018/11/13 17:54:45 [WARN] raft: Election timeout reached, restarting election
Killed
Hey @Miserlou, I'm working on going through this info. What immediately sticks out at me is that in your server config you have:
# Only 1 for the lead server, 3 for the others.
bootstrap_expect = 1
Does this mean this value is set to 3 on the other servers, and if so, why?
Additionally, it looks like you're running 4 servers?
The server machines are m5.2xlarge (4x)
This shouldn't be an issue, but 3 servers will actually yield better performance without sacrificing any reliability. Quorum is floor(N/2) + 1, so you can only lose 1 server in either scenario and still maintain quorum, but 3 servers means one less participant in Raft.
When you say that your workers (I assume Nomad clients) are elastic, does this mean they're managed by an autoscaling group? If so, do you happen to have a graph of the number of clients during the time of the crash/lockup, as well as a graph of the NomadQueueLength you posted in #4323? Thanks.
Sorry, it is actually 3 servers, not 4. bootstrap_expect is set to 3 on the others:
# Enable the server
server {
  enabled = true

  # Self-elect, should be 3 or 5 for production
  bootstrap_expect = 3

  # This is the IP address of the first server we provisioned
  retry_join = ["X.X.X.X:4648"]
}
This is because it says to in the documentation.
Queue length looks like you'd expect.
I can't seem to find a metric for how many clients were online at the time of the crash. It was probably between 0 and 20. =/
Restarting keeps causing that same Killed message on the 3rd server client. Not sure if that's related or a separate ticket:
2018/11/13 18:20:32 [WARN] memberlist: Refuting a suspect message (from: ip-10-0-233-82.global)
2018/11/13 18:20:49 [WARN] raft: Heartbeat timeout from "10.0.122.116:4647" reached, starting election
2018/11/13 18:21:38 [ERR] memberlist: Push/Pull with ip-10-0-233-82.global failed: write tcp 10.0.56.134:56068->10.0.233.82:4648: i/o timeout
2018/11/13 18:21:59 [ERR] memberlist: Failed to send TCP ack: write tcp 10.0.56.134:4648->10.0.233.82:46372: i/o timeout from=10.0.233.82:46372
2018/11/13 18:22:39 [ERR] yamux: keepalive failed: session shutdown
Killed
bootstrap_expect should be the same on all nodes. From the docs:
A value of 1 does not provide any fault tolerance and is not recommended for production use cases.
Additionally, enabling debug mode and setting the logging level to DEBUG would really help if this happened again. Debug mode doesn't affect the cluster until you attach a profiler to it, and even then the performance overhead is very small. Debug logging will produce additional logs, but otherwise won't impact the server and is safe to do in production.
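For reference, a minimal sketch of those two settings in the agent config (enable_debug is what exposes the /debug/pprof endpoints):

# minimal debug-mode additions to the agent config (sketch)
log_level = "DEBUG"
enable_debug = true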
There are several threads to pull at here. If you could provide the logs from the weekend, perhaps via Dropbox or S3 if they're too large to share here, that would help. If you'd be willing to let us look at the raft.db and state.bin files, that could help as well.
That Killed message literally means something sent a SIGKILL to the process; it's not from Nomad.
Would you share a graph of the memory usage as well? Any chance there are OOM killer events in dmesg?
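Something like this should surface any OOM kills (a sketch; the journalctl variant assumes a systemd host):

dmesg -T | grep -i "out of memory"
# or, via the kernel journal:
journalctl -k | grep -i "out of memory"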
Actually that time it happened on two servers:
2018/11/13 18:23:11 [ERR] raft: peer {Voter 10.0.233.82:4647 10.0.233.82:4647} has newer term, stopping replication
2018/11/13 18:23:55 [WARN] memberlist: Refuting a suspect message (from: ip-10-0-233-82.global)
2018/11/13 18:25:44 [ERR] memberlist: Failed to send TCP ack: write tcp 10.0.122.116:4648->10.0.233.82:34610: i/o timeout from=10.0.233.82:34610
2018/11/13 18:25:51 [WARN] raft: Heartbeat timeout from "" reached, starting election
2018/11/13 18:25:53 [ERR] memberlist: Failed to push local state: write tcp 10.0.122.116:4648->10.0.233.82:34512: i/o timeout from=10.0.233.82:34512
2018/11/13 18:26:05 [WARN] memberlist: Refuting a suspect message (from: ip-10-0-233-82.global)
2018/11/13 18:26:25 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
2018/11/13 18:27:21 [ERR] memberlist: Failed to send TCP ack: write tcp 10.0.122.116:4648->10.0.233.82:34920: i/o timeout from=10.0.233.82:34920
2018/11/13 18:27:21 [WARN] memberlist: Refuting a suspect message (from: ip-10-0-233-82.global)
Killed
It's way, way, way too noisy to use DEBUG on our production systems, especially when this particular error occurs: it generated gigabytes of log output in a few seconds.
It does look like those Killed messages are system OOMs:
[3636251.619735] Out of memory: Kill process 3870 (nomad) score 944 or sacrifice child
[3636251.622470] Killed process 3870 (nomad) total-vm:32370228kB, anon-rss:31232628kB, file-rss:0kB
CWL doesn't automatically track memory usage over time (AFAIK), so I can't give you that graph.
"Throw more memory at it" was our solution to this problem last time but I have to say that I'm very unhappy with that as we still have the problem, only now it's more expensive.
Progress! So Nomad is running out of memory. Now let's try a few things to figure out why. First, if you are able to share the state.bin and raft.db files, that would help.
Try grepping for this in your server logs of the incident:
attempting to apply large raft entry
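For example (the log paths are an assumption; point it wherever your servers actually log):

grep -R "attempting to apply large raft entry" /var/log/nomad*
# or, if the agents log to journald:
journalctl -u nomad | grep "attempting to apply large raft entry"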
Also could you share the job you're dispatching?
Do you have graphs for all of Nomad's metrics? Seeing the following would be really helpful:
nomad.broker.*
nomad.blocked_evals.*
nomad.worker.*
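If telemetry is currently disabled, a minimal stanza along these lines would start publishing those series (a sketch; the statsite address is a placeholder for whatever collector you use):

telemetry {
  statsite_address = "statsite.example.local:8125"
  publish_allocation_metrics = true
  publish_node_metrics = true
}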
We have lots of different jobs; many of them are in this directory: https://github.com/AlexsLemonade/refinebio/blob/dev/workers/nomad-job-specs/salmon.nomad.tpl - that's a template; it ends up being about 70 job specs total at registration time.
I'm not sure about sharing the files because a) they're very big and b) I don't know if they're safe to share. What information do they contain? Obviously I want to make sure that they don't contain any keys or environment variables that are passed around.
My plan is to just implement a governor to put a ceiling on the number of dispatch jobs at once, and add monit to restart server processes that fail. However, I am not confident that it will be enough to get around the underlying failure.
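Roughly what I have in mind for the governor, as a sketch (the job name and ceiling are placeholders):

# only dispatch more work if fewer than MAX dispatched instances exist
MAX=500
count=$(nomad status | grep -c "/dispatch-")
if [ "$count" -lt "$MAX" ]; then
  nomad job dispatch my-batch-job
fi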
How do I get those values you requested? We have disabled telemetry due to it being significantly worse than useless for dispatch jobs.
fatal error: runtime: out of memory
runtime stack:
runtime.throw(0x15b4a85, 0x16)
/usr/local/go/src/runtime/panic.go:605 +0x95
runtime.sysMap(0xcb294c0000, 0xbf40000, 0x20bb200, 0x20d4978)
/usr/local/go/src/runtime/mem_linux.go:216 +0x1d0
runtime.(*mheap).sysAlloc(0x20baa60, 0xbf40000, 0x7fd55913f428)
/usr/local/go/src/runtime/malloc.go:470 +0xd7
runtime.(*mheap).grow(0x20baa60, 0x5fa0, 0x0)
/usr/local/go/src/runtime/mheap.go:887 +0x60
runtime.(*mheap).allocSpanLocked(0x20baa60, 0x5fa0, 0x20d4988, 0x7fd5590f9518)
/usr/local/go/src/runtime/mheap.go:800 +0x334
runtime.(*mheap).alloc_m(0x20baa60, 0x5fa0, 0xffffffffffff0100, 0x7fd5b5ffadf8)
/usr/local/go/src/runtime/mheap.go:666 +0x118
runtime.(*mheap).alloc.func1()
/usr/local/go/src/runtime/mheap.go:733 +0x4d
runtime.systemstack(0x7fd5b5ffae10)
/usr/local/go/src/runtime/asm_amd64.s:360 +0xab
runtime.(*mheap).alloc(0x20baa60, 0x5fa0, 0x7fd5b5010100, 0x4168b4)
/usr/local/go/src/runtime/mheap.go:732 +0xa1
runtime.largeAlloc(0xbf40000, 0x7fd5bf200001, 0xcb2928d209)
/usr/local/go/src/runtime/malloc.go:827 +0x98
runtime.mallocgc.func1()
/usr/local/go/src/runtime/malloc.go:722 +0x46
runtime.systemstack(0xc420024000)
/usr/local/go/src/runtime/asm_amd64.s:344 +0x79
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1135
goroutine 1682 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:298 fp=0xc6d87f0898 sp=0xc6d87f0890 pc=0x45c440
runtime.mallocgc(0xbf40000, 0x1455380, 0x1ffe701, 0x3)
/usr/local/go/src/runtime/malloc.go:721 +0x7ae fp=0xc6d87f0940 sp=0xc6d87f0898 pc=0x412f2e
runtime.newarray(0x1455380, 0x44000, 0xc42001e070)
/usr/local/go/src/runtime/malloc.go:853 +0x60 fp=0xc6d87f0970 sp=0xc6d87f0940 pc=0x413290
runtime.makeBucketArray(0x137a960, 0xc6d87f0a12, 0xc60daca340, 0xc5e0368180)
/usr/local/go/src/runtime/hashmap.go:927 +0xf5 fp=0xc6d87f09b8 sp=0xc6d87f0970 pc=0x40c1d5
runtime.hashGrow(0x137a960, 0xc8cfdd9a10)
/usr/local/go/src/runtime/hashmap.go:951 +0xa3 fp=0xc6d87f0a18 sp=0xc6d87f09b8 pc=0x40c393
runtime.mapassign_faststr(0x137a960, 0xc8cfdd9a10, 0xcb29406fc0, 0x6f, 0xcb293ffd80)
/usr/local/go/src/runtime/hashmap_fast.go:666 +0x28d fp=0xc6d87f0a98 sp=0xc6d87f0a18 pc=0x40eedd
github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*InmemSink).SetGaugeWithLabels(0xc4201c6190, 0xcb29408ba0, 0x4, 0x6, 0xcb00000000, 0xcb293ffd80, 0x3, 0x4)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/inmem.go:167 +0x138 fp=0xc6d87f0b70 sp=0xc6d87f0a98 pc=0x7756d8
github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*Metrics).SetGaugeWithLabels(0xc4205180b0, 0xc775439110, 0x4, 0x6, 0xc600000000, 0xc865eff2c0, 0x3, 0x4)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/metrics.go:41 +0x345 fp=0xc6d87f0c78 sp=0xc6d87f0b70 pc=0x779685
github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.SetGaugeWithLabels(0xc775439110, 0x3, 0x3, 0xca00000000, 0xc865eff2c0, 0x2, 0x2)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/start.go:96 +0xce fp=0xc6d87f0cd8 sp=0xc6d87f0c78 pc=0x77d7ae
github.com/hashicorp/nomad/nomad.(*Server).publishJobSummaryMetrics(0xc4201fa280, 0xc91388c780)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:634 +0xeb7 fp=0xc6d87f0fd0 sp=0xc6d87f0cd8 pc=0x100e227
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc6d87f0fd8 sp=0xc6d87f0fd0 pc=0x45f071
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:239 +0x485
goroutine 1 [select, 16 minutes]:
github.com/hashicorp/nomad/command/agent.(*Command).handleSignals(0xc4201e76c0, 0x160c940)
/opt/gopath/src/github.com/hashicorp/nomad/command/agent/command.go:562 +0x22d
github.com/hashicorp/nomad/command/agent.(*Command).Run(0xc4201e76c0, 0xc4200100e0, 0x2, 0x2, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/command/agent/command.go:551 +0xd45
github.com/hashicorp/nomad/vendor/github.com/mitchellh/cli.(*CLI).Run(0xc4201f8640, 0xc4201f8640, 0xc4201bc3a0, 0x37)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/mitchellh/cli/cli.go:255 +0x1eb
main.RunCustom(0xc4200100d0, 0x3, 0x3, 0xc4204f5f60)
/opt/gopath/src/github.com/hashicorp/nomad/main.go:127 +0x4c9
main.Run(0xc4200100d0, 0x3, 0x3, 0x11ce214)
/opt/gopath/src/github.com/hashicorp/nomad/main.go:72 +0x3f
main.main()
/opt/gopath/src/github.com/hashicorp/nomad/main.go:68 +0x63
goroutine 5 [syscall, 18 minutes]:
os/signal.signal_recv(0x0)
/usr/local/go/src/runtime/sigqueue.go:131 +0xa6
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:28 +0x41
goroutine 82 [select]:
github.com/hashicorp/nomad/helper/pool.(*ConnPool).reap(0xc42033b3e0)
/opt/gopath/src/github.com/hashicorp/nomad/helper/pool/pool.go:439 +0x37b
created by github.com/hashicorp/nomad/helper/pool.NewPool
/opt/gopath/src/github.com/hashicorp/nomad/helper/pool/pool.go:170 +0x179
goroutine 32 [select]:
github.com/hashicorp/nomad/command/agent/consul.checkConsulTLSSkipVerify(0x200f200, 0xc4201cc040, 0xc4201c6230, 0x2017de0, 0xc4200d7c60, 0xc42001a180)
/opt/gopath/src/github.com/hashicorp/nomad/command/agent/consul/version_checker.go:42 +0x1ba
created by github.com/hashicorp/nomad/command/agent/consul.(*ServiceClient).Run
/opt/gopath/src/github.com/hashicorp/nomad/command/agent/consul/client.go:287 +0x12c
goroutine 16 [select, 18 minutes]:
github.com/hashicorp/nomad/command/agent/consul.(*ServiceClient).Run(0xc4201d43c0)
/opt/gopath/src/github.com/hashicorp/nomad/command/agent/consul/client.go:293 +0x26a
created by github.com/hashicorp/nomad/command/agent.(*Agent).setupConsul
/opt/gopath/src/github.com/hashicorp/nomad/command/agent/agent.go:950 +0x145
goroutine 15 [sleep]:
time.Sleep(0x3b9aca00)
/usr/local/go/src/runtime/time.go:65 +0x130
github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*Metrics).collectStats(0xc4205180b0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/metrics.go:168 +0x2f
created by github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.New
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/start.go:75 +0x12b
goroutine 14 [select, 18 minutes]:
github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*InmemSignal).run(0xc4201c0280)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/inmem_signal.go:64 +0xf8
created by github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.NewInmemSignal
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/inmem_signal.go:38 +0x162
goroutine 13 [select, 16 minutes, locked to thread]:
runtime.gopark(0x1610910, 0x0, 0x159dab6, 0x6, 0x18, 0x1)
/usr/local/go/src/runtime/proc.go:287 +0x12c
runtime.selectgo(0xc4204a2f50, 0xc4202a2180)
/usr/local/go/src/runtime/select.go:395 +0x1149
runtime.ensureSigM.func1()
/usr/local/go/src/runtime/signal_unix.go:511 +0x220
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2337 +0x1
goroutine 1488 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*EvaluatePool).run(0xc7a09d0210, 0xc8f530d980)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:104 +0x1fb
created by github.com/hashicorp/nomad/nomad.NewEvaluatePool
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:49 +0x14f
goroutine 83 [select, 6 minutes]:
github.com/hashicorp/nomad/nomad.(*RaftLayer).Accept(0xc4201c6500, 0x160ddc0, 0xc4205184d0, 0x200f200, 0xc4203af640)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/raft_rpc.go:63 +0xf8
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*NetworkTransport).listen(0xc4205184d0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:466 +0x49
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.NewNetworkTransportWithConfig
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:166 +0x1c1
goroutine 87 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).leaderLoop(0xc420252000)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/raft.go:500 +0x458
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runLeader(0xc420252000)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/raft.go:420 +0x3f4
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).run(0xc420252000)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/raft.go:140 +0x67
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.run)-fm()
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/api.go:505 +0x2a
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc6d8483140)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66
goroutine 88 [running]:
goroutine running on other thread; stack unavailable
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66
goroutine 89 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runSnapshots(0xc420252000)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/snapshot.go:71 +0x166
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.runSnapshots)-fm()
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/api.go:507 +0x2a
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc6d8483160)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66
goroutine 90 [select, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*serfQueries).stream(0xc66fa50a50)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/internal_query.go:80 +0xf7
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.newSerfQueries
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/internal_query.go:73 +0xd1
goroutine 91 [runnable]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Snapshotter).stream(0xc6d5d51380)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/snapshot.go:189 +0x248
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.NewSnapshotter
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/snapshot.go:131 +0x5c3
goroutine 92 [IO wait]:
internal/poll.runtime_pollWait(0x7fd5bf1aed30, 0x72, 0xffffffffffffffff)
/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc6d73ecd18, 0x72, 0xc6d88d5d00, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc6d73ecd18, 0xffffffffffffff00, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc6d73ecd00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:335 +0x1e2
net.(*netFD).accept(0xc6d73ecd00, 0xc7623b2000, 0x0, 0x0)
/usr/local/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc6d72e1358, 0xf4a324, 0x45b7b0, 0xc6d88d5f90)
/usr/local/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).AcceptTCP(0xc6d72e1358, 0x160dd78, 0xc6d73e3950, 0xc7623b2000)
/usr/local/go/src/net/tcpsock.go:234 +0x49
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).tcpListen(0xc6d73e3950)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/net.go:204 +0x33
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.newMemberlist
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/memberlist.go:144 +0x7a9
goroutine 93 [IO wait]:
internal/poll.runtime_pollWait(0x7fd5bf1aef70, 0x72, 0x0)
/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc6d73ecd98, 0x72, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc6d73ecd98, 0xcaca0c2000, 0x10000, 0x10000)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).ReadFrom(0xc6d73ecd80, 0xcaca0c2000, 0x10000, 0x10000, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:170 +0x19a
net.(*netFD).readFrom(0xc6d73ecd80, 0xcaca0c2000, 0x10000, 0x10000, 0xc6d73e3950, 0xc768cfa000, 0x96, 0x10000, 0x412a57)
/usr/local/go/src/net/fd_unix.go:208 +0x5e
net.(*UDPConn).readFrom(0xc6d72e1360, 0xcaca0c2000, 0x10000, 0x10000, 0xc6d8879e30, 0xc6d8879e88, 0x101, 0xc6d8879ea8)
/usr/local/go/src/net/udpsock_posix.go:47 +0x6a
net.(*UDPConn).ReadFrom(0xc6d72e1360, 0xcaca0c2000, 0x10000, 0x10000, 0x10000, 0x10000, 0x695f, 0x1066112c0f8, 0x20af600)
/usr/local/go/src/net/udpsock.go:118 +0x72
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).udpListen(0xc6d73e3950)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/net.go:302 +0x12b
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.newMemberlist
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/memberlist.go:145 +0x7ce
goroutine 94 [select, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).udpHandler(0xc6d73e3950)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/net.go:400 +0x11a
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.newMemberlist
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/memberlist.go:146 +0x7f3
goroutine 95 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).triggerFunc(0xc6d73e3950, 0x12a05f200, 0xc4b7713f80, 0xc64d314060, 0xc6d8483690)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:116 +0x1d0
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).schedule
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:82 +0x3fb
goroutine 96 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).pushPullTrigger(0xc6d73e3950, 0xc64d314060)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:143 +0x21b
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).schedule
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:88 +0x349
goroutine 97 [runnable]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).triggerFunc(0xc6d73e3950, 0x1dcd6500, 0xc4b7388000, 0xc64d314060, 0xc6d84836a0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:116 +0x1d0
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).schedule
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:94 +0x1f3
goroutine 130 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Serf).handleReap(0xc57d243800)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:1406 +0x168
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.Create
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:407 +0xa36
goroutine 131 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Serf).handleReconnect(0xc57d243800)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:1424 +0x115
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.Create
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:408 +0xa5b
goroutine 132 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Serf).checkQueueDepth(0xc57d243800, 0x159ca84, 0x6, 0xc66fb45020)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:1523 +0x108
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.Create
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:409 +0xa9e
goroutine 133 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Serf).checkQueueDepth(0xc57d243800, 0x159b5a3, 0x5, 0xc66fb450e0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:1523 +0x108
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.Create
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:410 +0xae4
goroutine 134 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Serf).checkQueueDepth(0xc57d243800, 0x159b76f, 0x5, 0xc66fb45110)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:1523 +0x108
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.Create
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:411 +0xb2a
goroutine 136 [semacquire]:
sync.runtime_notifyListWait(0xc60c171d90, 0x0)
/usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171d80)
/usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b7388060)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b7388060, 0x1dcd6500, 0x24, 0xc5b9ebf440, 0x24, 0x1, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b7388060)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b
goroutine 137 [semacquire]:
sync.runtime_notifyListWait(0xc60c171dd0, 0x0)
/usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171dc0)
/usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b7388120)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b7388120, 0x1dcd6500, 0x24, 0xc5b13bc630, 0x24, 0x1, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b7388120)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b
goroutine 138 [semacquire]:
sync.runtime_notifyListWait(0xc60c171e10, 0x0)
/usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171e00)
/usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b7388180)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b7388180, 0x1dcd6500, 0x24, 0xc5a951a060, 0x24, 0x1, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b7388180)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b
goroutine 139 [semacquire]:
sync.runtime_notifyListWait(0xc60c171ed0, 0x0)
/usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171ec0)
/usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b73881e0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b73881e0, 0x1dcd6500, 0x24, 0xc5aa3a0420, 0x24, 0x1, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b73881e0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b
goroutine 140 [semacquire]:
sync.runtime_notifyListWait(0xc60c171f10, 0x0)
/usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171f00)
/usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b73882a0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b73882a0, 0x1dcd6500, 0x24, 0xc5b0a176e0, 0x24, 0x1, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b73882a0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b
goroutine 141 [semacquire]:
sync.runtime_notifyListWait(0xc60c171f90, 0x0)
/usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171f80)
/usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b7388300)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b7388300, 0x1dcd6500, 0x24, 0xc5b13bd230, 0x24, 0x1, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b7388300)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b
goroutine 142 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc902d70200, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc482077806, 0x1420740, 0xc82205aae0, 0x24, 0x0, 0x0, 0x20af600, 0xc89b5b0910)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc82205aae0, 0xc89b5b0910, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc42051c548, 0x13, 0x159a5af, 0x4, 0xc6ae70da10, 0x3, 0x3, 0x13c9200, 0x19, ...)
/usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc42051c548, 0x13, 0xc6ae70da10, 0x3, 0x3, 0x16, 0x16, 0x16)
/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc4201c04c0, 0xc4201c6460, 0xc89b5b08f8, 0xc420533000, 0xc668312000, 0x1420740, 0xc82205aae0, 0x16, 0x12903c0, 0xc89b5b0910, ...)
/usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc4201c6460, 0x200f540, 0xc8367e4a80, 0x13c9200, 0xc6ae70db01)
/usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).RPC(0xc4201fa280, 0x15a3a15, 0xb, 0x1420740, 0xc82205aa80, 0x12903c0, 0xc89b5b08f0, 0x20af600, 0xc78f316000)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:1382 +0xc4
github.com/hashicorp/nomad/nomad.(*Worker).UpdateEval(0xc4b7388360, 0xc7f8fac160, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:388 +0x2c4
github.com/hashicorp/nomad/scheduler.setStatus(0xc4201c6230, 0x200f5c0, 0xc4b7388360, 0xc48207e6e0, 0x0, 0x0, 0x0, 0x15a03b7, 0x8, 0x0, ...)
/opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:455 +0x25e
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc89549a460, 0xc48207e6e0, 0xc4201c6230, 0x2017d20)
/opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:174 +0x925
github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc4b7388360, 0xc48207e6e0, 0xc5dae74480, 0x24, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:292 +0x460
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b7388360)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:127 +0x12a
created by github.com/hashicorp/nomad/nomad.NewWorker
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b
goroutine 143 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc6fa1f2100, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc48209b406, 0x1420740, 0xc6fb9da600, 0x24, 0x0, 0x0, 0x20af600, 0xc7e6c54260)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc6fb9da600, 0xc7e6c54260, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc42051c548, 0x13, 0x159a5af, 0x4, 0xc6d8863a10, 0x3, 0x3, 0x13c9200, 0x19, ...)
/usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc42051c548, 0x13, 0xc6d8863a10, 0x3, 0x3, 0x16, 0x16, 0x16)
/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc4201c04c0, 0xc4201c6460, 0xc7e6c54258, 0xc420533000, 0xc668376000, 0x1420740, 0xc6fb9da600, 0x16, 0x12903c0, 0xc7e6c54260, ...)
/usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc4201c6460, 0x200f540, 0xc7218f8500, 0x13c9200, 0xc6d8863b01)
/usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).RPC(0xc4201fa280, 0x15a3a15, 0xb, 0x1420740, 0xc6fb9da5a0, 0x12903c0, 0xc7e6c54250, 0x20af600, 0xc78f316000)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:1382 +0xc4
github.com/hashicorp/nomad/nomad.(*Worker).UpdateEval(0xc4b73883c0, 0xc72e188000, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:388 +0x2c4
github.com/hashicorp/nomad/scheduler.setStatus(0xc4201c6230, 0x200f5c0, 0xc4b73883c0, 0xc4820a0420, 0x0, 0x0, 0x0, 0x15a03b7, 0x8, 0x0, ...)
/opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:455 +0x25e
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc86fef8000, 0xc4820a0420, 0xc4201c6230, 0x2017d20)
/opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:174 +0x925
github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc4b73883c0, 0xc4820a0420, 0xc5d02c42a0, 0x24, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:292 +0x460
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b73883c0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:127 +0x12a
created by github.com/hashicorp/nomad/nomad.NewWorker
/opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b
goroutine 144 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*Server).monitorLeadership(0xc4201fa280)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:50 +0x10f
created by github.com/hashicorp/nomad/nomad.NewServer
/opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:371 +0x160d
goroutine 145 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*Server).serfEventHandler(0xc4201fa280)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/serf.go:28 +0xfe
created by github.com/hashicorp/nomad/nomad.NewServer
/opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:374 +0x1632
goroutine 146 [IO wait, 1 minutes]:
internal/poll.runtime_pollWait(0x7fd5bf1aeeb0, 0x72, 0xffffffffffffffff)
/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4201dd098, 0x72, 0xc6d84d4d00, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc4201dd098, 0xffffffffffffff00, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc4201dd080, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:335 +0x1e2
net.(*netFD).accept(0xc4201dd080, 0x0, 0x1, 0x1)
/usr/local/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc42051cdf0, 0xc6d84d4f18, 0x40712a, 0xc64d31c0c0)
/usr/local/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).Accept(0xc42051cdf0, 0xc64d31c0c0, 0x0, 0x3f800000, 0xc60c19c0c0)
/usr/local/go/src/net/tcpsock.go:247 +0x49
github.com/hashicorp/nomad/nomad.(*Server).listen(0xc4201fa280, 0x200f200, 0xc60c19c0c0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:77 +0xae
created by github.com/hashicorp/nomad/nomad.(*Server).startRPCListener
/opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:405 +0x92
goroutine 147 [select]:
github.com/hashicorp/nomad/nomad.(*EvalBroker).EmitStats(0xc4203303c0, 0x3b9aca00, 0xc4202a2720)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:818 +0x2f6
created by github.com/hashicorp/nomad/nomad.NewServer
/opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:380 +0x1685
goroutine 148 [select]:
github.com/hashicorp/nomad/nomad.(*PlanQueue).EmitStats(0xc4201c6410, 0x3b9aca00, 0xc4202a2720)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_queue.go:200 +0x123
created by github.com/hashicorp/nomad/nomad.NewServer
/opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:383 +0x16c7
goroutine 149 [select]:
github.com/hashicorp/nomad/nomad.(*BlockedEvals).EmitStats(0xc420134240, 0x3b9aca00, 0xc4202a2720)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:601 +0x123
created by github.com/hashicorp/nomad/nomad.NewServer
/opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:386 +0x1709
goroutine 150 [select]:
github.com/hashicorp/nomad/nomad.(*vaultClient).EmitStats(0xc420073380, 0x3b9aca00, 0xc4202a2720)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/vault.go:1210 +0x123
created by github.com/hashicorp/nomad/nomad.NewServer
/opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:389 +0x1750
goroutine 151 [select]:
github.com/hashicorp/nomad/nomad.(*Server).heartbeatStats(0xc4201fa280)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/heartbeat.go:169 +0x133
created by github.com/hashicorp/nomad/nomad.NewServer
/opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:392 +0x1775
goroutine 152 [IO wait, 16 minutes]:
internal/poll.runtime_pollWait(0x7fd5bf1aedf0, 0x72, 0xffffffffffffffff)
/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc6d73ed018, 0x72, 0xc6d88e8c00, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc6d73ed018, 0xffffffffffffff00, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc6d73ed000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:335 +0x1e2
net.(*netFD).accept(0xc6d73ed000, 0x4131b8, 0x30, 0x148af00)
/usr/local/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc6d72e1580, 0x6052ca, 0x148af00, 0xc66fb73b60)
/usr/local/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).Accept(0xc6d72e1580, 0xc4200140b8, 0x1363620, 0x1fe7390, 0x1563f60)
/usr/local/go/src/net/tcpsock.go:247 +0x49
net/http.(*Server).Serve(0xc6d5d51790, 0x200d600, 0xc6d72e1580, 0x0, 0x0)
/usr/local/go/src/net/http/server.go:2695 +0x1b2
net/http.Serve(0x200d600, 0xc6d72e1580, 0x2004740, 0xc66fb73a10, 0x0, 0x0)
/usr/local/go/src/net/http/server.go:2323 +0x73
github.com/hashicorp/nomad/command/agent.NewHTTPServer.func1(0xc60c19c100, 0x200d600, 0xc6d72e1580, 0xc6d8483a60, 0xc66fb73890)
/opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:111 +0x82
created by github.com/hashicorp/nomad/command/agent.NewHTTPServer
/opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:109 +0x40f
goroutine 1541 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc7052bc100, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc5e08d2b06, 0x1420740, 0xc76fedc600, 0x24, 0x0, 0x0, 0x20af600, 0xc81c73aaa8)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc76fedc600, 0xc81c73aaa8, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc718547d48, 0x3, 0x3, 0x13c9200, 0x19, ...)
/usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc718547d48, 0x3, 0x3, 0x16, 0x16, 0x16)
/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc81c73a648, 0xc6e473de80, 0xc8a14ee1e0, 0x1420740, 0xc76fedc600, 0x16, 0x12903c0, 0xc81c73aaa8, ...)
/usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc835e660c0, 0x0, 0x0)
/usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc653508000, 0xc5bdc80190)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf
goroutine 1622 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad/drainer.(*NodeDrainer).run(0xc6d73e3b30, 0x200f200, 0xc8528b7e00)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/drainer.go:218 +0x20a
created by github.com/hashicorp/nomad/nomad/drainer.(*NodeDrainer).SetEnabled
/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/drainer.go:189 +0xe7
goroutine 1487 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).replicateTo(0xc420252000, 0xc653c9f5c0, 0x74ab25, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:171 +0x82d
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).replicate(0xc420252000, 0xc653c9f5c0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:135 +0x274
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).startStopReplication.func1()
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/raft.go:451 +0x33
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc8a1651f20)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66
goroutine 694 [IO wait, 1 minutes]:
internal/poll.runtime_pollWait(0x7fd576803740, 0x72, 0x0)
/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4abac0918, 0x72, 0xffffffffffffff00, 0x2005240, 0x1ffc978)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc4abac0918, 0xc8610fd000, 0x1000, 0x1000)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4abac0900, 0xc8610fd000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:126 +0x18a
net.(*netFD).Read(0xc4abac0900, 0xc8610fd000, 0x1000, 0x1000, 0xc4201cc080, 0x3, 0xc420022c00)
/usr/local/go/src/net/fd_unix.go:202 +0x52
net.(*conn).Read(0xc555648010, 0xc8610fd000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/net.go:176 +0x6d
bufio.(*Reader).fill(0xc4b985e060)
/usr/local/go/src/bufio/bufio.go:97 +0x11a
bufio.(*Reader).ReadByte(0xc4b985e060, 0x3, 0xc8f8752000, 0xc420022c70)
/usr/local/go/src/bufio/bufio.go:239 +0x39
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*NetworkTransport).handleCommand(0xc4205184d0, 0xc4b985e060, 0xc4b985e0c0, 0xc7ff998070, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:515 +0x43
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*NetworkTransport).handleConn(0xc4205184d0, 0x200f200, 0xc4203af640, 0x20180e0, 0xc555648010)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:499 +0x243
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*NetworkTransport).listen
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:477 +0x1eb
goroutine 1486 [runnable]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).pipelineReplicate(0xc420252000, 0xc653c9f380, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:393 +0x49f
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).replicate(0xc420252000, 0xc653c9f380)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:152 +0x2b5
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).startStopReplication.func1()
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/raft.go:451 +0x33
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc8a1651f00)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66
goroutine 1520 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).AcceptStream(0xc64485e8f0, 0xc6ae70bd08, 0xc6ae70bcf8, 0x102eacf)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:207 +0xf2
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).Accept(0xc64485e8f0, 0x160cac8, 0xc4201fa280, 0x200f200, 0xc60c19c0c0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:197 +0x2b
github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x20180e0, 0xc68eafc010, 0xc5bdc800a0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:211 +0x2cd
github.com/hashicorp/nomad/nomad.(*Server).handleConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x20180e0, 0xc68eafc010, 0xc5bdc800a0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:137 +0x309
created by github.com/hashicorp/nomad/nomad.(*Server).listen
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:93 +0x15b
goroutine 1564 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot.(*Autopilot).serverHealthLoop(0xc5dd049f40)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot/autopilot.go:333 +0x163
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot.(*Autopilot).Start
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot/autopilot.go:81 +0x15e
goroutine 1621 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad/drainer.(*deadlineHeap).watch(0xc8528b7f00)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/drain_heap.go:61 +0x1fc
created by github.com/hashicorp/nomad/nomad/drainer.NewDeadlineHeap
/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/drain_heap.go:46 +0x130
goroutine 1583 [select]:
github.com/hashicorp/nomad/nomad.(*EvalBroker).waitForSchedulers(0xc4203303c0, 0xc6d88ebf30, 0x1, 0x1, 0xcb0b8fe180, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:483 +0x2c0
github.com/hashicorp/nomad/nomad.(*EvalBroker).Dequeue(0xc4203303c0, 0xc6d88ebf30, 0x1, 0x1, 0x3b9aca00, 0x0, 0x0, 0x0, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:348 +0x106
github.com/hashicorp/nomad/nomad.(*Server).reapFailedEvaluations(0xc4201fa280, 0xc91388c780)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:501 +0xb7
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:230 +0x3ef
goroutine 1582 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*Server).schedulePeriodic(0xc4201fa280, 0xc91388c780)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:454 +0x2c8
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:227 +0x3bd
goroutine 1561 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*netPipeline).decodeResponses(0xc8528b7000)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:667 +0x11b
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.newNetPipeline
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:658 +0x117
goroutine 1578 [chan receive, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc7f6944e00, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc80e136706, 0x1420740, 0xc8687ebb60, 0x24, 0x0, 0x0, 0x20af600, 0xc8f913a918)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc8687ebb60, 0xc8f913a918, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc94698fd48, 0x3, 0x3, 0x13c9200, 0x19, ...)
/usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc94698fd48, 0x3, 0x3, 0x16, 0x16, 0x16)
/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc8f913a598, 0xc6e473de80, 0xc8a62f3140, 0x1420740, 0xc8687ebb60, 0x16, 0x12903c0, 0xc8f913a918, ...)
/usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc83eba3480, 0x0, 0x0)
/usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc65436d500, 0xc5bdc80190)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf
goroutine 1489 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*EvaluatePool).run(0xc7a09d0210, 0xc8f530d9e0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:104 +0x1fb
created by github.com/hashicorp/nomad/nomad.NewEvaluatePool
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:49 +0x14f
goroutine 1604 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc70a052080, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc5d0446606, 0x1420740, 0xc707654120, 0x24, 0x0, 0x0, 0x20af600, 0xc8564aa428)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc707654120, 0xc8564aa428, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc6d8869d48, 0x3, 0x3, 0x13c9200, 0x19, ...)
/usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc6d8869d48, 0x3, 0x3, 0x16, 0x16, 0x16)
/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc81c73a6c8, 0xc6e473de80, 0xc8a3349540, 0x1420740, 0xc707654120, 0x16, 0x12903c0, 0xc8564aa428, ...)
/usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc798d04200, 0x0, 0x0)
/usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc653afc000, 0xc5bdc80190)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf
goroutine 1616 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc7052bc400, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc5e08d3706, 0x1420740, 0xc76fedccc0, 0x24, 0x0, 0x0, 0x20af600, 0xc81c73b658)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc76fedccc0, 0xc81c73b658, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc8386efd48, 0x3, 0x3, 0x13c9200, 0x19, ...)
/usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc8386efd48, 0x3, 0x3, 0x16, 0x16, 0x16)
/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc81c73b2c0, 0xc6e473de80, 0xc6a6305300, 0x1420740, 0xc76fedccc0, 0x16, 0x12903c0, 0xc81c73b658, ...)
/usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc85e7a6b40, 0x0, 0xc537843bc0)
/usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc537843bc0, 0xc5bdc80190)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf
goroutine 1585 [select]:
github.com/hashicorp/nomad/nomad.(*Server).periodicUnblockFailedEvals(0xc4201fa280, 0xc91388c780)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:577 +0x137
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:236 +0x453
goroutine 1635 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*EvaluatePool).run(0xc7a09d0210, 0xc8f530daa0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:104 +0x1fb
created by github.com/hashicorp/nomad/nomad.NewEvaluatePool
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:49 +0x14f
goroutine 1568 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*BlockedEvals).prune(0xc420134240)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:619 +0x141
created by github.com/hashicorp/nomad/nomad.(*BlockedEvals).SetEnabled
/opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:136 +0x93
goroutine 1588 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc70a350080, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc5d0466606, 0x1420740, 0xc94f79a300, 0x24, 0x0, 0x0, 0x20af600, 0xc82a976598)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc94f79a300, 0xc82a976598, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc8386f1d48, 0x3, 0x3, 0x13c9200, 0x19, ...)
/usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc8386f1d48, 0x3, 0x3, 0x16, 0x16, 0x16)
/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc8564aa0a8, 0xc6e473de80, 0xc8a62cf8c0, 0x1420740, 0xc94f79a300, 0x16, 0x12903c0, 0xc82a976598, ...)
/usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc82eb71700, 0x0, 0x0)
/usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc653728000, 0xc5bdc80190)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf
goroutine 1584 [select]:
github.com/hashicorp/nomad/nomad.(*BlockedEvals).GetDuplicates(0xc420134240, 0x3b9aca00, 0x0, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:553 +0x1c8
github.com/hashicorp/nomad/nomad.(*Server).reapDupBlockedEvaluations(0xc4201fa280, 0xc91388c780)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:546 +0x7a
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:233 +0x421
goroutine 1581 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*PeriodicDispatch).run(0xc42033b500, 0x200f200, 0xc64d0e0600)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/periodic.go:334 +0x22e
created by github.com/hashicorp/nomad/nomad.(*PeriodicDispatch).SetEnabled
/opt/gopath/src/github.com/hashicorp/nomad/nomad/periodic.go:175 +0x11c
goroutine 1619 [select, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.watchFew(0x200f200, 0xc8528b7e80, 0xc8386f5748, 0x20, 0x20, 0x0, 0x6)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch_few.go:16 +0x984
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.WatchSet.WatchCtx(0xc7a0b84000, 0x200f200, 0xc8528b7e80, 0x0, 0x74aa05)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch.go:82 +0x129
github.com/hashicorp/nomad/nomad/state.(*StateStore).BlockingQuery(0xc4203fda00, 0xc8386f5940, 0x74aa05, 0x200f200, 0xc8528b7e80, 0xc569e91528, 0xc8386f5930, 0xc8386f5950, 0xfc20df, 0xc701a28138)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:146 +0x19f
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).getJobAllocs(0xc701a280e0, 0x200f200, 0xc8528b7e80, 0x74aa05, 0x1, 0x1, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:423 +0xe2
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc701a280e0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:145 +0xd96
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:88 +0x1e0
goroutine 1615 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc7052bc300, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc5e08d3206, 0x1420740, 0xc76fedca20, 0x24, 0x0, 0x0, 0x20af600, 0xc81c73b0c8)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc76fedca20, 0xc81c73b0c8, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc6d891bd48, 0x3, 0x3, 0x13c9200, 0x19, ...)
/usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc6d891bd48, 0x3, 0x3, 0x16, 0x16, 0x16)
/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc81c73ad18, 0xc6e473de80, 0xc6a6304c80, 0x1420740, 0xc76fedca20, 0x16, 0x12903c0, 0xc81c73b0c8, ...)
/usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc85e7a67c0, 0x0, 0x0)
/usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc537843a40, 0xc5bdc80190)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf
goroutine 1575 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).heartbeat(0xc420252000, 0xc653c9f5c0, 0xc93dfcb4a0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:351 +0x550
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).replicate.func1()
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:118 +0x3c
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc8a62f2f20)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66
goroutine 1519 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).heartbeat(0xc420252000, 0xc653c9f380, 0xc93fd44000)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:340 +0x2cc
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).replicate.func1()
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:118 +0x3c
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc8a3349320)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66
goroutine 1603 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).keepalive(0xc64485e8f0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:308 +0x102
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.newSession
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:111 +0x455
goroutine 1567 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*BlockedEvals).watchCapacity(0xc420134240)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:415 +0x136
created by github.com/hashicorp/nomad/nomad.(*BlockedEvals).SetEnabled
/opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:135 +0x71
goroutine 1620 [select, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.watchFew(0x200f200, 0xc8528b7e00, 0xc9affeda70, 0x20, 0x20, 0x7fd5bf20c1e8, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch_few.go:16 +0x984
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.WatchSet.WatchCtx(0xc7a09d0390, 0x200f200, 0xc8528b7e00, 0xc7a09d0480, 0x749350)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch.go:82 +0x129
github.com/hashicorp/nomad/nomad/state.(*StateStore).BlockingQuery(0xc4203fda00, 0xc9affedc68, 0x749350, 0x200f200, 0xc8528b7e00, 0x561c2e, 0xc4201c6230, 0x2, 0xc5751d2440, 0x40)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:146 +0x19f
github.com/hashicorp/nomad/nomad/drainer.(*nodeDrainWatcher).getNodes(0xc8528b7ec0, 0x749350, 0x3b, 0xc6d8938de8, 0x1, 0x1)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_nodes.go:213 +0xd0
github.com/hashicorp/nomad/nomad/drainer.(*nodeDrainWatcher).watch(0xc8528b7ec0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_nodes.go:150 +0x7f3
created by github.com/hashicorp/nomad/nomad/drainer.NewNodeDrainWatcher
/opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_nodes.go:141 +0xd2
goroutine 1634 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*EvaluatePool).run(0xc7a09d0210, 0xc8f530da40)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:104 +0x1fb
created by github.com/hashicorp/nomad/nomad.NewEvaluatePool
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:49 +0x14f
goroutine 1565 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*PlanQueue).Dequeue(0xc4201c6410, 0x0, 0x0, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_queue.go:155 +0x162
github.com/hashicorp/nomad/nomad.(*Server).planApply(0xc4201fa280)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply.go:59 +0xbf
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:192 +0x20e
goroutine 1562 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).pipelineDecode(0xc420252000, 0xc653c9f380, 0x200c580, 0xc8528b7000, 0xc91388c960, 0xc91388c9c0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:448 +0x14f
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).pipelineReplicate.func1()
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:385 +0x57
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc8528b7040)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66
goroutine 1521 [IO wait]:
internal/poll.runtime_pollWait(0x7fd576803680, 0x72, 0x0)
/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc6e473c518, 0x72, 0xffffffffffffff00, 0x2005240, 0x1ffc978)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc6e473c518, 0xc9687fa000, 0x1000, 0x1000)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc6e473c500, 0xc9687fa000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:126 +0x18a
net.(*netFD).Read(0xc6e473c500, 0xc9687fa000, 0x1000, 0x1000, 0x0, 0x2a5, 0x0)
/usr/local/go/src/net/fd_unix.go:202 +0x52
net.(*conn).Read(0xc68eafc010, 0xc9687fa000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/net.go:176 +0x6d
bufio.(*Reader).Read(0xc7a08099e0, 0xc705d03a90, 0xc, 0xc, 0xc, 0x0, 0x0)
/usr/local/go/src/bufio/bufio.go:213 +0x30b
io.ReadAtLeast(0x1ffe600, 0xc7a08099e0, 0xc705d03a90, 0xc, 0xc, 0xc, 0xc, 0x0, 0x0)
/usr/local/go/src/io/io.go:309 +0x86
io.ReadFull(0x1ffe600, 0xc7a08099e0, 0xc705d03a90, 0xc, 0xc, 0x0, 0x0, 0x0)
/usr/local/go/src/io/io.go:327 +0x58
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recvLoop(0xc64485e8f0, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:451 +0xfc
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recv(0xc64485e8f0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:430 +0x2b
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.newSession
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:108 +0x3e4
goroutine 1602 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).send(0xc64485e8f0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:392 +0x2ca
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.newSession
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:109 +0x406
goroutine 1560 [chan receive, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc8fc1ca300, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc8973e9c8a, 0x12903c0, 0xc8973e9c00, 0x1, 0x1, 0x61, 0xc662b94b40, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Server).reconcileJobSummaries(0xc4201fa280, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:762 +0x1ad
github.com/hashicorp/nomad/nomad.(*Server).establishLeadership(0xc4201fa280, 0xc91388c780, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:260 +0x577
github.com/hashicorp/nomad/nomad.(*Server).leaderLoop(0xc4201fa280, 0xc91388c780)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:108 +0x54f
github.com/hashicorp/nomad/nomad.(*Server).monitorLeadership.func1(0xc6d87de040, 0xc4201fa280, 0xc91388c780)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:63 +0x5b
created by github.com/hashicorp/nomad/nomad.(*Server).monitorLeadership
/opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:61 +0x1f6
goroutine 1569 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*EvalBatcher).batcher(0xc7a0a104b0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/deploymentwatcher/batcher.go:63 +0x263
created by github.com/hashicorp/nomad/nomad/deploymentwatcher.NewEvalBatcher
/opt/gopath/src/github.com/hashicorp/nomad/nomad/deploymentwatcher/batcher.go:36 +0xd4
goroutine 1566 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*EvalBroker).runDelayedEvalsWatcher(0xc4203303c0, 0x200f200, 0xc8528b7d80)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:761 +0x227
created by github.com/hashicorp/nomad/nomad.(*EvalBroker).SetEnabled
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:170 +0xf0
goroutine 1618 [select, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.watchFew(0x200f200, 0xc8528b7dc0, 0xc420042cd8, 0x20, 0x20, 0x5, 0x1599762)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch_few.go:16 +0x984
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.WatchSet.WatchCtx(0xc99fd11dd0, 0x200f200, 0xc8528b7dc0, 0x20d3d40, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch.go:82 +0x129
github.com/hashicorp/nomad/nomad/state.(*StateStore).BlockingQuery(0xc4203fda00, 0xc946993ec8, 0x1, 0x200f200, 0xc8528b7dc0, 0x405913, 0xc6ec3e4120, 0x16d2250, 0xc6ec3e4001, 0x646263)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:146 +0x19f
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*Watcher).getDeploys(0xc6d73ecf80, 0x200f200, 0xc8528b7dc0, 0x1, 0xc71aac2060, 0xc6ec3e4000, 0xc8ed83e040, 0xc420252000, 0x20002, 0xc94d66bfa4)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/deploymentwatcher/deployments_watcher.go:176 +0x72
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*Watcher).watchDeployments(0xc6d73ecf80, 0x200f200, 0xc8528b7dc0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/deploymentwatcher/deployments_watcher.go:148 +0x231
created by github.com/hashicorp/nomad/nomad/deploymentwatcher.(*Watcher).SetEnabled
/opt/gopath/src/github.com/hashicorp/nomad/nomad/deploymentwatcher/deployments_watcher.go:121 +0xf2
goroutine 1563 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot.(*Autopilot).run(0xc5dd049f40)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot/autopilot.go:108 +0x1bb
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot.(*Autopilot).Start
/opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot/autopilot.go:80 +0x13c
goroutine 1738 [select]:
github.com/hashicorp/nomad/nomad.(*EvalBroker).waitForSchedulers.func1(0xc8750af020, 0xc99e3ec300, 0xc99e3ec240)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:469 +0xd1
created by github.com/hashicorp/nomad/nomad.(*EvalBroker).waitForSchedulers
/opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:468 +0x103
Yeah I don't want you sharing those if you're templating in secret data. Many people use the vault integration so secrets aren't leaked into job specs.
I think we may have reproduced a condition that can cause memory bloat under your use case. I'll share more detail tomorrow, but I wanted to get a reply to you today.
Sounds good, looking forward to the fix.
Did you see my note about your internal load testing?
Because it still seems that you're advertising a product for production workloads that you've never actually tested at scale yourselves, which is obviously pretty astounding.
As an observer, I feel for Miserlou's pain and frustration as he keeps tackling the various issues he is facing.
Though, @Miserlou, I have one thing to add: "You are lucky" ... Having HashiStack is "actually a good thing" ™️ 😁 as Nomad is extremely flexible yet adequately rigid.
I have been trying to convince folks to adopt HashiStack, though without much luck, mostly due to their off-the-radar reasons like:
- it is not well known (!)
- maintaining Nomad servers is an overhead (!!)
- we don't easily find people who know "Nomad" (vs. $alternative_stack ... whatever)
From personal experience, having raw_exec in addition to docker has helped solve a bunch of problems very easily.
All that said ... coming back to the issue at hand and some brainstorming ...
- Are your servers in an ASG with desired, min, and max set to a fixed value (3 in this case)?
- In the cross-linked issue, just restarting nomad may not help, as sometimes datadir corruption can occur and a datadir wipe may be needed as well.
- FWIW, I have my Nomad agents under systemd with a 10-second delay between restarts and a separate cron job which checks systemctl status nomad to determine if a datadir wipe is needed (sometimes the "invalid task bucket" error pops up); a sketch of that restart setup follows below.
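For reference, a minimal sketch of the systemd restart settings described above; the unit name, binary path, and config path are assumptions for illustration, not the exact unit in use:
[Unit]
Description=Nomad Agent
After=network-online.target

[Service]
# Binary and config locations are assumptions; adjust for your install.
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
# Restart on failure with a 10 second delay between attempts.
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target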
HTH, Regards, Shantanu
Before I dig into the details of the fix, I think it's important to highlight that you may still be misusing Nomad. Nomad is a workload scheduler and is suitable as a replacement for a classic job queue system in many applications. However, it is not designed to be a distributed durable queuing system such as Kafka, for example. Nomad must hold all pending Jobs in memory to schedule them; it currently has no swap-to-disk feature and will crash if you DOS it.
This can happen, for example, if your elastic client pool suddenly needs to expand in response to an unexpected burst in workload. While new nodes are coming online, Jobs are piling up in memory, waiting to be placed. I think an architecture where you have a feedback loop and some backpressure into whatever is sending dispatch requests would add some resilience to your deployment. Testing the upper bound of your deployment and using that data to watch metrics like nomad.blocked_evals.total_blocked to know when to rate limit would help. We're taking note of some things here to get ideas on how to improve Nomad for use cases such as this, but this is my recommendation as of today.
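As a rough illustration of that feedback loop, here is a minimal Go sketch that polls the agent's /v1/metrics endpoint and signals when dispatching should pause; the agent address, polling interval, and threshold are assumptions for the sketch, not recommendations:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// metricsSummary mirrors just the Gauges portion of the agent's
// /v1/metrics JSON payload; field names match the JSON keys.
type metricsSummary struct {
	Gauges []struct {
		Name  string
		Value float64
	}
}

// totalBlocked returns the blocked_evals.total_blocked gauge from the
// agent at addr. Matching by suffix tolerates a hostname segment that
// may or may not be embedded in the metric name.
func totalBlocked(addr string) (float64, error) {
	resp, err := http.Get(addr + "/v1/metrics")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var s metricsSummary
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return 0, err
	}
	for _, g := range s.Gauges {
		if strings.HasSuffix(g.Name, "blocked_evals.total_blocked") {
			return g.Value, nil
		}
	}
	return 0, nil
}

func main() {
	const maxBlocked = 100 // hypothetical threshold, tune from load testing
	for {
		blocked, err := totalBlocked("http://127.0.0.1:4646")
		switch {
		case err != nil:
			fmt.Println("metrics poll failed:", err)
		case blocked > maxBlocked:
			fmt.Println("backpressure: pause dispatching, blocked =", blocked)
		default:
			fmt.Println("ok to dispatch, blocked =", blocked)
		}
		time.Sleep(10 * time.Second)
	}
}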
As for the bug, you really already nailed it in #4422. Each Job in Nomad maintains a small overhead of memory to track the nomad.job_summary.* stats. This is all fine for users who are running service jobs, for example, where there is a low cardinality of jobs with high counts. It is, however, problematic for use cases where there is an arbitrarily large number of single-count jobs, since this overhead can add up significantly over 10s/100s of thousands of jobs. I tested this locally by dispatching 5k allocs to a nomad cluster and observed well over double the memory usage of the leader compared to the followers:
After profiling the heap, it was apparent this was due to publishing metrics:
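For anyone reproducing this, a top-style heap summary like the one below can be pulled from the agent's pprof endpoint when debug is enabled; the exact invocation here is an assumption:
$ go tool pprof -top http://127.0.0.1:4646/debug/pprof/heap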
Showing nodes accounting for 88.87MB, 60.27% of 147.45MB total
Dropped 46 nodes (cum <= 0.74MB)
Showing top 10 nodes out of 84
flat flat% sum% cum cum%
1MB 0.68% 0.68% 93.37MB 63.32% github.com/hashicorp/nomad/nomad.(*Server).publishJobSummaryMetrics
24.50MB 16.62% 17.30% 92.37MB 62.64% github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*Metrics).SetGaugeWithLabels
0 0% 17.30% 92.37MB 62.64% github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.SetGaugeWithLabels
35.86MB 24.32% 41.61% 67.86MB 46.02% github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*InmemSink).SetGaugeWithLabels
0 0% 41.61% 50.51MB 34.25% github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1
I'm introducing a couple of new flags to the telemetry config stanza to help with this:
- prefix_filter - This is modeled after the option in Consul for defining prefixes to include and exclude from metrics.
- disable_dispatched_job_summary_metrics - This long-winded option disables registration of job_summary metrics entirely if the job has been dispatched from a parameterized job.
Specifically, disable_dispatched_job_summary_metrics is what you want.
It should allow you to turn telemetry back on, as it will filter out all of the job_summary stats for your dispatched jobs.
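For illustration, a minimal sketch of what that stanza could look like once the flags land; the prefix value here is an assumption:
telemetry {
  # Drop job_summary metrics for jobs dispatched from parameterized jobs.
  disable_dispatched_job_summary_metrics = true

  # Or filter by prefix: "-" excludes a prefix, "+" includes one.
  prefix_filter = ["-nomad.job_summary"]
}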
When doing this with my local cluster I was able to reduce the memory usage of my leader by over 40%.
I'm working to get this merged into our 0.8.7 branch so we can get this out ASAP. That PR is here: #4878
Hey Nick, big thanks for the deep dive on this. We've added a limiter for the number of dispatched jobs, which are all stored in a database anyway, so hopefully that gives us a safety buffer.
We are now well aware that Nomad isn't a suitable tool for our needs - however, we got this impression because you are lying to people about Nomad's capabilities, both online and at conferences. Dispatch jobs are a hack, and it's dishonest to advertise them as a replacement for queues - the title of your blog post is literally "Replacing Queues with Nomad Dispatch" - without any mention of these crippling limitations. I'm not even going to start on the inability to handle job timeouts, the lack of forward-looking prioritized job scheduling, the lack of constraints across dispatched jobs, the lack of recommendations for server:job ratios, the lack of load testing results, the lack of clients publicly using the stack, etc.
I also don't think that disable_dispatched_job_summary_metrics is a great solution to this problem (at least as I understand it) - a good solution would be to give us actually usable dispatch job metrics. As far as I can see, you've left the problem in place, so the next person who gets duped by your marketing materials is going to fall into the same trap until they find this ticket and realize that there's a "make it suck less" argument that they need to set.
Anyway, I gotta say I really do appreciate you taking the time to finally address this issue and I'm really looking forward to the patch landing, which will likely be a big help to us.
Yep, fell into the same trap. I've been fighting with the dispatch jobs for the last week.
My current idea is to simply schedule consumers with nomad and leave the actual job metadata in a queue from which they will pull the tasks. Sadly, this means completely re-engineering the entire architecture.
Maybe I'll try the backpressure trick as mentioned by @Miserlou (store jobs in a queue and have nomad periodically fetch some and dispatch them). With an aggressive GC this might somewhat work.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
You know the drill by now - use Nomad for a heavy workload, come back from the weekend to find that the entire system has completely collapsed because the lead server is using 100% CPU and won't accept any connections, even to nomad status itself. Classic.