hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.95k stars 1.96k forks source link

Nomad Lead Server at 100% CPU Usage Leading to Total System Death #4864

Closed Miserlou closed 6 years ago

Miserlou commented 6 years ago

screen shot 2018-11-12 at 9 23 12 am

You know the drill by now - use Nomad for a heavy workload, come back from the weekend to find that the entire system has completely collapsed because the lead server is using 100% CPU and won't accept any connections, even to nomad status itself. Classic.

nickethier commented 6 years ago

Hey @Miserlou sorry you're experiencing issues again.

Please share the logs for this node as well as other server nodes spanning this incident. Debug level would be especially helpful. Could we also get your nomad server configuration if possible and the version of nomad you're running. As for connections, can you provide the open file handles and what your file handle limit is. Further, can you enable debug = true in your config and provide a profile of Nomad:

go tool pprof -proto -output=prof.proto http://127.0.0.1:4646/debug/pprof/profile?seconds=30

What is the the state of the system during the screenshot. It appears Nomad only has a few seconds of process time. Is something restarting nomad? Also is one process a child of another or are there actually two nomad agents running on the node? I know some details are spread across a few of your other tickets but a summary of your deployment architecture here would be helpful. How are the servers deployed, what are the machine sizes, rate of the incoming dispatch requests etc. Additionally could you elaborate as to your usecase of Nomad.

I understand the frustration of coming in first thing to a crashed system, but there's really nothing we can do with the information you've provided in the initial ticket. Please be as detailed as possible so we can help as best we can. In the future please follow the issue template when opening issues.

Miserlou commented 6 years ago
$ go tool pprof -proto -output=prof.proto http://127.0.0.1:4646/debug/pprof/profile?seconds=30
Fetching profile from http://127.0.0.1:4646/debug/pprof/profile?seconds=30
Please wait... (30s)
parsing profile: unrecognized profile format

Viewing the UI directly gives this:

screen shot 2018-11-13 at 12 19 18 pm

This looks like it was the crash:

screen shot 2018-11-13 at 12 31 02 pm

I can't enable debug now since this is a prod system and it's already happened. File max is at 3203524. The SS is after a manual restart, before that it was about 3 days, I don't know how long it had been frozen. By the throughput, I'm guessing it was for most of that time. The server machines are m5.2xlarge (4x), the workers are x1.32xlarge(1-10x elastic).

This time it looks like logging stopped, as I can find no actionable logs. In the working directory there is a 226M raft.db and a 3.0GB state.bin. The server config is trivial:

log_level = "WARN"

# Setup data dir
data_dir = "/tmp/nomad_server1"

# Enable the server
server {
    enabled = true

    # Only 1 for the lead server, 3 for the others.
    bootstrap_expect = 1
}

consul {
  auto_advertise      = false
  server_auto_join    = false
  client_auto_join    = false
}

Obviously, this isn't the first time that Nomad server failure has called total system collapse: https://github.com/hashicorp/nomad/issues/4323 although this time the outage cost us about $4,000 as it happened undetected over a weekend. I'm responsible for maintaining a system which is completely incapable of doing what it claims. The decision to use this technology was made before I arrived. It is too late to turn around. Every day is hell for me. I feel like the victim of fraud who has no choice but to wake up everyday and be defrauded over and over again.

Miserlou commented 6 years ago

As a reminder, when this problem occurs, it takes down all servers in the cluster.

Restarting the server manually appears to cause a No cluster leader problem, so it doesn't look like this will even be solvable by a monit based solution as the cluster is completely unrecoverable once this occurs. I'd like to try trashing the working directories and starting over but I'll keep them for now in case I can extract information which would be useful for you out of them.

root@ip-10-0-233-82:/home/ubuntu# nomad agent -config server.hcl
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from server.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

                Client: false
             Log Level: WARN
                Region: global (DC: dc1)
                Server: true
               Version: 0.8.3

==> Nomad agent started! Log data will stream in below:

    2018/11/13 17:42:54 [WARN] snapshot: Found temporary snapshot: 112-7593582-1541858894814.tmp
    2018/11/13 17:45:02 [WARN] serf: Failed to re-join any previously known node
    2018/11/13 17:45:03 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2018/11/13 17:45:03 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:03 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:05 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:05 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:05 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:07 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:07 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:07 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:08 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:10 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:10 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:10 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:11 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:11 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:11 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:12.427247 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:12.505918 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:12.572498 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:12.607766 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:12.628045 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:12.655025 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:12.817723 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:12.857453 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:13 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:13 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:13 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:14 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:14 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:14 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:16 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:16 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:16 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:17.532251 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:17.751311 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:17.770038 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:17.803579 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:17.836089 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:17.872673 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:17 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:17 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:17 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:17.938675 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:18.027099 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:19 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:19 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:19 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:20 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:20 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:20 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:21 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:21 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:21 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:22.972693 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:23 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:23 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:23 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:23.095370 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:23.142658 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:23.143297 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:23.194273 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:23.321416 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:23.372765 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:23.415345 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:24 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:24 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:24 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:26 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:26 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:26 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:27 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:27 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:27 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:28 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:28 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:28 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:29.402827 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:29.424376 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:29.454753 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:29.558519 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:29.561747 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:29.637734 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:29.661212 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:29.700500 [ERR] worker: failed to dequeue evaluation: No cluster leader
    2018/11/13 17:45:30 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:30 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:30 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
    2018/11/13 17:45:31 [WARN] raft: Election timeout reached, restarting election
    2018/11/13 17:45:31 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 17:45:31 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.122.116:4647 10.0.122.116:4647}: dial tcp 10.0.122.116:4647: getsockopt: connection refused
etc

Restarting all of the other servers gives just:

    2018/11/13 17:51:10.824468 [ERR] worker: failed to dequeue evaluation: eval broker disabled
    2018/11/13 17:51:11.568314 [ERR] worker: failed to dequeue evaluation: eval broker disabled
etc

I have asked this so many times anybody has ever answered me - how do you load test Nomad? It seems like this is the sort of thing that would have been discovered if you did any serious workload on Nomad at all.

Miserlou commented 6 years ago

After about 15 minutes it actually did start an election again, but one of the agents simply died in the process for no reason and so everything failed:

    2018/11/13 17:52:54.243597 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2018/11/13 17:53:04.001945 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2018/11/13 17:53:04.020823 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2018/11/13 17:53:04.053997 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2018/11/13 17:53:04.174553 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2018/11/13 17:53:04.174625 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2018/11/13 17:53:04.182779 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2018/11/13 17:53:04.190254 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2018/11/13 17:53:04.244400 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2018/11/13 17:54:00 [WARN] raft: Heartbeat timeout from "10.0.233.82:4647" reached, starting election
    2018/11/13 17:54:27 [WARN] raft: Heartbeat timeout from "10.0.233.82:4647" reached, starting election
    2018/11/13 17:54:43 [WARN] memberlist: Refuting a suspect message (from: ip-10-0-56-134.global)
    2018/11/13 17:54:43 [ERR] raft-net: Failed to flush response: write tcp 10.0.56.134:4647->10.0.233.82:49250: write: broken pipe
    2018/11/13 17:54:45 [WARN] raft: Election timeout reached, restarting election
Killed
nickethier commented 6 years ago

Hey @Miserlou I'm working on going through this info. What immediately sticks out at me is in your server config you have:

# Only 1 for the lead server, 3 for the others.
    bootstrap_expect = 1

Does this mean this value is set to 3 on other servers and if so why?

Additionally it looks like you're running 4 servers?

The server machines are m5.2xlarge (4x)

This shouldn't be an issue, but 3 servers will actually yield better performance without sacrificing any reliability. You can only loose 1 server in either scenario to maintain quorum but 3 servers means one less participant in raft.

When you say that your workers (I assume nomad clients) are elastic, does this mean they're managed by an autoscaling group. If so do you happen to have a graph of the number of clients during the time of the crash/lockup as well as a graph of the NomadQueueLength you posted in #4323? Thanks.

Miserlou commented 6 years ago

Sorry, it is actually 3 servers, not 4.

bootstrap_expect is set to 3 on the others:

# Enable the server
server {
  enabled = true

  # Self-elect, should be 3 or 5 for production
  bootstrap_expect = 3

  # This is the IP address of the first server we provisioned
  retry_join = ["X.X.X.X:4648"]
}

This is because it says to in the documentation.

Queue length looks like you'd expect. screen shot 2018-11-13 at 1 21 28 pm

I can't seem to find a metric for how many clients were online at the time of the crash. It was probably between 0 and 20. =/

Miserlou commented 6 years ago

Restarting keeps causing that same Killed message on the 3rd server client. Not sure if that's related or a separate ticket:

    2018/11/13 18:20:32 [WARN] memberlist: Refuting a suspect message (from: ip-10-0-233-82.global)
    2018/11/13 18:20:49 [WARN] raft: Heartbeat timeout from "10.0.122.116:4647" reached, starting election
    2018/11/13 18:21:38 [ERR] memberlist: Push/Pull with ip-10-0-233-82.global failed: write tcp 10.0.56.134:56068->10.0.233.82:4648: i/o timeout
    2018/11/13 18:21:59 [ERR] memberlist: Failed to send TCP ack: write tcp 10.0.56.134:4648->10.0.233.82:46372: i/o timeout from=10.0.233.82:46372
    2018/11/13 18:22:39 [ERR] yamux: keepalive failed: session shutdown
Killed
nickethier commented 6 years ago

bootstrap_expect should be the same on all nodes. From the docs:

A value of 1 does not provide any fault tolerance and is not recommended for production use cases.

Additionally enabling debug mode and setting the logging level to DEBUG would really help if this happened again. Debug mode doesn't affect the cluster until you attach the debugger to it then it is a very small amount of performance overhead. Debug logging will produce additional logs, but otherwise won't impact the server and is safe to do in production.

There are several threads to pull at here. If you could provide the logs from the weekend, perhaps via dropbox or S3 if they're too large to share here. If you'd be willing to let us look at the raft.db and state.bin files that could help as well.

That Killed message literally means something sent a SIGKILL to the process, its not from Nomad.

nickethier commented 6 years ago

Would you share a graph of the memory usage as well? Any chance there are OOM killer events in dmesg?

Miserlou commented 6 years ago

Actually that time it happened on two servers:

    2018/11/13 18:23:11 [ERR] raft: peer {Voter 10.0.233.82:4647 10.0.233.82:4647} has newer term, stopping replication
    2018/11/13 18:23:55 [WARN] memberlist: Refuting a suspect message (from: ip-10-0-233-82.global)
    2018/11/13 18:25:44 [ERR] memberlist: Failed to send TCP ack: write tcp 10.0.122.116:4648->10.0.233.82:34610: i/o timeout from=10.0.233.82:34610
    2018/11/13 18:25:51 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2018/11/13 18:25:53 [ERR] memberlist: Failed to push local state: write tcp 10.0.122.116:4648->10.0.233.82:34512: i/o timeout from=10.0.233.82:34512
    2018/11/13 18:26:05 [WARN] memberlist: Refuting a suspect message (from: ip-10-0-233-82.global)
    2018/11/13 18:26:25 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.56.134:4647 10.0.56.134:4647}: dial tcp 10.0.56.134:4647: getsockopt: connection refused
    2018/11/13 18:27:21 [ERR] memberlist: Failed to send TCP ack: write tcp 10.0.122.116:4648->10.0.233.82:34920: i/o timeout from=10.0.233.82:34920
    2018/11/13 18:27:21 [WARN] memberlist: Refuting a suspect message (from: ip-10-0-233-82.global)
Killed

It's way, way, way too noisy to use DEBUG on our production systems - especially when this particular error occurs, it generated gigabytes of log output in a few seconds.

It does look like those Killed messages are system OOMs:

[3636251.619735] Out of memory: Kill process 3870 (nomad) score 944 or sacrifice child
[3636251.622470] Killed process 3870 (nomad) total-vm:32370228kB, anon-rss:31232628kB, file-rss:0kB

CWL doesn't automatically track memory usage over time (AFAIK), so I can't give you that graph.

"Throw more memory at it" was our solution to this problem last time but I have to say that I'm very unhappy with that as we still have the problem, only now it's more expensive.

nickethier commented 6 years ago

Progress! So Nomad is running out of memory. Now lets try a few things to figure out why. First if you are able to share the state.bin and raft.db files that would help.

Try grepping for this in your server logs of the incident:

attempting to apply large raft entry

Also could you share the job you're dispatching?

nickethier commented 6 years ago

Do you have graphs for all of Nomad's metrics? Seeing the following would be really helpful:

Miserlou commented 6 years ago

We have lots of different jobs, many of them are in this directory: https://github.com/AlexsLemonade/refinebio/blob/dev/workers/nomad-job-specs/salmon.nomad.tpl - that's a template, it ends up being about 70 job specs total at registration time.

I'm not sure about sharing the files because a) they're very big and b) I don't know if they're safe to share. What information do they contain? Obviously I want to make sure that they don't contain any keys or environment variables that are passed around.

May plan is to just implement a governor to put a ceiling on the number of dispatch jobs at once and add monit to restart server processes that fail. However, I am not confident that it will be enough to get around the underlying failure.

How do I get those values you requested? We have disabled telemetry due to it being significantly worse than useless for dispatch jobs.

Miserlou commented 6 years ago
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x15b4a85, 0x16)
    /usr/local/go/src/runtime/panic.go:605 +0x95
runtime.sysMap(0xcb294c0000, 0xbf40000, 0x20bb200, 0x20d4978)
    /usr/local/go/src/runtime/mem_linux.go:216 +0x1d0
runtime.(*mheap).sysAlloc(0x20baa60, 0xbf40000, 0x7fd55913f428)
    /usr/local/go/src/runtime/malloc.go:470 +0xd7
runtime.(*mheap).grow(0x20baa60, 0x5fa0, 0x0)
    /usr/local/go/src/runtime/mheap.go:887 +0x60
runtime.(*mheap).allocSpanLocked(0x20baa60, 0x5fa0, 0x20d4988, 0x7fd5590f9518)
    /usr/local/go/src/runtime/mheap.go:800 +0x334
runtime.(*mheap).alloc_m(0x20baa60, 0x5fa0, 0xffffffffffff0100, 0x7fd5b5ffadf8)
    /usr/local/go/src/runtime/mheap.go:666 +0x118
runtime.(*mheap).alloc.func1()
    /usr/local/go/src/runtime/mheap.go:733 +0x4d
runtime.systemstack(0x7fd5b5ffae10)
    /usr/local/go/src/runtime/asm_amd64.s:360 +0xab
runtime.(*mheap).alloc(0x20baa60, 0x5fa0, 0x7fd5b5010100, 0x4168b4)
    /usr/local/go/src/runtime/mheap.go:732 +0xa1
runtime.largeAlloc(0xbf40000, 0x7fd5bf200001, 0xcb2928d209)
    /usr/local/go/src/runtime/malloc.go:827 +0x98
runtime.mallocgc.func1()
    /usr/local/go/src/runtime/malloc.go:722 +0x46
runtime.systemstack(0xc420024000)
    /usr/local/go/src/runtime/asm_amd64.s:344 +0x79
runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1135

goroutine 1682 [running]:
runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:298 fp=0xc6d87f0898 sp=0xc6d87f0890 pc=0x45c440
runtime.mallocgc(0xbf40000, 0x1455380, 0x1ffe701, 0x3)
    /usr/local/go/src/runtime/malloc.go:721 +0x7ae fp=0xc6d87f0940 sp=0xc6d87f0898 pc=0x412f2e
runtime.newarray(0x1455380, 0x44000, 0xc42001e070)
    /usr/local/go/src/runtime/malloc.go:853 +0x60 fp=0xc6d87f0970 sp=0xc6d87f0940 pc=0x413290
runtime.makeBucketArray(0x137a960, 0xc6d87f0a12, 0xc60daca340, 0xc5e0368180)
    /usr/local/go/src/runtime/hashmap.go:927 +0xf5 fp=0xc6d87f09b8 sp=0xc6d87f0970 pc=0x40c1d5
runtime.hashGrow(0x137a960, 0xc8cfdd9a10)
    /usr/local/go/src/runtime/hashmap.go:951 +0xa3 fp=0xc6d87f0a18 sp=0xc6d87f09b8 pc=0x40c393
runtime.mapassign_faststr(0x137a960, 0xc8cfdd9a10, 0xcb29406fc0, 0x6f, 0xcb293ffd80)
    /usr/local/go/src/runtime/hashmap_fast.go:666 +0x28d fp=0xc6d87f0a98 sp=0xc6d87f0a18 pc=0x40eedd
github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*InmemSink).SetGaugeWithLabels(0xc4201c6190, 0xcb29408ba0, 0x4, 0x6, 0xcb00000000, 0xcb293ffd80, 0x3, 0x4)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/inmem.go:167 +0x138 fp=0xc6d87f0b70 sp=0xc6d87f0a98 pc=0x7756d8
github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*Metrics).SetGaugeWithLabels(0xc4205180b0, 0xc775439110, 0x4, 0x6, 0xc600000000, 0xc865eff2c0, 0x3, 0x4)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/metrics.go:41 +0x345 fp=0xc6d87f0c78 sp=0xc6d87f0b70 pc=0x779685
github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.SetGaugeWithLabels(0xc775439110, 0x3, 0x3, 0xca00000000, 0xc865eff2c0, 0x2, 0x2)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/start.go:96 +0xce fp=0xc6d87f0cd8 sp=0xc6d87f0c78 pc=0x77d7ae
github.com/hashicorp/nomad/nomad.(*Server).publishJobSummaryMetrics(0xc4201fa280, 0xc91388c780)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:634 +0xeb7 fp=0xc6d87f0fd0 sp=0xc6d87f0cd8 pc=0x100e227
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc6d87f0fd8 sp=0xc6d87f0fd0 pc=0x45f071
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:239 +0x485

goroutine 1 [select, 16 minutes]:
github.com/hashicorp/nomad/command/agent.(*Command).handleSignals(0xc4201e76c0, 0x160c940)
    /opt/gopath/src/github.com/hashicorp/nomad/command/agent/command.go:562 +0x22d
github.com/hashicorp/nomad/command/agent.(*Command).Run(0xc4201e76c0, 0xc4200100e0, 0x2, 0x2, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/command/agent/command.go:551 +0xd45
github.com/hashicorp/nomad/vendor/github.com/mitchellh/cli.(*CLI).Run(0xc4201f8640, 0xc4201f8640, 0xc4201bc3a0, 0x37)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/mitchellh/cli/cli.go:255 +0x1eb
main.RunCustom(0xc4200100d0, 0x3, 0x3, 0xc4204f5f60)
    /opt/gopath/src/github.com/hashicorp/nomad/main.go:127 +0x4c9
main.Run(0xc4200100d0, 0x3, 0x3, 0x11ce214)
    /opt/gopath/src/github.com/hashicorp/nomad/main.go:72 +0x3f
main.main()
    /opt/gopath/src/github.com/hashicorp/nomad/main.go:68 +0x63

goroutine 5 [syscall, 18 minutes]:
os/signal.signal_recv(0x0)
    /usr/local/go/src/runtime/sigqueue.go:131 +0xa6
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.0
    /usr/local/go/src/os/signal/signal_unix.go:28 +0x41

goroutine 82 [select]:
github.com/hashicorp/nomad/helper/pool.(*ConnPool).reap(0xc42033b3e0)
    /opt/gopath/src/github.com/hashicorp/nomad/helper/pool/pool.go:439 +0x37b
created by github.com/hashicorp/nomad/helper/pool.NewPool
    /opt/gopath/src/github.com/hashicorp/nomad/helper/pool/pool.go:170 +0x179

goroutine 32 [select]:
github.com/hashicorp/nomad/command/agent/consul.checkConsulTLSSkipVerify(0x200f200, 0xc4201cc040, 0xc4201c6230, 0x2017de0, 0xc4200d7c60, 0xc42001a180)
    /opt/gopath/src/github.com/hashicorp/nomad/command/agent/consul/version_checker.go:42 +0x1ba
created by github.com/hashicorp/nomad/command/agent/consul.(*ServiceClient).Run
    /opt/gopath/src/github.com/hashicorp/nomad/command/agent/consul/client.go:287 +0x12c

goroutine 16 [select, 18 minutes]:
github.com/hashicorp/nomad/command/agent/consul.(*ServiceClient).Run(0xc4201d43c0)
    /opt/gopath/src/github.com/hashicorp/nomad/command/agent/consul/client.go:293 +0x26a
created by github.com/hashicorp/nomad/command/agent.(*Agent).setupConsul
    /opt/gopath/src/github.com/hashicorp/nomad/command/agent/agent.go:950 +0x145

goroutine 15 [sleep]:
time.Sleep(0x3b9aca00)
    /usr/local/go/src/runtime/time.go:65 +0x130
github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*Metrics).collectStats(0xc4205180b0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/metrics.go:168 +0x2f
created by github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.New
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/start.go:75 +0x12b

goroutine 14 [select, 18 minutes]:
github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*InmemSignal).run(0xc4201c0280)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/inmem_signal.go:64 +0xf8
created by github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.NewInmemSignal
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics/inmem_signal.go:38 +0x162

goroutine 13 [select, 16 minutes, locked to thread]:
runtime.gopark(0x1610910, 0x0, 0x159dab6, 0x6, 0x18, 0x1)
    /usr/local/go/src/runtime/proc.go:287 +0x12c
runtime.selectgo(0xc4204a2f50, 0xc4202a2180)
    /usr/local/go/src/runtime/select.go:395 +0x1149
runtime.ensureSigM.func1()
    /usr/local/go/src/runtime/signal_unix.go:511 +0x220
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2337 +0x1

goroutine 1488 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*EvaluatePool).run(0xc7a09d0210, 0xc8f530d980)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:104 +0x1fb
created by github.com/hashicorp/nomad/nomad.NewEvaluatePool
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:49 +0x14f

goroutine 83 [select, 6 minutes]:
github.com/hashicorp/nomad/nomad.(*RaftLayer).Accept(0xc4201c6500, 0x160ddc0, 0xc4205184d0, 0x200f200, 0xc4203af640)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/raft_rpc.go:63 +0xf8
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*NetworkTransport).listen(0xc4205184d0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:466 +0x49
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.NewNetworkTransportWithConfig
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:166 +0x1c1

goroutine 87 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).leaderLoop(0xc420252000)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/raft.go:500 +0x458
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runLeader(0xc420252000)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/raft.go:420 +0x3f4
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).run(0xc420252000)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/raft.go:140 +0x67
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.run)-fm()
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/api.go:505 +0x2a
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc6d8483140)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

goroutine 88 [running]:
    goroutine running on other thread; stack unavailable
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

goroutine 89 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runSnapshots(0xc420252000)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/snapshot.go:71 +0x166
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.runSnapshots)-fm()
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/api.go:507 +0x2a
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc6d8483160)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

goroutine 90 [select, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*serfQueries).stream(0xc66fa50a50)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/internal_query.go:80 +0xf7
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.newSerfQueries
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/internal_query.go:73 +0xd1

goroutine 91 [runnable]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Snapshotter).stream(0xc6d5d51380)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/snapshot.go:189 +0x248
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.NewSnapshotter
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/snapshot.go:131 +0x5c3

goroutine 92 [IO wait]:
internal/poll.runtime_pollWait(0x7fd5bf1aed30, 0x72, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc6d73ecd18, 0x72, 0xc6d88d5d00, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc6d73ecd18, 0xffffffffffffff00, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc6d73ecd00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:335 +0x1e2
net.(*netFD).accept(0xc6d73ecd00, 0xc7623b2000, 0x0, 0x0)
    /usr/local/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc6d72e1358, 0xf4a324, 0x45b7b0, 0xc6d88d5f90)
    /usr/local/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).AcceptTCP(0xc6d72e1358, 0x160dd78, 0xc6d73e3950, 0xc7623b2000)
    /usr/local/go/src/net/tcpsock.go:234 +0x49
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).tcpListen(0xc6d73e3950)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/net.go:204 +0x33
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.newMemberlist
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/memberlist.go:144 +0x7a9

goroutine 93 [IO wait]:
internal/poll.runtime_pollWait(0x7fd5bf1aef70, 0x72, 0x0)
    /usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc6d73ecd98, 0x72, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc6d73ecd98, 0xcaca0c2000, 0x10000, 0x10000)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).ReadFrom(0xc6d73ecd80, 0xcaca0c2000, 0x10000, 0x10000, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:170 +0x19a
net.(*netFD).readFrom(0xc6d73ecd80, 0xcaca0c2000, 0x10000, 0x10000, 0xc6d73e3950, 0xc768cfa000, 0x96, 0x10000, 0x412a57)
    /usr/local/go/src/net/fd_unix.go:208 +0x5e
net.(*UDPConn).readFrom(0xc6d72e1360, 0xcaca0c2000, 0x10000, 0x10000, 0xc6d8879e30, 0xc6d8879e88, 0x101, 0xc6d8879ea8)
    /usr/local/go/src/net/udpsock_posix.go:47 +0x6a
net.(*UDPConn).ReadFrom(0xc6d72e1360, 0xcaca0c2000, 0x10000, 0x10000, 0x10000, 0x10000, 0x695f, 0x1066112c0f8, 0x20af600)
    /usr/local/go/src/net/udpsock.go:118 +0x72
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).udpListen(0xc6d73e3950)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/net.go:302 +0x12b
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.newMemberlist
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/memberlist.go:145 +0x7ce

goroutine 94 [select, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).udpHandler(0xc6d73e3950)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/net.go:400 +0x11a
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.newMemberlist
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/memberlist.go:146 +0x7f3

goroutine 95 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).triggerFunc(0xc6d73e3950, 0x12a05f200, 0xc4b7713f80, 0xc64d314060, 0xc6d8483690)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:116 +0x1d0
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).schedule
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:82 +0x3fb

goroutine 96 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).pushPullTrigger(0xc6d73e3950, 0xc64d314060)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:143 +0x21b
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).schedule
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:88 +0x349

goroutine 97 [runnable]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).triggerFunc(0xc6d73e3950, 0x1dcd6500, 0xc4b7388000, 0xc64d314060, 0xc6d84836a0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:116 +0x1d0
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist.(*Memberlist).schedule
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/memberlist/state.go:94 +0x1f3

goroutine 130 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Serf).handleReap(0xc57d243800)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:1406 +0x168
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.Create
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:407 +0xa36

goroutine 131 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Serf).handleReconnect(0xc57d243800)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:1424 +0x115
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.Create
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:408 +0xa5b

goroutine 132 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Serf).checkQueueDepth(0xc57d243800, 0x159ca84, 0x6, 0xc66fb45020)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:1523 +0x108
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.Create
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:409 +0xa9e

goroutine 133 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Serf).checkQueueDepth(0xc57d243800, 0x159b5a3, 0x5, 0xc66fb450e0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:1523 +0x108
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.Create
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:410 +0xae4

goroutine 134 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.(*Serf).checkQueueDepth(0xc57d243800, 0x159b76f, 0x5, 0xc66fb45110)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:1523 +0x108
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf.Create
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/serf/serf/serf.go:411 +0xb2a

goroutine 136 [semacquire]:
sync.runtime_notifyListWait(0xc60c171d90, 0x0)
    /usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171d80)
    /usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b7388060)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b7388060, 0x1dcd6500, 0x24, 0xc5b9ebf440, 0x24, 0x1, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b7388060)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b

goroutine 137 [semacquire]:
sync.runtime_notifyListWait(0xc60c171dd0, 0x0)
    /usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171dc0)
    /usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b7388120)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b7388120, 0x1dcd6500, 0x24, 0xc5b13bc630, 0x24, 0x1, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b7388120)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b

goroutine 138 [semacquire]:
sync.runtime_notifyListWait(0xc60c171e10, 0x0)
    /usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171e00)
    /usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b7388180)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b7388180, 0x1dcd6500, 0x24, 0xc5a951a060, 0x24, 0x1, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b7388180)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b

goroutine 139 [semacquire]:
sync.runtime_notifyListWait(0xc60c171ed0, 0x0)
    /usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171ec0)
    /usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b73881e0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b73881e0, 0x1dcd6500, 0x24, 0xc5aa3a0420, 0x24, 0x1, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b73881e0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b

goroutine 140 [semacquire]:
sync.runtime_notifyListWait(0xc60c171f10, 0x0)
    /usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171f00)
    /usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b73882a0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b73882a0, 0x1dcd6500, 0x24, 0xc5b0a176e0, 0x24, 0x1, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b73882a0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b

goroutine 141 [semacquire]:
sync.runtime_notifyListWait(0xc60c171f90, 0x0)
    /usr/local/go/src/runtime/sema.go:507 +0x110
sync.(*Cond).Wait(0xc60c171f80)
    /usr/local/go/src/sync/cond.go:56 +0x80
github.com/hashicorp/nomad/nomad.(*Worker).checkPaused(0xc4b7388300)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:100 +0x45
github.com/hashicorp/nomad/nomad.(*Worker).dequeueEvaluation(0xc4b7388300, 0x1dcd6500, 0x24, 0xc5b13bd230, 0x24, 0x1, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:154 +0x141
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b7388300)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:109 +0x38
created by github.com/hashicorp/nomad/nomad.NewWorker
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b

goroutine 142 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc902d70200, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc482077806, 0x1420740, 0xc82205aae0, 0x24, 0x0, 0x0, 0x20af600, 0xc89b5b0910)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc82205aae0, 0xc89b5b0910, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc42051c548, 0x13, 0x159a5af, 0x4, 0xc6ae70da10, 0x3, 0x3, 0x13c9200, 0x19, ...)
    /usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc42051c548, 0x13, 0xc6ae70da10, 0x3, 0x3, 0x16, 0x16, 0x16)
    /usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc4201c04c0, 0xc4201c6460, 0xc89b5b08f8, 0xc420533000, 0xc668312000, 0x1420740, 0xc82205aae0, 0x16, 0x12903c0, 0xc89b5b0910, ...)
    /usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc4201c6460, 0x200f540, 0xc8367e4a80, 0x13c9200, 0xc6ae70db01)
    /usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).RPC(0xc4201fa280, 0x15a3a15, 0xb, 0x1420740, 0xc82205aa80, 0x12903c0, 0xc89b5b08f0, 0x20af600, 0xc78f316000)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:1382 +0xc4
github.com/hashicorp/nomad/nomad.(*Worker).UpdateEval(0xc4b7388360, 0xc7f8fac160, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:388 +0x2c4
github.com/hashicorp/nomad/scheduler.setStatus(0xc4201c6230, 0x200f5c0, 0xc4b7388360, 0xc48207e6e0, 0x0, 0x0, 0x0, 0x15a03b7, 0x8, 0x0, ...)
    /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:455 +0x25e
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc89549a460, 0xc48207e6e0, 0xc4201c6230, 0x2017d20)
    /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:174 +0x925
github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc4b7388360, 0xc48207e6e0, 0xc5dae74480, 0x24, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:292 +0x460
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b7388360)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:127 +0x12a
created by github.com/hashicorp/nomad/nomad.NewWorker
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b

goroutine 143 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc6fa1f2100, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc48209b406, 0x1420740, 0xc6fb9da600, 0x24, 0x0, 0x0, 0x20af600, 0xc7e6c54260)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc6fb9da600, 0xc7e6c54260, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc42051c548, 0x13, 0x159a5af, 0x4, 0xc6d8863a10, 0x3, 0x3, 0x13c9200, 0x19, ...)
    /usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc42051c548, 0x13, 0xc6d8863a10, 0x3, 0x3, 0x16, 0x16, 0x16)
    /usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc4201c04c0, 0xc4201c6460, 0xc7e6c54258, 0xc420533000, 0xc668376000, 0x1420740, 0xc6fb9da600, 0x16, 0x12903c0, 0xc7e6c54260, ...)
    /usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc4201c6460, 0x200f540, 0xc7218f8500, 0x13c9200, 0xc6d8863b01)
    /usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).RPC(0xc4201fa280, 0x15a3a15, 0xb, 0x1420740, 0xc6fb9da5a0, 0x12903c0, 0xc7e6c54250, 0x20af600, 0xc78f316000)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:1382 +0xc4
github.com/hashicorp/nomad/nomad.(*Worker).UpdateEval(0xc4b73883c0, 0xc72e188000, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:388 +0x2c4
github.com/hashicorp/nomad/scheduler.setStatus(0xc4201c6230, 0x200f5c0, 0xc4b73883c0, 0xc4820a0420, 0x0, 0x0, 0x0, 0x15a03b7, 0x8, 0x0, ...)
    /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:455 +0x25e
github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc86fef8000, 0xc4820a0420, 0xc4201c6230, 0x2017d20)
    /opt/gopath/src/github.com/hashicorp/nomad/scheduler/generic_sched.go:174 +0x925
github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc4b73883c0, 0xc4820a0420, 0xc5d02c42a0, 0x24, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:292 +0x460
github.com/hashicorp/nomad/nomad.(*Worker).run(0xc4b73883c0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:127 +0x12a
created by github.com/hashicorp/nomad/nomad.NewWorker
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/worker.go:82 +0x11b

goroutine 144 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*Server).monitorLeadership(0xc4201fa280)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:50 +0x10f
created by github.com/hashicorp/nomad/nomad.NewServer
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:371 +0x160d

goroutine 145 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*Server).serfEventHandler(0xc4201fa280)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/serf.go:28 +0xfe
created by github.com/hashicorp/nomad/nomad.NewServer
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:374 +0x1632

goroutine 146 [IO wait, 1 minutes]:
internal/poll.runtime_pollWait(0x7fd5bf1aeeb0, 0x72, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4201dd098, 0x72, 0xc6d84d4d00, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc4201dd098, 0xffffffffffffff00, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc4201dd080, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:335 +0x1e2
net.(*netFD).accept(0xc4201dd080, 0x0, 0x1, 0x1)
    /usr/local/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc42051cdf0, 0xc6d84d4f18, 0x40712a, 0xc64d31c0c0)
    /usr/local/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).Accept(0xc42051cdf0, 0xc64d31c0c0, 0x0, 0x3f800000, 0xc60c19c0c0)
    /usr/local/go/src/net/tcpsock.go:247 +0x49
github.com/hashicorp/nomad/nomad.(*Server).listen(0xc4201fa280, 0x200f200, 0xc60c19c0c0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:77 +0xae
created by github.com/hashicorp/nomad/nomad.(*Server).startRPCListener
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:405 +0x92

goroutine 147 [select]:
github.com/hashicorp/nomad/nomad.(*EvalBroker).EmitStats(0xc4203303c0, 0x3b9aca00, 0xc4202a2720)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:818 +0x2f6
created by github.com/hashicorp/nomad/nomad.NewServer
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:380 +0x1685

goroutine 148 [select]:
github.com/hashicorp/nomad/nomad.(*PlanQueue).EmitStats(0xc4201c6410, 0x3b9aca00, 0xc4202a2720)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_queue.go:200 +0x123
created by github.com/hashicorp/nomad/nomad.NewServer
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:383 +0x16c7

goroutine 149 [select]:
github.com/hashicorp/nomad/nomad.(*BlockedEvals).EmitStats(0xc420134240, 0x3b9aca00, 0xc4202a2720)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:601 +0x123
created by github.com/hashicorp/nomad/nomad.NewServer
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:386 +0x1709

goroutine 150 [select]:
github.com/hashicorp/nomad/nomad.(*vaultClient).EmitStats(0xc420073380, 0x3b9aca00, 0xc4202a2720)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/vault.go:1210 +0x123
created by github.com/hashicorp/nomad/nomad.NewServer
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:389 +0x1750

goroutine 151 [select]:
github.com/hashicorp/nomad/nomad.(*Server).heartbeatStats(0xc4201fa280)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/heartbeat.go:169 +0x133
created by github.com/hashicorp/nomad/nomad.NewServer
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/server.go:392 +0x1775

goroutine 152 [IO wait, 16 minutes]:
internal/poll.runtime_pollWait(0x7fd5bf1aedf0, 0x72, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc6d73ed018, 0x72, 0xc6d88e8c00, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc6d73ed018, 0xffffffffffffff00, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Accept(0xc6d73ed000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:335 +0x1e2
net.(*netFD).accept(0xc6d73ed000, 0x4131b8, 0x30, 0x148af00)
    /usr/local/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc6d72e1580, 0x6052ca, 0x148af00, 0xc66fb73b60)
    /usr/local/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).Accept(0xc6d72e1580, 0xc4200140b8, 0x1363620, 0x1fe7390, 0x1563f60)
    /usr/local/go/src/net/tcpsock.go:247 +0x49
net/http.(*Server).Serve(0xc6d5d51790, 0x200d600, 0xc6d72e1580, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:2695 +0x1b2
net/http.Serve(0x200d600, 0xc6d72e1580, 0x2004740, 0xc66fb73a10, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:2323 +0x73
github.com/hashicorp/nomad/command/agent.NewHTTPServer.func1(0xc60c19c100, 0x200d600, 0xc6d72e1580, 0xc6d8483a60, 0xc66fb73890)
    /opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:111 +0x82
created by github.com/hashicorp/nomad/command/agent.NewHTTPServer
    /opt/gopath/src/github.com/hashicorp/nomad/command/agent/http.go:109 +0x40f

goroutine 1541 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc7052bc100, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc5e08d2b06, 0x1420740, 0xc76fedc600, 0x24, 0x0, 0x0, 0x20af600, 0xc81c73aaa8)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc76fedc600, 0xc81c73aaa8, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc718547d48, 0x3, 0x3, 0x13c9200, 0x19, ...)
    /usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc718547d48, 0x3, 0x3, 0x16, 0x16, 0x16)
    /usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc81c73a648, 0xc6e473de80, 0xc8a14ee1e0, 0x1420740, 0xc76fedc600, 0x16, 0x12903c0, 0xc81c73aaa8, ...)
    /usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc835e660c0, 0x0, 0x0)
    /usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc653508000, 0xc5bdc80190)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf

goroutine 1622 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad/drainer.(*NodeDrainer).run(0xc6d73e3b30, 0x200f200, 0xc8528b7e00)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/drainer.go:218 +0x20a
created by github.com/hashicorp/nomad/nomad/drainer.(*NodeDrainer).SetEnabled
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/drainer.go:189 +0xe7

goroutine 1487 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).replicateTo(0xc420252000, 0xc653c9f5c0, 0x74ab25, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:171 +0x82d
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).replicate(0xc420252000, 0xc653c9f5c0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:135 +0x274
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).startStopReplication.func1()
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/raft.go:451 +0x33
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc8a1651f20)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

goroutine 694 [IO wait, 1 minutes]:
internal/poll.runtime_pollWait(0x7fd576803740, 0x72, 0x0)
    /usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4abac0918, 0x72, 0xffffffffffffff00, 0x2005240, 0x1ffc978)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc4abac0918, 0xc8610fd000, 0x1000, 0x1000)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4abac0900, 0xc8610fd000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:126 +0x18a
net.(*netFD).Read(0xc4abac0900, 0xc8610fd000, 0x1000, 0x1000, 0xc4201cc080, 0x3, 0xc420022c00)
    /usr/local/go/src/net/fd_unix.go:202 +0x52
net.(*conn).Read(0xc555648010, 0xc8610fd000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:176 +0x6d
bufio.(*Reader).fill(0xc4b985e060)
    /usr/local/go/src/bufio/bufio.go:97 +0x11a
bufio.(*Reader).ReadByte(0xc4b985e060, 0x3, 0xc8f8752000, 0xc420022c70)
    /usr/local/go/src/bufio/bufio.go:239 +0x39
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*NetworkTransport).handleCommand(0xc4205184d0, 0xc4b985e060, 0xc4b985e0c0, 0xc7ff998070, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:515 +0x43
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*NetworkTransport).handleConn(0xc4205184d0, 0x200f200, 0xc4203af640, 0x20180e0, 0xc555648010)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:499 +0x243
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*NetworkTransport).listen
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:477 +0x1eb

goroutine 1486 [runnable]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).pipelineReplicate(0xc420252000, 0xc653c9f380, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:393 +0x49f
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).replicate(0xc420252000, 0xc653c9f380)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:152 +0x2b5
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).startStopReplication.func1()
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/raft.go:451 +0x33
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc8a1651f00)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

goroutine 1520 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).AcceptStream(0xc64485e8f0, 0xc6ae70bd08, 0xc6ae70bcf8, 0x102eacf)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:207 +0xf2
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).Accept(0xc64485e8f0, 0x160cac8, 0xc4201fa280, 0x200f200, 0xc60c19c0c0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:197 +0x2b
github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x20180e0, 0xc68eafc010, 0xc5bdc800a0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:211 +0x2cd
github.com/hashicorp/nomad/nomad.(*Server).handleConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x20180e0, 0xc68eafc010, 0xc5bdc800a0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:137 +0x309
created by github.com/hashicorp/nomad/nomad.(*Server).listen
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:93 +0x15b

goroutine 1564 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot.(*Autopilot).serverHealthLoop(0xc5dd049f40)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot/autopilot.go:333 +0x163
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot.(*Autopilot).Start
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot/autopilot.go:81 +0x15e

goroutine 1621 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad/drainer.(*deadlineHeap).watch(0xc8528b7f00)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/drain_heap.go:61 +0x1fc
created by github.com/hashicorp/nomad/nomad/drainer.NewDeadlineHeap
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/drain_heap.go:46 +0x130

goroutine 1583 [select]:
github.com/hashicorp/nomad/nomad.(*EvalBroker).waitForSchedulers(0xc4203303c0, 0xc6d88ebf30, 0x1, 0x1, 0xcb0b8fe180, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:483 +0x2c0
github.com/hashicorp/nomad/nomad.(*EvalBroker).Dequeue(0xc4203303c0, 0xc6d88ebf30, 0x1, 0x1, 0x3b9aca00, 0x0, 0x0, 0x0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:348 +0x106
github.com/hashicorp/nomad/nomad.(*Server).reapFailedEvaluations(0xc4201fa280, 0xc91388c780)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:501 +0xb7
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:230 +0x3ef

goroutine 1582 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*Server).schedulePeriodic(0xc4201fa280, 0xc91388c780)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:454 +0x2c8
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:227 +0x3bd

goroutine 1561 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*netPipeline).decodeResponses(0xc8528b7000)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:667 +0x11b
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.newNetPipeline
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/net_transport.go:658 +0x117

goroutine 1578 [chan receive, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc7f6944e00, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc80e136706, 0x1420740, 0xc8687ebb60, 0x24, 0x0, 0x0, 0x20af600, 0xc8f913a918)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc8687ebb60, 0xc8f913a918, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc94698fd48, 0x3, 0x3, 0x13c9200, 0x19, ...)
    /usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc94698fd48, 0x3, 0x3, 0x16, 0x16, 0x16)
    /usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc8f913a598, 0xc6e473de80, 0xc8a62f3140, 0x1420740, 0xc8687ebb60, 0x16, 0x12903c0, 0xc8f913a918, ...)
    /usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc83eba3480, 0x0, 0x0)
    /usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc65436d500, 0xc5bdc80190)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf

goroutine 1489 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*EvaluatePool).run(0xc7a09d0210, 0xc8f530d9e0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:104 +0x1fb
created by github.com/hashicorp/nomad/nomad.NewEvaluatePool
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:49 +0x14f

goroutine 1604 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc70a052080, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc5d0446606, 0x1420740, 0xc707654120, 0x24, 0x0, 0x0, 0x20af600, 0xc8564aa428)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc707654120, 0xc8564aa428, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc6d8869d48, 0x3, 0x3, 0x13c9200, 0x19, ...)
    /usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc6d8869d48, 0x3, 0x3, 0x16, 0x16, 0x16)
    /usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc81c73a6c8, 0xc6e473de80, 0xc8a3349540, 0x1420740, 0xc707654120, 0x16, 0x12903c0, 0xc8564aa428, ...)
    /usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc798d04200, 0x0, 0x0)
    /usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc653afc000, 0xc5bdc80190)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf

goroutine 1616 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc7052bc400, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc5e08d3706, 0x1420740, 0xc76fedccc0, 0x24, 0x0, 0x0, 0x20af600, 0xc81c73b658)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc76fedccc0, 0xc81c73b658, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc8386efd48, 0x3, 0x3, 0x13c9200, 0x19, ...)
    /usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc8386efd48, 0x3, 0x3, 0x16, 0x16, 0x16)
    /usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc81c73b2c0, 0xc6e473de80, 0xc6a6305300, 0x1420740, 0xc76fedccc0, 0x16, 0x12903c0, 0xc81c73b658, ...)
    /usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc85e7a6b40, 0x0, 0xc537843bc0)
    /usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc537843bc0, 0xc5bdc80190)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf

goroutine 1585 [select]:
github.com/hashicorp/nomad/nomad.(*Server).periodicUnblockFailedEvals(0xc4201fa280, 0xc91388c780)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:577 +0x137
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:236 +0x453

goroutine 1635 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*EvaluatePool).run(0xc7a09d0210, 0xc8f530daa0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:104 +0x1fb
created by github.com/hashicorp/nomad/nomad.NewEvaluatePool
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:49 +0x14f

goroutine 1568 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*BlockedEvals).prune(0xc420134240)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:619 +0x141
created by github.com/hashicorp/nomad/nomad.(*BlockedEvals).SetEnabled
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:136 +0x93

goroutine 1588 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc70a350080, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc5d0466606, 0x1420740, 0xc94f79a300, 0x24, 0x0, 0x0, 0x20af600, 0xc82a976598)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc94f79a300, 0xc82a976598, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc8386f1d48, 0x3, 0x3, 0x13c9200, 0x19, ...)
    /usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc8386f1d48, 0x3, 0x3, 0x16, 0x16, 0x16)
    /usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc8564aa0a8, 0xc6e473de80, 0xc8a62cf8c0, 0x1420740, 0xc94f79a300, 0x16, 0x12903c0, 0xc82a976598, ...)
    /usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc82eb71700, 0x0, 0x0)
    /usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc653728000, 0xc5bdc80190)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf

goroutine 1584 [select]:
github.com/hashicorp/nomad/nomad.(*BlockedEvals).GetDuplicates(0xc420134240, 0x3b9aca00, 0x0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:553 +0x1c8
github.com/hashicorp/nomad/nomad.(*Server).reapDupBlockedEvaluations(0xc4201fa280, 0xc91388c780)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:546 +0x7a
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:233 +0x421

goroutine 1581 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*PeriodicDispatch).run(0xc42033b500, 0x200f200, 0xc64d0e0600)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/periodic.go:334 +0x22e
created by github.com/hashicorp/nomad/nomad.(*PeriodicDispatch).SetEnabled
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/periodic.go:175 +0x11c

goroutine 1619 [select, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.watchFew(0x200f200, 0xc8528b7e80, 0xc8386f5748, 0x20, 0x20, 0x0, 0x6)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch_few.go:16 +0x984
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.WatchSet.WatchCtx(0xc7a0b84000, 0x200f200, 0xc8528b7e80, 0x0, 0x74aa05)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch.go:82 +0x129
github.com/hashicorp/nomad/nomad/state.(*StateStore).BlockingQuery(0xc4203fda00, 0xc8386f5940, 0x74aa05, 0x200f200, 0xc8528b7e80, 0xc569e91528, 0xc8386f5930, 0xc8386f5950, 0xfc20df, 0xc701a28138)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:146 +0x19f
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).getJobAllocs(0xc701a280e0, 0x200f200, 0xc8528b7e80, 0x74aa05, 0x1, 0x1, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:423 +0xe2
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc701a280e0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:145 +0xd96
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:88 +0x1e0

goroutine 1615 [chan receive]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc7052bc300, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc5e08d3206, 0x1420740, 0xc76fedca20, 0x24, 0x0, 0x0, 0x20af600, 0xc81c73b0c8)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Eval).Update(0xc42051c1b0, 0xc76fedca20, 0xc81c73b0c8, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_endpoint.go:204 +0x229
reflect.Value.call(0xc42033be00, 0xc68eafc3a8, 0x13, 0x159a5af, 0x4, 0xc6d891bd48, 0x3, 0x3, 0x13c9200, 0x19, ...)
    /usr/local/go/src/reflect/value.go:434 +0x905
reflect.Value.Call(0xc42033be00, 0xc68eafc3a8, 0x13, 0xc6d891bd48, 0x3, 0x3, 0x16, 0x16, 0x16)
    /usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc7f503bcc0, 0xc5bdc80190, 0xc81c73ad18, 0xc6e473de80, 0xc6a6304c80, 0x1420740, 0xc76fedca20, 0x16, 0x12903c0, 0xc81c73b0c8, ...)
    /usr/local/go/src/net/rpc/server.go:381 +0x142
net/rpc.(*Server).ServeRequest(0xc5bdc80190, 0x200fac0, 0xc85e7a67c0, 0x0, 0x0)
    /usr/local/go/src/net/rpc/server.go:496 +0x22c
github.com/hashicorp/nomad/nomad.(*Server).handleNomadConn(0xc4201fa280, 0x200f200, 0xc60c19c0c0, 0x2017ea0, 0xc537843a40, 0xc5bdc80190)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:236 +0x239
created by github.com/hashicorp/nomad/nomad.(*Server).handleMultiplex
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:218 +0x2bf

goroutine 1575 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).heartbeat(0xc420252000, 0xc653c9f5c0, 0xc93dfcb4a0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:351 +0x550
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).replicate.func1()
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:118 +0x3c
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc8a62f2f20)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

goroutine 1519 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).heartbeat(0xc420252000, 0xc653c9f380, 0xc93fd44000)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:340 +0x2cc
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).replicate.func1()
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:118 +0x3c
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc8a3349320)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

goroutine 1603 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).keepalive(0xc64485e8f0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:308 +0x102
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.newSession
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:111 +0x455

goroutine 1567 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*BlockedEvals).watchCapacity(0xc420134240)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:415 +0x136
created by github.com/hashicorp/nomad/nomad.(*BlockedEvals).SetEnabled
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/blocked_evals.go:135 +0x71

goroutine 1620 [select, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.watchFew(0x200f200, 0xc8528b7e00, 0xc9affeda70, 0x20, 0x20, 0x7fd5bf20c1e8, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch_few.go:16 +0x984
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.WatchSet.WatchCtx(0xc7a09d0390, 0x200f200, 0xc8528b7e00, 0xc7a09d0480, 0x749350)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch.go:82 +0x129
github.com/hashicorp/nomad/nomad/state.(*StateStore).BlockingQuery(0xc4203fda00, 0xc9affedc68, 0x749350, 0x200f200, 0xc8528b7e00, 0x561c2e, 0xc4201c6230, 0x2, 0xc5751d2440, 0x40)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:146 +0x19f
github.com/hashicorp/nomad/nomad/drainer.(*nodeDrainWatcher).getNodes(0xc8528b7ec0, 0x749350, 0x3b, 0xc6d8938de8, 0x1, 0x1)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_nodes.go:213 +0xd0
github.com/hashicorp/nomad/nomad/drainer.(*nodeDrainWatcher).watch(0xc8528b7ec0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_nodes.go:150 +0x7f3
created by github.com/hashicorp/nomad/nomad/drainer.NewNodeDrainWatcher
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_nodes.go:141 +0xd2

goroutine 1634 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*EvaluatePool).run(0xc7a09d0210, 0xc8f530da40)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:104 +0x1fb
created by github.com/hashicorp/nomad/nomad.NewEvaluatePool
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply_pool.go:49 +0x14f

goroutine 1565 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*PlanQueue).Dequeue(0xc4201c6410, 0x0, 0x0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_queue.go:155 +0x162
github.com/hashicorp/nomad/nomad.(*Server).planApply(0xc4201fa280)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/plan_apply.go:59 +0xbf
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:192 +0x20e

goroutine 1562 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).pipelineDecode(0xc420252000, 0xc653c9f380, 0x200c580, 0xc8528b7000, 0xc91388c960, 0xc91388c9c0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:448 +0x14f
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).pipelineReplicate.func1()
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/replication.go:385 +0x57
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc420252000, 0xc8528b7040)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

goroutine 1521 [IO wait]:
internal/poll.runtime_pollWait(0x7fd576803680, 0x72, 0x0)
    /usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc6e473c518, 0x72, 0xffffffffffffff00, 0x2005240, 0x1ffc978)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc6e473c518, 0xc9687fa000, 0x1000, 0x1000)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc6e473c500, 0xc9687fa000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:126 +0x18a
net.(*netFD).Read(0xc6e473c500, 0xc9687fa000, 0x1000, 0x1000, 0x0, 0x2a5, 0x0)
    /usr/local/go/src/net/fd_unix.go:202 +0x52
net.(*conn).Read(0xc68eafc010, 0xc9687fa000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:176 +0x6d
bufio.(*Reader).Read(0xc7a08099e0, 0xc705d03a90, 0xc, 0xc, 0xc, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:213 +0x30b
io.ReadAtLeast(0x1ffe600, 0xc7a08099e0, 0xc705d03a90, 0xc, 0xc, 0xc, 0xc, 0x0, 0x0)
    /usr/local/go/src/io/io.go:309 +0x86
io.ReadFull(0x1ffe600, 0xc7a08099e0, 0xc705d03a90, 0xc, 0xc, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:327 +0x58
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recvLoop(0xc64485e8f0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:451 +0xfc
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recv(0xc64485e8f0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:430 +0x2b
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.newSession
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:108 +0x3e4

goroutine 1602 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).send(0xc64485e8f0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:392 +0x2ca
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.newSession
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:109 +0x406

goroutine 1560 [chan receive, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*deferError).Error(0xc8fc1ca300, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/future.go:100 +0x75
github.com/hashicorp/nomad/nomad.(*Server).raftApply(0xc4201fa280, 0xc8973e9c8a, 0x12903c0, 0xc8973e9c00, 0x1, 0x1, 0x61, 0xc662b94b40, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/rpc.go:575 +0x83
github.com/hashicorp/nomad/nomad.(*Server).reconcileJobSummaries(0xc4201fa280, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:762 +0x1ad
github.com/hashicorp/nomad/nomad.(*Server).establishLeadership(0xc4201fa280, 0xc91388c780, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:260 +0x577
github.com/hashicorp/nomad/nomad.(*Server).leaderLoop(0xc4201fa280, 0xc91388c780)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:108 +0x54f
github.com/hashicorp/nomad/nomad.(*Server).monitorLeadership.func1(0xc6d87de040, 0xc4201fa280, 0xc91388c780)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:63 +0x5b
created by github.com/hashicorp/nomad/nomad.(*Server).monitorLeadership
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/leader.go:61 +0x1f6

goroutine 1569 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*EvalBatcher).batcher(0xc7a0a104b0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/deploymentwatcher/batcher.go:63 +0x263
created by github.com/hashicorp/nomad/nomad/deploymentwatcher.NewEvalBatcher
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/deploymentwatcher/batcher.go:36 +0xd4

goroutine 1566 [select, 1 minutes]:
github.com/hashicorp/nomad/nomad.(*EvalBroker).runDelayedEvalsWatcher(0xc4203303c0, 0x200f200, 0xc8528b7d80)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:761 +0x227
created by github.com/hashicorp/nomad/nomad.(*EvalBroker).SetEnabled
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:170 +0xf0

goroutine 1618 [select, 1 minutes]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.watchFew(0x200f200, 0xc8528b7dc0, 0xc420042cd8, 0x20, 0x20, 0x5, 0x1599762)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch_few.go:16 +0x984
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb.WatchSet.WatchCtx(0xc99fd11dd0, 0x200f200, 0xc8528b7dc0, 0x20d3d40, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-memdb/watch.go:82 +0x129
github.com/hashicorp/nomad/nomad/state.(*StateStore).BlockingQuery(0xc4203fda00, 0xc946993ec8, 0x1, 0x200f200, 0xc8528b7dc0, 0x405913, 0xc6ec3e4120, 0x16d2250, 0xc6ec3e4001, 0x646263)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/state/state_store.go:146 +0x19f
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*Watcher).getDeploys(0xc6d73ecf80, 0x200f200, 0xc8528b7dc0, 0x1, 0xc71aac2060, 0xc6ec3e4000, 0xc8ed83e040, 0xc420252000, 0x20002, 0xc94d66bfa4)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/deploymentwatcher/deployments_watcher.go:176 +0x72
github.com/hashicorp/nomad/nomad/deploymentwatcher.(*Watcher).watchDeployments(0xc6d73ecf80, 0x200f200, 0xc8528b7dc0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/deploymentwatcher/deployments_watcher.go:148 +0x231
created by github.com/hashicorp/nomad/nomad/deploymentwatcher.(*Watcher).SetEnabled
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/deploymentwatcher/deployments_watcher.go:121 +0xf2

goroutine 1563 [select]:
github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot.(*Autopilot).run(0xc5dd049f40)
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot/autopilot.go:108 +0x1bb
created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot.(*Autopilot).Start
    /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/consul/agent/consul/autopilot/autopilot.go:80 +0x13c

goroutine 1738 [select]:
github.com/hashicorp/nomad/nomad.(*EvalBroker).waitForSchedulers.func1(0xc8750af020, 0xc99e3ec300, 0xc99e3ec240)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:469 +0xd1
created by github.com/hashicorp/nomad/nomad.(*EvalBroker).waitForSchedulers
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/eval_broker.go:468 +0x103
nickethier commented 6 years ago

Yeah I don't want you sharing those if you're templating in secret data. Many people use the vault integration so secrets aren't leaked into job specs.

I think we may have reproduced a condition that can cause memory bloat under your use case. I'll share in more detail tomorrow, but wanted to get a reply to you today.

Miserlou commented 6 years ago

Sounds good, looking forward to the fix.

Miserlou commented 6 years ago

Did you see my note about your internal load testing?

Because it still seems that you're advertising a product for production work loads that you've never actually tested at scale yourselves, which is obviously pretty astounding.

shantanugadgil commented 6 years ago

As an observer, I feel for Miserlou's pain and frustration as he keeps tackling the various issues he is facing.

Though, @Miserlou I have one thing to add; "You are lucky" ... Having HashiStack is "actually a good thing" ™️ 😁 as Nomad is extremely flexible yet adequately rigid.

I have been trying to convince folks to adopt HashiStack, though without much luck, mostly due to their off-the-radar reasons like: it is not well known (!) maintaining Nomad servers is an overhead (!!) ** we don't easily find people who know "Nomad" (vs. $alternative_stack ... whatever)

From personal experience having raw_exec in addition to docker has helped solved a bunch of problems very easily.

All that said ... coming back to issue at hand and some brainstorming ...

*** FWIW, I have my Nomad agents under systemd with a 10 second delay between restarts and a separate cron job which checks 'systemctl status nomad' to determine if a datdir wipe is needed (sometimes the "invalid task bucket" error pops up)

HTH, Regards, Shantanu

nickethier commented 6 years ago

Before I dig into the details of the fix I think it important to highlight that you may still be misusing Nomad. Nomad is a workload scheduler and is suitable as a replacment for a classic job queue system in many applications. However it is not designed to be a distributed durable queuing system such as Kafka for example. Nomad must hold all pending Jobs in memory to schedule them, it currently has no swap to disk feature and will crash if you DOS it.

This can happen for example if your elastic client pool suddenly needs to expand in response to an unexpected burst in workload. While new nodes are coming online, Jobs are piling up in memory, waiting to be placed. I think an architecture where you have a feedback loop and some backpressure into whats sending dispatch requests would add some resilience to your deployment. Testing the upper bound of your deployment and using that data to watch metrics like nomad.blocked_evals.total_blocked to know when to rate limit would help. We're taking note of some things here to get ideas on how to improve Nomad for usecases such as this, but this is my recommendation for the state of today.

As for the bug you really already nailed it in #4422. Each Job in Nomad maintains a small overhead of memory to track the nomad.job_summary.* stats. This is all fine for users who are using service jobs for example where there is a low cardinality of jobs with high counts. It is however problematic for usecases where there are arbitrarily large number of single count jobs since this overhead can add up significantly over 10s/100s of thousands of jobs. I tested this locally by dispatching 5k allocs to a nomad cluster and observed well over double the memory usage of the leader compared to the followers: nomad1 After profiling the heap it was apparent this was due to publishing metrics

Showing nodes accounting for 88.87MB, 60.27% of 147.45MB total
Dropped 46 nodes (cum <= 0.74MB)
Showing top 10 nodes out of 84
      flat  flat%   sum%        cum   cum%
       1MB  0.68%  0.68%    93.37MB 63.32%  github.com/hashicorp/nomad/nomad.(*Server).publishJobSummaryMetrics
   24.50MB 16.62% 17.30%    92.37MB 62.64%  github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*Metrics).SetGaugeWithLabels
         0     0% 17.30%    92.37MB 62.64%  github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.SetGaugeWithLabels
   35.86MB 24.32% 41.61%    67.86MB 46.02%  github.com/hashicorp/nomad/vendor/github.com/armon/go-metrics.(*InmemSink).SetGaugeWithLabels
         0     0% 41.61%    50.51MB 34.25%  github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1

I'm introducing a couple new flags to the telemetry config stanza to help with this:

Specifically disable_dispatched_job_summary_metrics is what you want. It should allow you to turn telemetry back on as it will filter all all of the job_summary stats for your dispatched jobs.

When doing this with my local cluster I was able to reduce the memory usage of my leader by over %40. nomad2

I'm working to get this merge into our 0.8.7 branch so we can get this out ASAP. That PR is here: #4878

Miserlou commented 6 years ago

Hey Nick, big thanks for the deep dive on this. We've added a limiter for the number of dispatched jobs, which are all stored in a database anyway, so hopefully that gives us a safety buffer.

We are now well aware that Nomad isn't a suitable tool for our needs - however we got this impression because you are lying to people about Nomad's capabilities both online and at conferences. Dispatch jobs are a hack and it's dishonest to advertise them as a replacement for queues, even though the title of your blog post is "Replacing Queues with Nomad Dispatch", without any mention of these crippling limitations. I'm not even going to start on the inability to handle job timeouts, the lack of forward-looking prioritized job scheduling, lack of constraints across dispatched jobs, lack of recommendations for server:job ratios, lack of load testing results, lack of clients publicly using the stack, etc..

I also don't think that disable_dispatched_job_summary_metrics is a great solution to this problem (at least as I understand it) - a good solution would be to give us actually usable dispatch job metrics. As far as I can see it, you've left the problem in place, so the next person who gets duped by your marketing materials is going to fall into the same trap until they find this ticket and realize that there's a "make it suck less" argument that they need to set.

Anyway, I gotta say I really do appreciate you taking the time to finally address this issue and I'm really looking forward to the patch landing, which will likely be a big help to us.

Fuco1 commented 5 years ago

Yep, fell into the same trap. I've been fighting with the dispatch jobs for the last week.

My current idea is to simply schedule consumers with nomad and leave the actual job metadata in a queue from which they will pull the tasks. Sadly this means complete re-engineering of the entire architecture.

Maybe I'll try the backpressure trick as mentioned by @Miserlou (store jobs in a queue and have nomad periodically fetch some and dispatch them). With an aggresive GC this might somewhat work.

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.