hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

allocations stuck in pending (v0.7.1) #3932

Closed: dhv closed this issue 5 years ago

dhv commented 6 years ago

Nomad version

v0.7.1

Operating system and Environment details

Ubuntu 16.04 Xenial

Issue

Some allocations stay stuck in pending and the deployment fails. The allocation's last event is usually either the "Building Task Directory" event or Docker's "Downloading image" event. I've confirmed on the client that the image has finished downloading. The only way I've found to get an allocation unstuck is to restart the Nomad client.

Reproduction steps

There is no deterministic way to reproduce it other than to keep submitting jobs. It might have to do with the constraints used on this particular job, but I have seen it happen on other jobs as well.

With this particular job, I have it pinned to a specific host with a constraint. The new alloc first gets stuck in the state shown below because an older version of the same job is still present. Then I force garbage collection, which cleans up the old allocation. The new alloc then proceeds to "Building Task Directory" and gets stuck there until I restart the Nomad client.
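
For reference, the workaround looks roughly like this (a sketch only; the client GC endpoint address and the systemd unit name are assumptions about my environment):

# inspect the stuck allocation
nomad alloc-status <alloc-id>

# force garbage collection of terminal allocations on this client
curl -XPUT http://127.0.0.1:4646/v1/client/gc

# last resort: restart the client agent, which unsticks the allocation
sudo systemctl restart nomad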

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

ID                  = f9650af5
Eval ID             = de0ba8e6
Name                = REDACTED-alertmanager.monitoring[0]
Node ID             = 3d3a8b6d
Job ID              = REDACTED-alertmanager
Job Version         = 15
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 1m39s ago
Modified            = 1m39s ago
Deployment ID       = cae77904
Deployment Health   = unset

Edit: I realized I copied and pasted an older alloc than the one below, but I can reproduce this behavior for the same allocation. The main issue is that it gets stuck after the "Building Task Directory" event.

Force gc, then:

ID                  = 7b1356a6
Eval ID             = 1794e81b
Name                = REDACTED-alertmanager.monitoring[0]
Node ID             = 3d3a8b6d
Job ID              = REDACTED-alertmanager
Job Version         = 17
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 4m41s ago
Modified            = 1m41s ago
Deployment ID       = cfba3bb6
Deployment Health   = unhealthy

Task "alertmanager" is "pending"
Task Resources
CPU       Memory   Disk     IOPS  Addresses
1000 MHz  1.0 GiB  4.0 GiB  0     api: REDACTED:9093

Task Events:
Started At     = N/A
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                   Type             Description
03/02/18 14:42:39 PST  Alloc Unhealthy  Task not running by deadline
03/02/18 14:39:39 PST  Task Setup       Building Task Directory
03/02/18 14:39:39 PST  Received         Task received by client

Logs:

2018/03/02 22:38:05.628291 [DEBUG] client: updated allocations at index 814439 (total 7) (pulled 0) (filtered 7)
2018/03/02 22:38:05.628458 [DEBUG] client: allocs: (added 0) (removed 1) (updated 0) (ignore 7)
2018/03/02 22:38:05.628526 [INFO] client.gc: marking allocation cf7577f4-8928-7996-9c21-bc18e5666454 for GC
2018/03/02 22:38:05.628594 [INFO] client.gc: garbage collecting allocation cf7577f4-8928-7996-9c21-bc18e5666454 due to forced collection
2018/03/02 22:38:05.628628 [DEBUG] client: alloc "f9650af5-8c87-d6bb-6672-4cc22c78f8a3" copying previous alloc "cf7577f4-8928-7996-9c21-bc18e5666454"
2018/03/02 22:39:39.056799 [DEBUG] client: starting task runners for alloc '7b1356a6-74b9-69d6-cd6d-60bca1111681'
2018/03/02 22:39:39.057062 [DEBUG] client.alloc_watcher: deadline (3m0s) for alloc "7b1356a6-74b9-69d6-cd6d-60bca1111681" is at 2018-03-02 22:42:39.056935478 +0000 UTC 
2018/03/02 22:39:39.057106 [DEBUG] client: starting task context for 'alertmanager' (alloc '7b1356a6-74b9-69d6-cd6d-60bca1111681')
2018/03/02 22:39:39 [INFO] (runner) creating new runner (dry: false, once: false)
2018/03/02 22:39:39 [DEBUG] (runner) final config: REDACTED
2018/03/02 22:39:39 [INFO] (runner) creating watcher
2018/03/02 22:39:39 [INFO] (runner) starting
2018/03/02 22:39:39 [DEBUG] (runner) running initial templates
2018/03/02 22:39:39 [INFO] (runner) initiating run
2018/03/02 22:39:39 [DEBUG] (runner) checking template b21ae8352fbb155882c9a9dea3032f32
2018/03/02 22:39:39 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/7b1356a6-74b9-69d6-cd6d-60bca1111681/alertmanager/local/config/config.yml"
2018/03/02 22:39:39 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/7b1356a6-74b9-69d6-cd6d-60bca1111681/alertmanager/local/config/config.yml"
2018/03/02 22:39:39 [DEBUG] (runner) diffing and updating dependencies
2018/03/02 22:39:39 [DEBUG] (runner) watching 0 dependencies
2018/03/02 22:39:39 [DEBUG] (runner) all templates rendered
2018/03/02 22:39:39.094790 [DEBUG] client: unblocking task alertmanager for alloc "7b1356a6-74b9-69d6-cd6d-60bca1111681": Template
2018/03/02 22:39:39.096841 [DEBUG] driver.docker: image "prom/alertmanager:v0.14.0" (sha256:23744b2d645c0574015adfba4a90283b79251aee3169dbe67f335d8465a8a63f) reference count incremented: 3
2018/03/02 22:39:39.186414 [DEBUG] client: updated allocations at index 814454 (total 8) (pulled 0) (filtered 8)
2018/03/02 22:39:39.186494 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 8)
2018/03/02 22:39:43.444857 [DEBUG] http: Request /v1/agent/health?type=client (466.085µs)
2018/03/02 22:39:46.907183 [DEBUG] http: Request /v1/client/allocation/7b1356a6-74b9-69d6-cd6d-60bca1111681/stats?region=global (1.169769ms)
2018/03/02 22:39:53.471335 [DEBUG] http: Request /v1/agent/health?type=client (2.980292ms)
2018/03/02 22:40:03.494777 [DEBUG] http: Request /v1/agent/health?type=client (223.423µs)
2018/03/02 22:40:12.970078 [DEBUG] http: Request /v1/client/allocation/7b1356a6-74b9-69d6-cd6d-60bca1111681/stats?region=global (310.832µs)
2018/03/02 22:40:13.520357 [DEBUG] http: Request /v1/agent/health?type=client (2.082071ms)
2018/03/02 22:40:23.544361 [DEBUG] http: Request /v1/agent/health?type=client (258.436µs)
2018/03/02 22:40:33.568069 [DEBUG] http: Request /v1/agent/health?type=client (311.246µs)
2018/03/02 22:40:43.591419 [DEBUG] http: Request /v1/agent/health?type=client (262.24µs)
2018/03/02 22:40:53.615101 [DEBUG] http: Request /v1/agent/health?type=client (300.035µs)
2018/03/02 22:41:03.638927 [DEBUG] http: Request /v1/agent/health?type=client (221.217µs)
2018/03/02 22:41:06.483439 [DEBUG] http: Request /v1/client/allocation/7b1356a6-74b9-69d6-cd6d-60bca1111681/stats?region=global (747.245µs)
2018/03/02 22:41:13.662400 [DEBUG] http: Request /v1/agent/health?type=client (247.43µs)
2018/03/02 22:41:23.686115 [DEBUG] http: Request /v1/agent/health?type=client (321.177µs)
2018/03/02 22:41:33.709703 [DEBUG] http: Request /v1/agent/health?type=client (251.064µs)
2018/03/02 22:41:43.732748 [DEBUG] http: Request /v1/agent/health?type=client (253.07µs)
2018/03/02 22:41:53.756417 [DEBUG] http: Request /v1/agent/health?type=client (253.271µs)
2018/03/02 22:42:03.779746 [DEBUG] http: Request /v1/agent/health?type=client (222.659µs)
2018/03/02 22:42:13.803207 [DEBUG] http: Request /v1/agent/health?type=client (313.132µs)
2018/03/02 22:42:23.826753 [DEBUG] http: Request /v1/agent/health?type=client (253.553µs)
2018/03/02 22:42:33.850701 [DEBUG] http: Request /v1/agent/health?type=client (257.419µs)
2018/03/02 22:42:39.057280 [DEBUG] client: event from "Alloc Unhealthy" for task "alertmanager" in alloc "7b1356a6-74b9-69d6-cd6d-60bca1111681": Task not running by deadline

Then, after restarting the Nomad client, the task gets unstuck and works fine:

03/02/18 15:07:14 PST  Started          Task started by client
03/02/18 14:42:39 PST  Alloc Unhealthy  Task not running by deadline
03/02/18 14:39:39 PST  Task Setup       Building Task Directory
03/02/18 14:39:39 PST  Received         Task received by client

Logs:

==> Starting Nomad agent...
==> Nomad agent configuration:

                Client: true
             Log Level: DEBUG
                Region: REDACTED (DC: REDACTED)
                Server: false
               Version: 0.7.1

==> Nomad agent started! Log data will stream in below:

2018/03/02 23:07:08.241498 [INFO] client: using state directory /var/lib/nomad/client
2018/03/02 23:07:08.241677 [INFO] client: using alloc directory /var/lib/nomad/alloc
2018/03/02 23:07:08.247774 [DEBUG] client: built-in fingerprints: [arch cgroup consul cpu host memory network nomad signal storage vault env_aws env_gce]
2018/03/02 23:07:08.247939 [INFO] fingerprint.cgroups: cgroups are available
2018/03/02 23:07:12 [DEBUG] (runner) diffing and updating dependencies
2018/03/02 23:07:12 [DEBUG] (runner) watching 0 dependencies
2018/03/02 23:07:12 [DEBUG] (runner) all templates rendered
2018/03/02 23:07:08.248069 [DEBUG] client: fingerprinting cgroup every 15s
2018/03/02 23:07:08.248890 [INFO] fingerprint.consul: consul agent is available
2018/03/02 23:07:08.248993 [DEBUG] client: fingerprinting consul every 15s
2018/03/02 23:07:08.251049 [DEBUG] fingerprint.cpu: frequency: 2300 MHz
2018/03/02 23:07:08.251054 [DEBUG] fingerprint.cpu: core count: 32
2018/03/02 23:07:08.260224 [DEBUG] fingerprint.network: link speed for bond0 set to 2000
2018/03/02 23:07:08.260296 [DEBUG] fingerprint.network: Detected interface bond0 with IP:
2018/03/02 23:07:08.584200 [INFO] fingerprint.vault: Vault is available
2018/03/02 23:07:08.584291 [DEBUG] client: fingerprinting vault every 15s
2018/03/02 23:07:10.584415 [DEBUG] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
2018/03/02 23:07:12.584628 [DEBUG] fingerprint.env_gce: Could not read value for attribute "machine-type"
2018/03/02 23:07:12.584637 [DEBUG] fingerprint.env_gce: Error querying GCE Metadata URL, skipping
2018/03/02 23:07:12.705574 [DEBUG] client: unblocking task alertmanager for alloc "f9650af5-8c87-d6bb-6672-4cc22c78f8a3": Template
2018/03/02 23:07:12.584653 [DEBUG] client: applied fingerprints [arch cgroup consul cpu host memory network nomad signal storage vault]
2018/03/02 23:07:12.584763 [DEBUG] driver.docker: using client connection initialized from environment
2018/03/02 23:07:12.584872 [DEBUG] client: fingerprinting rkt every 15s
2018/03/02 23:07:12.587383 [DEBUG] driver.exec: exec driver is enabled
2018/03/02 23:07:12.587393 [WARN] driver.raw_exec: raw exec is enabled. Only enable if needed
2018/03/02 23:07:12.587430 [DEBUG] client: fingerprinting docker every 15s
2018/03/02 23:07:12.587491 [DEBUG] client: fingerprinting exec every 15s
2018/03/02 23:07:12.667696 [DEBUG] client: available drivers [docker exec raw_exec java]
2018/03/02 23:07:12.669334 [DEBUG] client.consul: bootstrap contacting following Consul DCs: [REDACTED]
2018/03/02 23:07:12.669762 [INFO] driver.docker: re-attaching to docker process: 407ca43083f8e4450f7c0226dac3c4600f229d45aa0c5cd4820f329ebb81dc8a
2018/03/02 23:07:12.669768 [DEBUG] driver.docker: re-attached to handle: DOCKER:{"Version":"0.7.1","Image":"justwatch/elasticsearch_exporter:1.0.1","ImageID":"sha256:5c20061f20c7b85a0ee9152ee080cae5ecf8facf51040a15f028d37cf234de5f","ContainerID":"407ca43083f8e4450f7c0226dac3c4600f229d45aa0c5cd4820f329ebb81dc8a","KillTimeout":5000000000,"MaxKillTimeout":30000000000,"PluginConfig":{"Pid":21693,"AddrNet":"unix","AddrName":"/tmp/plugin192186907"}}
2018/03/02 23:07:12.672645 [DEBUG] driver.docker: version of executor: 1.1.0
2018/03/02 23:07:12.672676 [DEBUG] driver.docker: image "justwatch/elasticsearch_exporter:1.0.1" (sha256:5c20061f20c7b85a0ee9152ee080cae5ecf8facf51040a15f028d37cf234de5f) reference count incremented: 1
2018/03/02 23:07:12.672902 [DEBUG] client: starting task context for 'elasticsearch-exporter' (alloc '029151f6-5d92-5ad7-5a79-c98169677b51')
2018/03/02 23:07:12.673744 [DEBUG] client: starting task runners for alloc '029151f6-5d92-5ad7-5a79-c98169677b51'
2018/03/02 23:07:12.674999 [INFO] driver.docker: re-attaching to docker process: 04d0e654648663dbddd8354436b3967f3fe050a2f50e3616533943bb43912989
2018/03/02 23:07:12.675028 [WARN] client: error fetching stats of task elasticsearch-exporter: stats collection hasn't started yet
2018/03/02 23:07:12.675111 [DEBUG] driver.docker: re-attached to handle: DOCKER:{"Version":"0.7.1","Image":"prom/node-exporter:v0.14.0","ImageID":"sha256:bb4a6b7746588f44756a7166c9721c587f0f1d068d5b61fce96622885e9ef641","ContainerID":"04d0e654648663dbddd8354436b3967f3fe050a2f50e3616533943bb43912989","KillTimeout":5000000000,"MaxKillTimeout":30000000000,"PluginConfig":{"Pid":21692,"AddrNet":"unix","AddrName":"/tmp/plugin308744399"}}
2018/03/02 23:07:12.677504 [DEBUG] driver.docker: version of executor: 1.1.0
2018/03/02 23:07:12.677518 [DEBUG] driver.docker: image "prom/node-exporter:v0.14.0" (sha256:bb4a6b7746588f44756a7166c9721c587f0f1d068d5b61fce96622885e9ef641) reference count incremented: 1
2018/03/02 23:07:12.677646 [DEBUG] client: starting task context for 'node-exporter' (alloc '6740d77d-51ce-6281-f6d6-9917cfc8f3c8')
2018/03/02 23:07:12.677868 [DEBUG] client: starting task runners for alloc '6740d77d-51ce-6281-f6d6-9917cfc8f3c8'
2018/03/02 23:07:12.678060 [DEBUG] client: starting task context for 'alertmanager' (alloc '7b1356a6-74b9-69d6-cd6d-60bca1111681')
2018/03/02 23:07:12.678230 [DEBUG] client: starting task runners for alloc '7b1356a6-74b9-69d6-cd6d-60bca1111681'
2018/03/02 23:07:12.678259 [WARN] client: error fetching stats of task node-exporter: stats collection hasn't started yet
2018/03/02 23:07:12 [INFO] (runner) creating new runner (dry: false, once: false)
2018/03/02 23:07:12.678595 [INFO] driver.docker: re-attaching to docker process: 4d916e7b4dd652d15807a53f74f640003da9bea9b44dbe7504c83e4dc9dc70c6
2018/03/02 23:07:12.678602 [DEBUG] driver.docker: re-attached to handle: DOCKER:{"Version":"0.7.1","Image":"REDACTED","ImageID":"sha256:b31cdf5e97996a59913883cb4617327981e0592f7c64f201123e1d6e49dc819d","ContainerID":"4d916e7b4dd652d15807a53f74f640003da9bea9b44dbe7504c83e4dc9dc70c6","KillTimeout":30000000000,"MaxKillTimeout":30000000000,"PluginConfig":{"Pid":11518,"AddrNet":"unix","AddrName":"/tmp/plugin452957955"}}
    2018/03/02 23:07:12 [DEBUG] (runner) final config: {"Consul":{"Address":"127.0.0.1:8500","Auth":{"Enabled":false,"Username":"","Password":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"Token":"","Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":33,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000},"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"WARN","MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0"},
"Templates":[{} TRUNCATED..
    2018/03/02 23:07:12 [INFO] (runner) creating watcher
    2018/03/02 23:07:12 [INFO] (runner) starting
    2018/03/02 23:07:12 [DEBUG] (runner) running initial templates
    2018/03/02 23:07:12 [INFO] (runner) initiating run
    2018/03/02 23:07:12 [DEBUG] (runner) checking template b21ae8352fbb155882c9a9dea3032f32
    2018/03/02 23:07:12 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/7b1356a6-74b9-69d6-cd6d-60bca1111681/alertmanager/local/config/config.yml"
    2018/03/02 23:07:12 [DEBUG] (runner) diffing and updating dependencies
    2018/03/02 23:07:12 [DEBUG] (runner) watching 0 dependencies
    2018/03/02 23:07:12 [DEBUG] (runner) all templates rendered
    2018/03/02 23:07:12.681657 [DEBUG] client: unblocking task alertmanager for alloc "7b1356a6-74b9-69d6-cd6d-60bca1111681": Template
    2018/03/02 23:07:12.682105 [DEBUG] driver.docker: version of executor: 1.1.0
    2018/03/02 23:07:12.682118 [DEBUG] driver.docker: image "REDACTED" (sha256:b31cdf5e97996a59913883cb4617327981e0592f7c64f201123e1d6e49dc819d) reference count incremented: 1
    2018/03/02 23:07:12.682401 [DEBUG] client: starting task context for 'nginx' (alloc 'bb5d9d19-4835-5303-f43c-f410943f984c')
    2018/03/02 23:07:12.682544 [INFO] driver.docker: re-attaching to docker process: 3e886ab2a77a936a8817a25b2cb94d6051d25e556bf8ae5d8cb0920212ce4a04
    2018/03/02 23:07:12.682549 [DEBUG] driver.docker: re-attached to handle: DOCKER:{"Version":"0.7.1","Image":"REDACTED","ImageID":"sha256:37fa07e0379abd71c4fd299abb99c1d3a049f0e7ea9e7ac5df88c5d6020e776e","ContainerID":"3e886ab2a77a936a8817a25b2cb94d6051d25e556bf8ae5d8cb0920212ce4a04","KillTimeout":5000000000,"MaxKillTimeout":30000000000,"PluginConfig":{"Pid":11481,"AddrNet":"unix","AddrName":"/tmp/plugin197559497"}}
    2018/03/02 23:07:12 [INFO] (runner) creating new runner (dry: false, once: false)
    2018/03/02 23:07:12.683033 [WARN] client: error fetching stats of task nginx: stats collection hasn't started yet
    2018/03/02 23:07:12 [DEBUG] (runner) final config: REDACTED
    2018/03/02 23:07:12.684716 [DEBUG] driver.docker: image "prom/alertmanager:v0.14.0" (sha256:23744b2d645c0574015adfba4a90283b79251aee3169dbe67f335d8465a8a63f) reference count incremented: 1

Job file (if appropriate)

# -*- hcl -*-
job "REDACTED-alertmanager" {
  region = "REDACTED"
  datacenters = ["REDACTED"]

  type = "service"

  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  constraint {
    attribute = "${node.unique.name}"
    value     = "REDACTED"
  }

  update {
    max_parallel = 1
    min_healthy_time = "30s"
    healthy_deadline = "3m"
    auto_revert = false
    canary = 0
  }

  group "monitoring" {
    count = 1

    restart {
      # The number of attempts to run the job within the specified interval.
      attempts = 3
      interval = "5m"

      delay = "25s"
      mode = "delay"
    }

    ephemeral_disk {
      sticky  = true
      size = 4096
    }

    task "alertmanager" {
      driver = "docker"
      config {
        image = "prom/alertmanager:v0.14.0"
        volumes = [
          "local/config:/etc/alertmanager:ro",
        ]
        mounts = [
          {
            target = "/alertmanager"
            source = "alertmanager-${NOMAD_GROUP_NAME}-${NOMAD_TASK_NAME}-${NOMAD_ALLOC_INDEX}"
            readonly = false
          }
        ]
      }

      logs {
        max_files     = 50
        max_file_size = 10
      }

      resources {
        cpu    = 1000
        memory = 1024
        network {
          mbits = 10
          port "api" {
            static = 9093
          }
        }
      }

      service {
        name = "alertmanager"
        port = "api"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      template {
        destination = "local/config/config.yml"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        splay = "5s"
        left_delimiter = "[["
        right_delimiter = "]]"

        data = REDACTED
      }
    }
  }
}

I will update with more info as I encounter this again.

chelseakomlo commented 6 years ago

Hi, thanks for reporting this issue. There is quite a bit at play here so I want to make sure I understand how to reproduce this issue.

Steps to reproduce:
1) Submit a job (constrained to a single node) for which the allocation succeeds.
2) Update the job (constrained to the same node); the new allocation is expected to succeed.
3) Force garbage collection of the old allocation from step 1.
4) The expected-to-succeed allocation from step 2 becomes stuck in pending after attempting to set up the task directory.

Is there a specific reason this job is constrained to a single host?

At first glance, this looks similar to https://github.com/hashicorp/nomad/issues/3121 but further information will be helpful.

dhv commented 6 years ago

This is not limited to jobs that are constrained to a single host. I just deployed a system-scheduled monitoring job that stays in the "pending" state indefinitely on a few clients. This is different from #3121 in that it is not specific to long image downloads and does not resolve unless I restart the Nomad client itself. It clears immediately once Nomad restarts; restarting always works.

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
06a73888  68501606  monitoring  1        run      pending  1h23m ago  1h23m ago
2d6792c8  2877565f  monitoring  1        run      pending  1h23m ago  1h23m ago
30803133  ecc9c006  monitoring  1        run      pending  1h23m ago  1h23m ago
314e0f54  925b929a  monitoring  1        run      running  1h23m ago  1h22m ago
5b567fb9  182fd8e3  monitoring  1        run      running  1h23m ago  1h23m ago
69e3deac  70e6d284  monitoring  1        run      running  1h23m ago  1h22m ago
70676651  ae2ec5ca  monitoring  1        run      pending  1h23m ago  1h23m ago
82b17a73  69f0e2d4  monitoring  1        run      running  1h23m ago  1h23m ago
8ff65f9a  335994ec  monitoring  1        run      running  1h23m ago  1h22m ago
91bcf6ac  4ef48180  monitoring  1        run      running  1h23m ago  1h23m ago
b2c80f09  87842616  monitoring  1        run      running  1h23m ago  1h22m ago
d0e8f661  140244bb  monitoring  1        run      running  1h23m ago  1h22m ago
dd6b4946  49874349  monitoring  1        run      running  1h23m ago  1h23m ago
df5786ad  06999874  monitoring  1        run      running  1h23m ago  1h22m ago
e1deec2b  e38b936c  monitoring  1        run      running  1h23m ago  1h23m ago
f7eb1533  f37fb995  monitoring  1        run      running  1h23m ago  1h23m ago
nomad alloc-status -region REGION 06a73888
ID                  = 06a73888
Eval ID             = 88334edb
Name                = REGION-cadvisor.monitoring[0]
Node ID             = 68501606
Job ID              = REGION-cadvisor
Job Version         = 1
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 1h24m ago
Modified            = 1h24m ago

nomad alloc-status -region REGION 2d6792c8
ID                  = 2d6792c8
Eval ID             = 88334edb
Name                = REGION-cadvisor.monitoring[0]
Node ID             = 2877565f
Job ID              = REGION-cadvisor
Job Version         = 1
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 1h24m ago
Modified            = 1h24m ago

nomad alloc-status -region REGION 30803133
ID                  = 30803133
Eval ID             = 88334edb
Name                = REGION-cadvisor.monitoring[0]
Node ID             = ecc9c006
Job ID              = REGION-cadvisor
Job Version         = 1
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 1h24m ago
Modified            = 1h24m ago

I am trying to narrow down the reproduction steps. Let me know what other logs or config I can dump to help.

vincenthuynh commented 6 years ago

Hey, try checking whether your Nomad Raft log indexes are in sync on all servers: nomad agent-info | grep last_log_index

We encountered this in our cluster as well, and out-of-sync indexes seemed to be the cause: https://github.com/hashicorp/nomad/issues/3227
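
For example, a quick way to compare them across servers (a rough sketch; the server hostnames and SSH access are assumptions):

# run against each server and compare the reported last_log_index values
for host in nomad-server-1 nomad-server-2 nomad-server-3; do
  echo -n "$host: "
  ssh "$host" 'nomad agent-info | grep last_log_index'
done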

dadgar commented 6 years ago

@dhv Next time this happens, would you be willing to kill the particular Nomad agent that isn't starting the allocations with SIGQUIT (signal 3) so we can get a stack dump?
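
A rough sketch of how to capture that, assuming a single systemd-managed agent per host (the Go runtime writes the goroutine dump to the agent's stderr, so it should end up in journald):

# send SIGQUIT to the stuck client's nomad agent
sudo kill -QUIT "$(pgrep -o nomad)"

# the stack dump is written to the agent's stderr
sudo journalctl -u nomad --no-pager | tail -n 500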

dadgar commented 6 years ago

@dhv Are the allocations that get stuck always using the Docker driver? What else do they have in common? Are they all using templates, Vault, service definitions, etc.? Could you share the job files of a few that have gotten stuck?

dmartin-isp commented 6 years ago

I've got something similar with Nomad 0.8.3. We often get jobs stuck in "pending". I'll check the Raft log indexes and I'll try SIGQUIT.

shilov commented 6 years ago

Bumped into this as well. Restarting the agent seems to help.

camerondavison commented 6 years ago

@dadgar I am running 0.8.3 and have an allocation that is stuck in pending, but the node it is allocated to is down, so I cannot send SIGQUIT. Any idea what else could get it unstuck?

Allocations
ID        Node ID   Task Group                   Version  Desired  Status   Created     Modified
cc6b1a9c  9b9f60bb  ___-group  0        run      pending  19h49m ago  19h49m ago
bd7b6760  6ed39d79  ___-group  0        stop     lost     19h50m ago  19h49m ago
ID            = 9b9f60bb
Name          = ip-172-32-15-26
Class         = worker
DC            = ___
Drain         = false
Eligibility   = eligible
Status        = down
Driver Status = docker,exec,raw_exec,rkt

Node Events
Time                  Subsystem  Message
2018-06-15T23:12:39Z  Cluster    Node Registered

Allocated Resources
CPU           Memory       Disk         IOPS
0/120000 MHz  0 B/156 GiB  0 B/462 GiB  0/0

Allocation Resource Utilization
CPU           Memory
0/120000 MHz  0 B/156 GiB

preetapan commented 6 years ago

Hi @camerondavison!

What does nomad node status 9b9f60bb return for that node that's down?

Strange that the client status didn't get updated to lost when your node went down.

Could you try running nomad job eval <jobid>? That should force a scheduler evaluation, and if the node is marked as down, it will transition any pending allocs to lost.
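
A minimal sketch of that sequence, with placeholder IDs:

# force a new evaluation for the job
nomad job eval <jobid>

# then inspect the resulting evaluation and the stuck allocation
nomad eval-status -verbose <eval-id>
nomad alloc-status -verbose <alloc-id>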

camerondavison commented 6 years ago

What I posted above was the node status; sorry if that was not clear.

$ nomad node-status 9b9f60bb

error fetching node stats: Unexpected response code: 404 (No path to node)
ID            = 9b9f60bb
Name          = ip-172-32-15-26
Class         = worker
DC            = ____
Drain         = false
Eligibility   = eligible
Status        = down
Driver Status = docker,exec,raw_exec,rkt

Node Events
Time                  Subsystem  Message
2018-06-15T23:12:39Z  Cluster    Node Registered

Allocated Resources
CPU           Memory       Disk         IOPS
0/120000 MHz  0 B/156 GiB  0 B/462 GiB  0/0

Allocation Resource Utilization
CPU           Memory
0/120000 MHz  0 B/156 GiB

error fetching node stats: actual resource usage not present

Allocations
ID        Node ID   Task Group                   Version  Desired  Status   Created     Modified
cc6b1a9c  9b9f60bb  ____-group  0        run      pending  20h48m ago  20h48m ago

And the eval

$ nomad alloc-status -verbose cc6b1a9c
ID                  = cc6b1a9c-ebe1-ddbf-8131-163b4c6fb138
Eval ID             = 311cd23e-b2ed-91fc-0fce-b79d9fad326a
Name                = ___-job/dispatch-1529435710-54c7a659.____-group[0]
Node ID             = 9b9f60bb-a79d-3133-431e-b7d5e287003d
Job ID              = ___-job/dispatch-1529435710-54c7a659
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2018-06-19T14:15:32-05:00
Modified            = 2018-06-19T14:15:32-05:00
Evaluated Nodes     = 3
Filtered Nodes      = 0
Exhausted Nodes     = 1
Allocation Time     = 75.566µs
Failures            = 0

Couldn't retrieve stats: Unexpected response code: 404 (No path to node)

Placement Metrics
  * Resources exhausted on 1 nodes
  * Class "worker" exhausted on 1 nodes
  * Dimension "cpu" exhausted on 1 nodes
  * Score "9b9f60bb-a79d-3133-431e-b7d5e287003d.binpack" = 10.829454
  * Score "089c9fa5-9e2c-8da0-7ea1-a7b633ca3e5f.binpack" = 6.519062
$ nomad eval-status -verbose 311cd23e-b2ed-91fc-0fce-b79d9fad326a
ID                 = 311cd23e-b2ed-91fc-0fce-b79d9fad326a
Status             = complete
Status Description = complete
Type               = batch
TriggeredBy        = node-update
Node ID            = 6ed39d79-a7b5-5134-57dc-32233450a276
Priority           = 50
Placement Failures = false
Previous Eval      = <none>
Next Eval          = <none>
Blocked Eval       = <none>

preetapan commented 6 years ago

That looks like a potential bug in the scheduler because it didn't mark the desired state of cc6b1a9c as stop after the node update that marked that node as down. I'll dig into this some more and update this thread.

Could you also post the output of nomad job status? Specifically, I want to know whether it actually created a replacement allocation for the alloc that ended up in this pending state.

Did running another evaluation on the job, as I mentioned above with nomad job eval <jobid>, do anything?

camerondavison commented 6 years ago

Oh, sorry @preetapan, I did not realize you wanted me to force a new eval of the job; I thought you just wanted the eval status. It looks like after 24 hours a garbage collector or something ran and the job got re-evaluated by itself.

Note:

$ nomad node-status 9b9f60bb
No node(s) with prefix "9b9f60bb" found
nomad job status ___-job/dispatch-1529435710-54c7a659
ID            = ___-job/dispatch-1529435710-54c7a659
Name          = ___-job/dispatch-1529435710-54c7a659
Submit Date   = 2018-06-19T14:15:10-05:00
Type          = batch
Priority      = 50
Datacenters   = ___
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group                   Queued  Starting  Running  Failed  Complete  Lost
___-group  0       0         0        1       0         2

Allocations
ID        Node ID   Task Group                   Version  Desired  Status  Created    Modified
873a418f  c050b1ac  ___-group  0        run      failed  6m38s ago  5m30s ago
cc6b1a9c  9b9f60bb  ___-group  0        stop     lost    1d14m ago  6m38s ago
bd7b6760  6ed39d79  ___-group  0        stop     lost    1d14m ago  1d14m ago

The failure was expected, but the lost allocations were not.

preetapan commented 6 years ago

@camerondavison I opened another issue for this: #4437. The other reports on this ticket indicate that something on the client kept jobs in pending, but the situation you described today is different, and I didn't want to keep piling onto this ticket. Let's continue the discussion on that one.

If you still have that alloc and job, can you also add the full JSON responses from our API for the alloc, the job, and any evaluations for that job (via nomad job status -evals <jobid>) to #4437? Sorry for the back and forth on this one!
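
In case it helps, the raw JSON can be pulled straight from the HTTP API, roughly like this (a sketch; the agent address and IDs are placeholders, and dispatch job IDs contain a "/" that may need URL-encoding):

# allocation, job, and evaluations as raw JSON
curl -s http://127.0.0.1:4646/v1/allocation/<alloc-id>
curl -s http://127.0.0.1:4646/v1/job/<job-id>
curl -s http://127.0.0.1:4646/v1/job/<job-id>/evaluations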

stale[bot] commented 5 years ago

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1:

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.