hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Template: Allocation remains pending forever if consul key cannot be read. #2205

Closed cyrilgdn closed 7 years ago

cyrilgdn commented 7 years ago

Nomad version

Nomad v0.5.2

Issue

I created a simple test job (see below) which tries to render a Consul key into a file with: data = "{{ key \"configtest\" }}"

When I run this job, if reading the Consul key returns a 404 (the key does not exist or, in my case, is protected by an ACL), the allocation remains in the Pending state forever (and the job in the Starting state).

Is this the intended behavior? Maybe the job should fail after a few tries?

Reproduction steps

# Run the job (see Job File below)
~ nomad run template-test.nomad

# Job status remains in this state:
~ nomad status  template-test                       
ID          = template-test
Name        = template-test
Type        = batch
Priority    = 50
Datacenters = dc1
Status      = running
Periodic    = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost
template-test  0       1         0        0       0         0

Allocations
ID        Eval ID   Node ID   Task Group     Desired  Status   Created At
47847e8c  efe9ec58  6a744838  template-test  run      pending  01/17/17 10:17:56 CET

# Allocation status
~ nomad alloc-status 47847e8c
ID                 = 47847e8c
Eval ID            = efe9ec58
Name               = template-test.template-test[0]
Node ID            = 6a744838
Job ID             = template-test
Client Status      = pending
Client Description = <none>
Created At         = 01/17/17 10:17:56 CET

Task "template-test" is "pending"
Task Resources
CPU      Memory  Disk  IOPS  Addresses
100 MHz  10 MiB  0 B   0     

Recent Events:
Time                   Type      Description
01/17/17 10:17:56 CET  Received  Task received by client

Job file

job "template-test" {
    datacenters = ["dc1"]

    type = "batch"

    group "template-test" {
        task "template-test" {
            driver = "exec"

            config {
                command = "sh"
                args = ["-c", "sleep 5; cat local/test.conf; exit 0"]
            }

            template {
                destination = "local/test.conf"
                data = "{{ key \"configtest\" }}"
            }
        }
    }
}

Nomad Server logs

    No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true
               Version: 0.5.2

==> Nomad agent started! Log data will stream in below:

    2017/01/17 10:17:44 [INFO] serf: EventMemberJoin: cyril.global 127.0.0.1
    2017/01/17 10:17:44.021771 [INFO] nomad: starting 4 scheduling worker(s) for [service batch system _core]
    2017/01/17 10:17:44.021974 [INFO] client: using state directory /tmp/nomad/client
    2017/01/17 10:17:44.021997 [INFO] client: using alloc directory /tmp/nomad/alloc
    2017/01/17 10:17:44.022055 [DEBUG] client: built-in fingerprints: [arch cgroup consul cpu host memory network nomad signal storage vault env_aws env_gce]
    2017/01/17 10:17:44.022090 [INFO] nomad: adding server cyril.global (Addr: 127.0.0.1:4647) (DC: dc1)
    2017/01/17 10:17:44 [INFO] raft: Node at 127.0.0.1:4647 [Follower] entering Follower state (Leader: "")
    2017/01/17 10:17:44.022229 [INFO] fingerprint.cgroups: cgroups are available
    2017/01/17 10:17:44.022425 [DEBUG] client: fingerprinting cgroup every 15s
    2017/01/17 10:17:44.023563 [INFO] fingerprint.consul: consul agent is available
    2017/01/17 10:17:44.023687 [DEBUG] client: fingerprinting consul every 15s
    2017/01/17 10:17:44.023913 [DEBUG] fingerprint.cpu: frequency: 2800 MHz
    2017/01/17 10:17:44.023917 [DEBUG] fingerprint.cpu: core count: 4
    2017/01/17 10:17:44.230647 [DEBUG] fingerprint.network: Detected interface lo with IP 127.0.0.1 during fingerprinting
    2017/01/17 10:17:44.230698 [DEBUG] fingerprint.network: Unable to read link speed from /sys/class/net/lo/speed
    2017/01/17 10:17:44.230701 [DEBUG] fingerprint.network: link speed could not be detected and no speed specified by user. Defaulting to 1000
    2017/01/17 10:17:44.232203 [DEBUG] client: fingerprinting vault every 15s
    2017/01/17 10:17:45 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2017/01/17 10:17:45 [INFO] raft: Node at 127.0.0.1:4647 [Candidate] entering Candidate state
    2017/01/17 10:17:45 [DEBUG] raft: Votes needed: 1
    2017/01/17 10:17:47.370578 [INFO] client: node registration complete
    2017/01/17 10:17:45 [DEBUG] raft: Vote granted from 127.0.0.1:4647. Tally: 1
    2017/01/17 10:17:47.370742 [DEBUG] client: updated allocations at index 1 (pulled 0) (filtered 0)
    2017/01/17 10:17:47.370838 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2017/01/17 10:17:45 [INFO] raft: Election won. Tally: 1
    2017/01/17 10:17:45 [INFO] raft: Node at 127.0.0.1:4647 [Leader] entering Leader state
    2017/01/17 10:17:47.370964 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["test"]
    2017/01/17 10:17:45 [INFO] raft: Disabling EnableSingleNode (bootstrap)
    2017/01/17 10:17:45.507179 [INFO] nomad: cluster leadership acquired
    2017/01/17 10:17:45 [DEBUG] raft: Node 127.0.0.1:4647 updated peer set (2): [127.0.0.1:4647]
    2017/01/17 10:17:45.507725 [DEBUG] leader: reconciling job summaries at index: 0
    2017/01/17 10:17:46.232749 [DEBUG] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
    2017/01/17 10:17:47.284615 [DEBUG] fingerprint.env_gce: Could not read value for attribute "machine-type"
    2017/01/17 10:17:47.284636 [DEBUG] fingerprint.env_gce: Error querying GCE Metadata URL, skipping
    2017/01/17 10:17:47.284683 [DEBUG] client: applied fingerprints [arch cgroup consul cpu host memory network nomad signal storage]
    2017/01/17 10:17:47.284904 [DEBUG] driver.docker: using client connection initialized from environment
    2017/01/17 10:17:47.285082 [DEBUG] client: fingerprinting rkt every 15s
    2017/01/17 10:17:47.286854 [DEBUG] driver.exec: exec driver is enabled
    2017/01/17 10:17:47.286987 [DEBUG] client: fingerprinting docker every 15s
    2017/01/17 10:17:47.287101 [DEBUG] client: fingerprinting exec every 15s
    2017/01/17 10:17:47.369927 [DEBUG] client: available drivers [docker exec raw_exec java]
    2017/01/17 10:17:47.370057 [INFO] client: Node ID "6a744838-96b4-8bd1-d7ba-c472ce26aa1a"
    2017/01/17 10:17:47.371539 [ERR] client.consul: error discovering nomad servers: no Nomad Servers advertising service "nomad" in Consul datacenters: ["test"]
    2017/01/17 10:17:47.371612 [DEBUG] client: periodically checking for node changes at duration 5s
    2017/01/17 10:17:47.371829 [DEBUG] client: state updated to ready
    2017/01/17 10:17:56.100297 [DEBUG] worker: dequeued evaluation efe9ec58-8b9f-6ea1-d761-0ee180a2bc06
    2017/01/17 10:17:56.100378 [DEBUG] sched: <Eval 'efe9ec58-8b9f-6ea1-d761-0ee180a2bc06' JobID: 'template-test'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
    2017/01/17 10:17:56.101104 [DEBUG] http: Request /v1/jobs?region=global (1.500512ms)
    2017/01/17 10:17:56.101780 [DEBUG] worker: submitted plan for evaluation efe9ec58-8b9f-6ea1-d761-0ee180a2bc06
    2017/01/17 10:17:56.101856 [DEBUG] sched: <Eval 'efe9ec58-8b9f-6ea1-d761-0ee180a2bc06' JobID: 'template-test'>: setting status to complete
    2017/01/17 10:17:56.101902 [DEBUG] client: updated allocations at index 8 (pulled 1) (filtered 0)
    2017/01/17 10:17:56.102229 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 0)
    2017/01/17 10:17:56.102500 [DEBUG] worker: updated evaluation <Eval 'efe9ec58-8b9f-6ea1-d761-0ee180a2bc06' JobID: 'template-test'>
    2017/01/17 10:17:56.102554 [DEBUG] worker: ack for evaluation efe9ec58-8b9f-6ea1-d761-0ee180a2bc06
    2017/01/17 10:17:56.103181 [DEBUG] http: Request /v1/evaluation/efe9ec58-8b9f-6ea1-d761-0ee180a2bc06?region=global (555.412µs)
    2017/01/17 10:17:56.104512 [DEBUG] http: Request /v1/evaluation/efe9ec58-8b9f-6ea1-d761-0ee180a2bc06/allocations?region=global (214.695µs)
    2017/01/17 10:17:56.110735 [DEBUG] client: starting task runners for alloc '47847e8c-f805-ca4b-e981-3ae38777e036'
    2017/01/17 10:17:56.110867 [DEBUG] client: starting task context for 'template-test' (alloc '47847e8c-f805-ca4b-e981-3ae38777e036')
    2017/01/17 10:17:56 [INFO] (runner) creating new runner (dry: false, once: false)
    2017/01/17 10:17:56 [DEBUG] (runner) final config (tokens suppressed):

{
  "Path": "",
  "Consul": "",
  "Token": "",
  "ReloadSignal": 1,
  "DumpSignal": 3,
  "KillSignal": 2,
  "Auth": {
    "Enabled": false,
    "Username": "",
    "Password": ""
  },
  "Vault": {
    "Address": "",
    "UnwrapToken": false,
    "RenewToken": false,
    "SSL": {
      "Enabled": true,
      "Verify": true,
      "Cert": "",
      "Key": "",
      "CaCert": "",
      "CaPath": "",
      "ServerName": ""
    }
  },
  "SSL": {
    "Enabled": false,
    "Verify": true,
    "Cert": "",
    "Key": "",
    "CaCert": "",
    "CaPath": "",
    "ServerName": ""
  },
  "Syslog": {
    "Enabled": false,
    "Facility": "LOCAL0"
  },
  "Exec": {
    "Command": "",
    "Splay": 0,
    "ReloadSignal": null,
    "KillSignal": 15,
    "KillTimeout": 30000000000
  },
  "MaxStale": 1000000000,
  "ConfigTemplates": [
    {
      "Source": "",
      "Destination": "/tmp/nomad/alloc/47847e8c-f805-ca4b-e981-3ae38777e036/template-test/local/test.conf",
      "EmbeddedTemplate": "{{ key \"configtest\" }}",
      "Command": "",
      "CommandTimeout": 0,
      "Perms": 420,
      "Backup": false,
      "LeftDelim": "",
      "RightDelim": "",
      "Wait": {
        "min": 0,
        "max": 0
      }
    }
  ],
  "Retry": 5000000000,
  "Wait": {
    "min": 0,
    "max": 0
  },
  "PidFile": "",
  "LogLevel": "WARN",
  "Deduplicate": {
    "Enabled": false,
    "Prefix": "consul-template/dedup/",
    "TTL": 15000000000
  }
}

    2017/01/17 10:17:56 [INFO] (clients) creating consul/api client
    2017/01/17 10:17:56 [INFO] (clients) creating vault/api client
    2017/01/17 10:17:56 [DEBUG] (clients) enabling vault SSL
    2017/01/17 10:17:56 [INFO] (runner) creating Watcher
    2017/01/17 10:17:56 [INFO] (runner) starting
    2017/01/17 10:17:56 [DEBUG] (runner) running initial templates
    2017/01/17 10:17:56 [INFO] (runner) running
    2017/01/17 10:17:56 [DEBUG] (runner) checking template a7c72e8af6bd3ec74e72f63455032e62
    2017/01/17 10:17:56 [INFO] (runner) was not watching 2 dependencies
    2017/01/17 10:17:56 [INFO] (watcher) adding "key(configtest)"
    2017/01/17 10:17:56 [DEBUG] (watcher) "key(configtest)" starting
    2017/01/17 10:17:56 [INFO] (watcher) adding "key(configtest)"
    2017/01/17 10:17:56 [DEBUG] (watcher) "key(configtest)" already exists, skipping
    2017/01/17 10:17:56 [INFO] (runner) diffing and updating dependencies
    2017/01/17 10:17:56 [INFO] (runner) watching 1 dependencies
    2017/01/17 10:17:56 [DEBUG] (view) "key(configtest)" starting fetch
    2017/01/17 10:17:56 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:0 WaitTime:1m0s Token: Near:}
    2017/01/17 10:17:56 [DEBUG](view) "key(configtest)" data was not present
    2017/01/17 10:17:56 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:9679 WaitTime:1m0s Token: Near:}
    2017/01/17 10:17:56.220845 [DEBUG] client: updated allocations at index 10 (pulled 0) (filtered 1)
    2017/01/17 10:17:56.220943 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
    2017/01/17 10:18:57 [DEBUG] (view) "key(configtest)" no new data (index was the same)
    2017/01/17 10:18:57 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:9679 WaitTime:1m0s Token: Near:}
    2017/01/17 10:19:44.923652 [DEBUG] http: Request /v1/jobs?prefix=template-test (1.383435ms)
    2017/01/17 10:19:44.928406 [DEBUG] http: Request /v1/job/template-test (3.432318ms)
    2017/01/17 10:19:44.932277 [DEBUG] http: Request /v1/job/template-test
    2017/01/17 10:20:01 [DEBUG] (view) "key(configtest)" no new data (index was the same)
    2017/01/17 10:20:01 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:9679 WaitTime:1m0s Token: Near:}
    2017/01/17 10:21:03 [DEBUG] (view) "key(configtest)" no new data (index was the same)
    2017/01/17 10:21:03 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:9679 WaitTime:1m0s Token: Near:}
    2017/01/17 10:22:03 [DEBUG] (view) "key(configtest)" no new data (index was the same)
    2017/01/17 10:22:03 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:9679 WaitTime:1m0s Token: Near:}
    2017/01/17 10:22:45.508362 [DEBUG] worker: dequeued evaluation 9c0b83e9-e876-a3af-b95f-f553c405680b
    2017/01/17 10:22:45.508526 [DEBUG] sched.core: eval GC: scanning before index 0 (1h0m0s)
    2017/01/17 10:22:45.508693 [DEBUG] worker: ack for evaluation 9c0b83e9-e876-a3af-b95f-f553c405680b
    2017/01/17 10:22:45.508801 [DEBUG] worker: dequeued evaluation 8c1260fb-3736-73b1-9bd4-94571760db4d
    2017/01/17 10:22:45.508903 [DEBUG] sched.core: job GC: scanning before index 0 (4h0m0s)
    2017/01/17 10:22:45.508997 [DEBUG] worker: ack for evaluation 8c1260fb-3736-73b1-9bd4-94571760db4d
    2017/01/17 10:22:45.509104 [DEBUG] worker: dequeued evaluation 510e8440-ac18-c2a1-dfd5-cc25ede176a6
    2017/01/17 10:22:45.509189 [DEBUG] sched.core: node GC: scanning before index 0 (24h0m0s)
    2017/01/17 10:22:45.509273 [DEBUG] worker: ack for evaluation 510e8440-ac18-c2a1-dfd5-cc25ede176a6
    2017/01/17 10:23:04 [DEBUG] (view) "key(configtest)" no new data (index was the same)
    2017/01/17 10:23:04 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:9679 WaitTime:1m0s Token: Near:}
    2017/01/17 10:23:12.283734 [DEBUG] client: updated allocations at index 10 (pulled 0) (filtered 1)
    2017/01/17 10:23:12.284105 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)
    2017/01/17 10:24:08 [DEBUG] (view) "key(configtest)" no new data (index was the same)
    2017/01/17 10:24:08 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:9679 WaitTime:1m0s Token: Near:}
    2017/01/17 10:25:11 [DEBUG] (view) "key(configtest)" no new data (index was the same)
    2017/01/17 10:25:11 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:9679 WaitTime:1m0s Token: Near:}
    2017/01/17 10:26:00.511325 [DEBUG] http: Request /v1/jobs?prefix=template-test (1.322226ms)
    2017/01/17 10:26:00.514341 [DEBUG] http: Request /v1/job/template-test (773.681µs)
    2017/01/17 10:26:00.523499 [DEBUG] http: Request /v1/job/template-test/allocations (1.560619ms)
    2017/01/17 10:26:00.531481 [DEBUG] http: Request /v1/job/template-test/evaluations (462.904µs)
    2017/01/17 10:26:00.534095 [DEBUG] http: Request /v1/job/template-test/summary (705.178µs)
    2017/01/17 10:26:14 [DEBUG] (view) "key(configtest)" no new data (index was the same)
    2017/01/17 10:26:14 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:9679 WaitTime:1m0s Token: Near:}
    2017/01/17 10:27:15 [DEBUG] (view) "key(configtest)" no new data (index was the same)
    2017/01/17 10:27:15 [DEBUG] ("key(configtest)") querying consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:9679 WaitTime:1m0s Token: Near:}
  [...]
jippi commented 7 years ago

That's documented behavior in consul-template. Please use key_or_default (renamed to keyOrDefault in a recent version, which is not bundled with Nomad yet) if you want to avoid this blocking behavior, or wrap your usage in keyExists.
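The two workarounds jippi mentions could look like this in the reporter's template stanza (a sketch against consul-template's documented key_or_default and keyExists functions; the fallback string "fallback-value" is purely illustrative):

```hcl
template {
    destination = "local/test.conf"

    # Variant 1: key_or_default renders the given fallback instead of
    # blocking when "configtest" is missing or unreadable.
    data = "{{ key_or_default \"configtest\" \"fallback-value\" }}"

    # Variant 2 (alternative): keyExists returns false without blocking,
    # so the template renders empty when the key is absent.
    # data = "{{ if keyExists \"configtest\" }}{{ key \"configtest\" }}{{ end }}"
}
```

With either variant the template renders immediately, so the allocation can leave the Pending state even when the key cannot be read.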

dadgar commented 7 years ago

@cyrilgdn As @jippi explained, this is expected behavior. Nomad 0.5.3 will bring key_or_default, which you can use to get non-blocking behavior. Further, it will hopefully bring task events that will make understanding this behavior more straightforward! Thanks!

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.