hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

job stopped and not restarted when vault timeout #8556

Open chris93111 opened 4 years ago

chris93111 commented 4 years ago

Nomad version

Nomad v0.11.3 (8918fc804a0c6758b6e3e9960e4eb2e605e38552)

Operating system and Environment details

redhat 7

Issue

When a Vault request times out, the job goes down and Nomad does not try to restart it or recheck whether Vault is available again. The job has to be restarted manually before it works again.

Reproduction steps

Start the job, then stop Vault.

Job file (if appropriate)

    group "node" {
      restart {
        attempts = 3
        delay    = "120s"
      }

      task "web" {
        template {
          data = <<EOF
    SECRET_KEY = "{{with secret "config/production"}}{{.Data.data.secret_key}}{{end}}"
    DATABASE_NAME = "{{with secret "config/production"}}{{.Data.data.pg_db_name}}{{end}}"
    DATABASE_PASSWORD = "{{with secret "config/production"}}{{.Data.data.pg_db_pass}}{{end}}"
    DATABASE_USER = "{{with secret "config/production"}}{{.Data.data.pg_db_user}}{{end}}"
    DATABASE_PORT = "{{with secret "config/production"}}{{.Data.data.pg_db_port}}{{end}}"
    DATABASE_HOST = "{{with secret "config/production"}}{{.Data.data.pg_db_host}}{{end}}"
    ADMIN_PASSWORD = "{{with secret "config/production"}}{{.Data.data.admin_default_pass}}{{end}}"
    VERSION = "{{with secret "config/production"}}{{.Data.data.version}}{{end}}"
    EOF

          destination = "secrets/file.env"
          change_mode = "restart"
          env         = true
        }

Nomad Server logs (if appropriate)

    Jul 28 09:38:11 nomad01 nomad[14097]: 2020/07/28 09:38:11.533272 [WARN] (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 2 after "500ms")
    Jul 28 09:38:11 nomad01 nomad[14097]: (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 2 after "500ms")
    Jul 28 09:38:12 nomad01 nomad[14097]: (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 3 after "1s")
    Jul 28 09:38:12 nomad01 nomad[14097]: 2020/07/28 09:38:12.035696 [WARN] (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 3 after "1s")
    Jul 28 09:38:13 nomad01 nomad[14097]: 2020/07/28 09:38:13.037049 [WARN] (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 4 after "2s")
    Jul 28 09:38:13 nomad01 nomad[14097]: (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 4 after "2s")
    Jul 28 09:38:15 nomad01 nomad[14097]: 2020/07/28 09:38:15.038490 [WARN] (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 5 after "4s")
    Jul 28 09:38:15 nomad01 nomad[14097]: (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 5 after "4s")
    Jul 28 09:38:19 nomad01 nomad[14097]: 2020/07/28 09:38:19.039867 [WARN] (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 6 after "8s")
    Jul 28 09:38:19 nomad01 nomad[14097]: (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 6 after "8s")
    Jul 28 09:38:27 nomad01 nomad[14097]: (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 7 after "16s")
    Jul 28 09:38:27 nomad01 nomad[14097]: 2020/07/28 09:38:27.043118 [WARN] (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 7 after "16s")
    Jul 28 09:38:43 nomad01 nomad[14097]: (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 8 after "32s")
    Jul 28 09:38:43 nomad01 nomad[14097]: 2020/07/28 09:38:43.047283 [WARN] (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 8 after "32s")
    Jul 28 09:39:15 nomad01 nomad[14097]: 2020/07/28 09:39:15.048652 [WARN] (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 9 after "1m0s")
    Jul 28 09:39:15 nomad01 nomad[14097]: (view) catalog.nodes: Unexpected response code: 500 (No known Consul servers) (retry attempt 9 after "1m0s")
    Jul 28 09:39:37 nomad01 nomad[14097]: client.fingerprint_mgr.vault: Vault is unavailable
    Jul 28 09:39:37 nomad01 nomad[14097]: 2020-07-28T09:39:37.461+0200 [INFO] client.fingerprint_mgr.vault: Vault is unavailable
    Jul 28 09:39:43 nomad01 nomad[14097]: 2020-07-28T09:39:43.708+0200 [INFO] client: node registration complete
    Jul 28 09:43:15 nomad01 nomad[14097]: 2020/07/28 09:43:15.639889 [INFO] (runner) stopping
    Jul 28 09:43:15 nomad01 nomad[14097]: 2020-07-28T09:43:15.839+0200 [INFO] client.gc: marking allocation for GC: alloc_id=cdab424e-0d9c-ec2b-8a49-43f93ba26dae
    Jul 28 09:43:15 nomad01 nomad[14097]: client.gc: marking allocation for GC: alloc_id=cdab424e-0d9c-ec2b-8a49-43f93ba26dae
    Jul 28 09:44:21 nomad01 nomad[14097]: nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "c7fdac8a-694c-1cf5-1112-ff64db63a165", node: "783807b5-
    Jul 28 09:44:21 nomad01 nomad[14097]: 2020-07-28T09:44:21.005+0200 [WARN] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "c7fdac8a-694c-1cf5-
    Jul 28 09:44:21 nomad01 nomad[14097]: 2020-07-28T09:44:21.005+0200 [WARN] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "23fee0b1-0700-2e64-
    Jul 28 09:44:21 nomad01 nomad[14097]: 2020-07-28T09:44:21.005+0200 [WARN] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "cdab424e-0d9c-ec2b-
    Jul 28 09:44:21 nomad01 nomad[14097]: 2020-07-28T09:44:21.005+0200 [WARN] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "bcfe9a88-6ac3-f2d1-
    Jul 28 09:44:21 nomad01 nomad[14097]: nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "23fee0b1-0700-2e64-831b-f29a3fcf0387", node: "d003b8ed-
    Jul 28 09:44:21 nomad01 nomad[14097]: 2020-07-28T09:44:21.005+0200 [WARN] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "859ebf6b-7c18-8ff1-
    Jul 28 09:44:21 nomad01 nomad[14097]: nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "cdab424e-0d9c-ec2b-8a49-43f93ba26dae", node: "0d509c3e-
    Jul 28 09:44:21 nomad01 nomad[14097]: nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "bcfe9a88-6ac3-f2d1-ce7a-9a4ed47956e2", node: "d7051094-
    Jul 28 09:44:21 nomad01 nomad[14097]: nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "859ebf6b-7c18-8ff1-d624-ceadcc48b86d", node: "3a3d4f32-
    Jul 28 09:57:14 nomad01 nomad[14097]: nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "aae00bd5-34fa-3713-4b43-8436e0c4c6bf", node: "3a3d4f32-
    Jul 28 09:57:14 nomad01 nomad[14097]: 2020-07-28T09:57:14.729+0200 [WARN] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "aae00bd5-34fa-3713-
    Jul 28 09:57:14 nomad01 nomad[14097]: 2020-07-28T09:57:14.730+0200 [WARN] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "936703ca-7462-ea0d-
    Jul 28 09:57:14 nomad01 nomad[14097]: nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "936703ca-7462-ea0d-4580-29f595559c55", node: "d7051094-
    Jul 28 09:57:47 nomad01 nomad[14097]: nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "e4e74e06-560e-5b4f-3f41-6b213c0a8e1b", node: "d003b8ed-
    Jul 28 09:57:47 nomad01 nomad[14097]: 2020-07-28T09:57:47.516+0200 [WARN] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "e4e74e06-560e-5b4f-
    Jul 28 09:59:39 nomad01 nomad[14097]: 2020-07-28T09:59:39.270+0200 [WARN] nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "e4a07819-fcc6-9b7f-
    Jul 28 09:59:39 nomad01 nomad[14097]: nomad.vault: failed to revoke tokens. Will reattempt until TTL: error="failed to revoke token (alloc: "e4a07819-fcc6-9b7f-2a98-522f99758b3b", node: "783807b5-
    Jul 28 10:03:40 nomad01 nomad[14097]: 2020-07-28T10:03:40.706+0200 [INFO] client.fingerprint_mgr.vault: Vault is available
    Jul 28 10:03:40 nomad01 nomad[14097]: client.fingerprint_mgr.vault: Vault is available
    Jul 28 10:03:46 nomad01 nomad[14097]: client: node registration complete
    Jul 28 10:03:46 nomad01 nomad[14097]: 2020-07-28T10:03:46.727+0200 [INFO] client: node registration complete
    Jul 28 10:49:09 nomad01 nomad[14097]: client.gc: garbage collecting allocation: alloc_id=cdab424e-0d9c-ec2b-8a49-43f93ba26dae reason="forced collection"
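
For reference, the retry delays in the catalog.nodes warnings in the server logs double on each attempt (500ms, 1s, 2s, up to 32s) and then flatten at 1m0s on attempt 9. A minimal sketch of a schedule that reproduces those logged values, assuming a 250ms base and a 1 minute cap. Both numbers and the `retry_delay` helper are inferred from the log lines for illustration only, not taken from Nomad's source:

```python
from datetime import timedelta

# Assumed values, inferred from the log lines above; not read from Nomad's source.
BASE = timedelta(milliseconds=250)
CAP = timedelta(minutes=1)


def retry_delay(attempt: int) -> timedelta:
    """Return the delay for a 1-based retry attempt: doubles each time, capped at CAP."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(BASE * 2 ** (attempt - 1), CAP)


if __name__ == "__main__":
    # Attempts 2 through 9 are the ones that appear in the server logs.
    for attempt in range(2, 10):
        print(attempt, retry_delay(attempt).total_seconds())
```

This only models the spacing between the logged retries; it says nothing about how many attempts are made before the runner gives up.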

Nomad Client logs (if appropriate)

    Jul 28 09:55:55 node01 nomad[15942]: (view) vault.read(config/production): vault.read(config/production): Get "http:/vault:8200/v1/daXXXXXXXXXXXXXXXXXX i/o timeout
    Jul 28 09:55:55 node01 nomad[15942]: 2020/07/28 09:55:55.751116 [WARN] (view) vault.read(config/production): vault.read(config/production): Get "http://vault
    Jul 28 09:55:55 node01 nomad[15942]: 2020/07/28 09:55:55.751118 [WARN] (view) vault.read(config/production): vault.read(config/production): Get "http://vault
    Jul 28 09:55:55 node01 nomad[15942]: 2020/07/28 09:55:55.751210 [ERR] (runner) watcher reported error: vault.read(config/production): vault.read(config/produc XXXXXXXXXXX i/o timeout
    Jul 28 09:55:55 node01 nomad[15942]: (view) vault.read(config/production): vault.read(config/production): Get "http:/vault:8200/v1/da
    Jul 28 09:55:55 node01 nomad[15942]: (view) vault.read(config/production): vault.read(config/production): Get "http:/vault:8200/v1/da
    Jul 28 09:55:55 node01 nomad[15942]: (runner) watcher reported error: vault.read(config/production): vault.read(config/production): Get "http://vault
    Jul 28 09:56:00 node01 nomad[15942]: client.driver_mgr.docker: stopped container: container_id=d7005b117baa2520d93913e676adba5fb7d6336ee218af6fba228af90e2c8616 driver=docker
    Jul 28 09:56:00 node01 nomad[15942]: 2020-07-28T09:56:00.944+0200 [INFO] client.driver_mgr.docker: stopped container: container_id=d7005b117baa2520d93913e676adba5fb7d6336ee218af6fba228af90e2c8616
    Jul 28 09:56:00 node01 nomad[15942]: (runner) stopping
    Jul 28 09:56:00 node01 nomad[15942]: 2020/07/28 09:56:00.990275 [INFO] (runner) stopping
    Jul 28 09:56:01 node01 nomad[15942]: (runner) stopping
    Jul 28 09:56:01 node01 nomad[15942]: 2020/07/28 09:56:01.296966 [INFO] (runner) stopping
    Jul 28 09:56:01 node01 nomad[15942]: 2020/07/28 09:56:01.297036 [INFO] (runner) received finish
    Jul 28 09:56:01 node01 nomad[15942]: (runner) received finish
    Jul 28 09:56:03 node01 nomad[15942]: 2020/07/28 09:56:03.506544 [INFO] (runner) stopping
    Jul 28 09:56:03 node01 nomad[15942]: (runner) stopping
    Jul 28 09:56:03 node01 nomad[15942]: 2020/07/28 09:56:03.506680 [INFO] (runner) received finish
    Jul 28 09:56:03 node01 nomad[15942]: 2020-07-28T09:56:03.506+0200 [INFO] client.gc: marking allocation for GC: alloc_id=aae00bd5-34fa-3713-4b43-8436e0c4c6bf
    Jul 28 09:56:03 node01 nomad[15942]: (runner) received finish

chris93111 commented 4 years ago

https://github.com/hashicorp/nomad/issues/2689