hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Occasionally, when a template changes, allocations are not restarted. #4770

Closed dansteen closed 5 years ago

dansteen commented 6 years ago

Nomad version

Nomad v0.8.6 (ab54ebcfcde062e9482558b7c052702d4cb8aa1b+CHANGES)

Operating system and Environment details

debian 9.3

Issue

We have an issue where dynamic credentials generated by Vault are rotated and the file that the template writes to is updated, but the application is not restarted.

An example of this is the following allocation. Notice that the alloc's Modified time has not changed since it was created, and that no app restart has taken place since it was created on 2018-10-03:

nomad alloc status b4ba0912                                                                                        
ID                  = b4ba0912
Eval ID             = 4d7f119e
Name                = traefik-prod.traefik[0]
Node ID             = eb919538
Job ID              = traefik-prod
Job Version         = 6
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 7d28m ago
Modified            = 7d27m ago

Task "app" is "running"
Task Resources
CPU        Memory          Disk     IOPS  Addresses
5/200 MHz  25 MiB/200 MiB  300 MiB  0     http: 10.0.50.138:9080
                                          https: 10.0.50.138:9443
                                          admin: 10.0.50.138:9090

Task Events:
Started At     = 2018-10-03T20:10:17Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type                   Description
2018-10-03T16:10:17-04:00  Started                Task started by client
2018-10-03T16:10:16-04:00  Downloading Artifacts  Client is downloading artifacts
2018-10-03T16:10:14-04:00  Task Setup             Building Task Directory
2018-10-03T16:10:14-04:00  Received               Task received by client

But if I dig into the allocation folder, I see that the file the template writes to was updated on 2018-10-10. Per the job file (below), an update to that file should trigger a restart of the application:

/var/nomad/alloc/b4ba0912-1a96-43fa-662b-9f988760a3e5/app/secrets$ ls -ltr
total 8
-rwxr-xr-x 1 root root 36 Oct  3 20:10 vault_token
-rw-r--r-- 1 root root 85 Oct 10 19:33 file.env

Reproduction steps

Given the following Terraform config:

# first create the vault role that will allow access to the consul ACL generator
data "template_file" "vault-policy-traefik" {
  template = "${file("${root.path}/vault-policy-traefik.tpl")}"

  vars {
    env  = "stag"
  }
}

resource "vault_policy" "traefik" {
  name   = "traefik-stag"
  policy = "${data.template_file.vault-policy-traefik.rendered}"
}

# this is the policy that traefik needs.
data "template_file" "consul-policy-traefik" {
  template = "${file("${path.root}/consul-policy-traefik.tpl")}"

  vars {
    env  = "${var.env}"
  }
}

# create the consul role for this app so we can generate consul acl tokens on-the-fly
resource "vault_generic_secret" "traefik-acl" {
  path = "consul/roles/traefik-stag"

  # this will cause the lease to be refreshed every hour.  It is still subject to the global max_ttl.
  data_json = <<EOT
  {
    "token_type":"client",
    "lease":"3600",
    "policy":"${base64encode(data.template_file.consul-policy-traefik.rendered)}"
  }
EOT
}

and the template file consul-policy-traefik.tpl referenced above:

key "traefik/${env}" {
  policy = "write"
}

session "" {
  policy = "write"
}

service "" {
  policy = "read"
}

node "" {
  policy = "read"
}

agent "" {
  policy = "read"
}

and the template file vault-policy-traefik.tpl referenced above:

# traefik needs to get consul creds
path "consul/creds/traefik-${env}" {
  capabilities = [ "read" ]
}
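
Before moving on to the Nomad job, the wiring above can be sanity-checked directly from the Vault CLI. This is only a quick verification sketch; it assumes the Consul secrets engine is mounted at consul/ as in the config above and that your CLI token may read these paths:

vault policy read traefik-stag
vault read consul/creds/traefik-stag

The second command should return a Consul token together with a lease of roughly the 3600 seconds set on the role.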

Job file

The following Nomad job does not restart the task when the template data changes:

job "traefik-stag" {
  datacenters = ["awse"]
  type        = "system"

  constraint {
    attribute = "${meta.env}"
    value     = "stag"
  }

  # set our update policy
  update {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "30s"
    healthy_deadline = "3m"

    auto_revert = true

    #canary           = 1
    #stagger          = "30s"
  }

  group "traefik" {
    restart {
      interval = "2m"
      attempts = "3"
      delay    = "30s"
      mode     = "delay"
    }

    task "app" {
      leader       = true
      kill_timeout = "90s"

      template {
        data = <<EOH
# set the CONSUL_HTTP_TOKEN
CONSUL_HTTP_TOKEN="<<with secret (printf "consul/creds/traefik-%s" (env "CHEF_ENV"))>><<.Data.token>><<end>>"
EOH

        destination     = "secrets/file.env"
        left_delimiter  = "<<"
        right_delimiter = ">>"
        env             = true
        change_mode     = "restart"
        vault_grace     = "50m"
        splay           = "1m"
      }

      # grab the traefik binary
      artifact {
        source      = "s3://s3.amazonaws.com/install-files/traefik-1.7.2_linux_amd64.gz"
        destination = "local/traefik"
        mode        = "file"
      }

      # set our environment variables
      env {
        CHEF_ENV   = "stag"
        NOMAD_ADDR = "https://localhost:4646"
        VAULT_ADDR = "https://active.vault.service.consul:8200"
      }

      # grant access to secrets
      vault {
        policies    = ["traefik-stag", "default"]
        change_mode = "restart"
      }

      # run the job
      driver = "exec"

      config {
        command = "local/traefik"

        args = [
          "--constraints=tag==${meta.env}",
          "--entrypoints=Name:http Address::9080 Redirect.Regex:^http://(.*)(:[0-9]*)?/(.*) Redirect.Replacement:https://$1/$3 Redirect.Permanent:true",
          "--entrypoints=Name:https Address::9443",
          "--entrypoints=Name:traefik Address::9090",
          "--defaultentrypoints=http,https",
          "--lifecycle.requestacceptgracetimeout=10",
          "--ping=true",
          "--api=true",
          "--metrics=true",
          "--metrics.datadog=true",
          "--consulcatalog.exposedbydefault=false",
          "--consul=true",
          "--consul.prefix=traefik/${meta.env}",
          "--consul.watch=true",
        ]
      }

      resources {
        cpu    = 200
        memory = 200

        network {
          port "http" {
            static = "9080"
          }

          port "https" {
            static = "9443"
          }

          port "admin" {
            static = "9090"
          }
        }
      }

      # add in service discovery
      service {
        name = "traefik"

        tags = ["${node.unique.name}", "host__${node.unique.name}", "version__1076149061919b209b3fa839c4f8d6ca1e263658", "${meta.env}", "env__${meta.env}"]
        port = "http"

        check {
          name           = "http"
          path           = "/ping"
          initial_status = "critical"
          type           = "http"
          protocol       = "http"
          port           = "admin"
          interval       = "10s"
          timeout        = "2s"
        }
      }
    }
  }
}
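
When reproducing this, it can also help to watch the Nomad client logs for the embedded template runner around the time the credentials rotate. A rough check, assuming a systemd-managed agent (adjust the unit name and time window for your setup):

journalctl -u nomad --since "1 hour ago" | grep -i template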
angrycub commented 6 years ago

This could also be related to #4226.

endocrimes commented 6 years ago

It looks like this bug has been fixed as part of the client refactoring work in 0.9. Given the following job file:

job "example" {
  datacenters = ["dc1"]

  type = "batch"

  group "cache" {
    count = 1

    restart {
      attempts = 2
      interval = "30m"

      delay = "15s"

      mode = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "redis" {
      driver = "raw_exec"

      config {
        command = "bash"
        args    = ["-c", "env; sleep 1000"]
      }

      resources {
        network {
          mbits = 10
          port  "db"  {}
        }
      }

      service {
        name = "redis-cache"
        tags = ["global", "cache"]
        port = "db"

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      template {
        data        = "---\nkey: {{ key \"foo\" }}"
        destination = "local/file.yml"
        change_mode = "restart"
      }
    }
  }
}

I get the following after changing Consul values:

[nomad(b-consul-template)] $ nomad status 76fc88df
ID                  = 76fc88df
Eval ID             = a8ed26c9
Name                = example.cache[0]
Node ID             = 22bca3ea
Job ID              = example
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 30s ago
Modified            = 2s ago

Task "redis" is "running"
Task Resources
CPU         Memory          Disk     IOPS  Addresses
26/100 MHz  22 MiB/300 MiB  300 MiB  0     db: 127.0.0.1:26380

Task Events:
Started At     = 2018-11-21T15:04:23Z
Finished At    = N/A
Total Restarts = 2
Last Restart   = 2018-11-21T16:04:23+01:00

Recent Events:
Time                       Type        Description
2018-11-21T16:04:23+01:00  Started     Task started by client
2018-11-21T16:04:23+01:00  Restarting  Task restarting in 0s
2018-11-21T16:04:18+01:00  Restarting  Template with change_mode restart re-rendered
2018-11-21T16:03:55+01:00  Started     Task started by client
2018-11-21T16:03:55+01:00  Task Setup  Building Task Directory
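
For reference, the re-render above was triggered by updating the watched key from another terminal; with a local Consul agent and the job file above, something like:

consul kv put foo "some new value"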
vkiranananda commented 5 years ago

change_mode = "restart" is not good idea ... If config have errors service fail.

My config:

template {
  data          = "{{ range ls \"fedsp/apache2/sites-enabled\" }} {{ .Value }} \n\n {{ end }}"
  destination   = "sites.conf"
  change_mode   = "signal"
  change_signal = "SIGUSR1"
}

The file doesn't update :(
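
Two quick checks that can help narrow this down (the template and output file names below are only placeholders for illustration): confirm the prefix actually contains keys, and try rendering the same template once with the standalone consul-template binary outside of Nomad:

consul kv get -recurse fedsp/apache2/sites-enabled
consul-template -template "sites.tpl:sites.conf" -once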

endocrimes commented 5 years ago

Closing this issue because the root issue is fixed in 0.9

frederikbosch commented 4 years ago

To what extent is it known that this is really fixed?

I have the following template:

      template {
        data = "{{with secret \"secret/kv/certificate/domain\"}}{{.Data.privkey}}{{end}}"
        destination = "secrets/cert.key"
        change_mode = "signal"
        change_signal = "SIGHUP"
      }

      template {
        data = "{{with secret \"secret/kv/certificate/domain\"}}{{.Data.fullchain}}{{end}}"
        destination = "secrets/cert.crt"
        change_mode = "signal"
        change_signal = "SIGHUP"
      }

The key is updated in Vault, but the rendered template is not updated for the running allocations, and no SIGHUP is sent as indicated by the change_mode.

wlonkly commented 4 years ago

@frederikbosch Vault templates don't refresh when the Vault secret changes; they refresh based on the TTL of the secret. The rest of this issue was about Consul-driven templates, which do re-render when the Consul data changes.

I can't directly link to the right section of consul-template's README, so I'll paste it here:

Please note that Vault does not support blocking queries. As a result, Consul Template will not immediately reload in the event a secret is changed as it does with Consul's key-value store. Consul Template will renew the secret with Vault's Renewer API. The Renewer API tries to use most of the time the secret is good, renewing at around 90% of the lease time (as set by Vault).

(This bit me too in the past, so you're not alone!)
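
As a rough worked example: with the lease of 3600 seconds configured on consul/roles/traefik-stag in the original report, that renewal behaviour works out to a re-render roughly every 54 minutes (90% of the lease), independent of when the underlying Consul policy or secret actually changes.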

frederikbosch commented 4 years ago

@wlonkly I found that out in the meantime. If you are using a KV v1 secrets engine, you can add a field named ttl to the secret, with the value set to an integer number of seconds, and that is then used as the TTL by Nomad / Consul Template.
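
As a concrete sketch of that workaround (the secret path matches the templates above; the certificate file names and the 3600-second value are just assumptions for the example, and the mount must be KV version 1):

vault kv put secret/kv/certificate/domain ttl=3600 privkey=@privkey.pem fullchain=@fullchain.pem

Vault then reports that ttl as the secret's lease duration, and Nomad's embedded Consul Template re-reads the secret on that basis instead of the mount's much longer default lease.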

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.