hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Occasionally seeing 'Cannot promote terminal deployment' for blue/green approach and deployment being automatically promoted #5305

Open sloan-dog opened 5 years ago

sloan-dog commented 5 years ago

If you have a question, prepend your issue with [question] or preferably use the nomad mailing list.

If filing a bug please include the following:

Nomad version

Output from nomad version 0.8.4

Operating system and Environment details

Ubuntu 16.04. Running a 5-node cluster on EC2 t2.medium instances, each of which runs Vault, Nomad, and Consul.

Issue

We use the canary deployment strategy with a canary count that matches the group count to achieve blue/green deployment. Our strategy is to poll the deployment status until all canary allocs are healthy and then promote (we fail if any one of them becomes unhealthy by timeout). We use the docker driver.

However, we occasionally encounter an error when calling promote: Cannot promote terminal deployment

The message makes it clear that the deployment is terminal, but I don't know what that means, or how a canary deployment can become terminal without my doing anything. The odd thing is that the container is running the correct build version. (The build version is passed to the container via environment variables and served by the API.)

Reproduction steps

TL;DR:
1. Interpolate the Docker image name and build ID into the job spec.
2. Convert the job HCL to JSON.
3. Write it to the jobs endpoint.
4. Read the deployment ID.
5. Poll the deployment for health.
6. Promote the deployment.
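For readers following along, the steps above map roughly onto the Nomad HTTP API calls listed in this sketch. It is only an illustration: the helper name `workflow_requests` is hypothetical, the address is Nomad's default local agent address rather than anything from this cluster, and fetching the deployment ID via the evaluation is taken from the later comment in this thread.

```python
def workflow_requests(eval_id: str, deployment_id: str,
                      addr: str = "http://127.0.0.1:4646") -> list:
    """List the (method, URL) pairs a deploy script would call, in order.

    addr is an assumed agent address; eval_id and deployment_id come from
    the responses of the earlier calls.
    """
    return [
        ("POST", f"{addr}/v1/jobs"),                      # register the job JSON
        ("GET", f"{addr}/v1/evaluation/{eval_id}"),       # read DeploymentID
        ("GET", f"{addr}/v1/deployment/{deployment_id}"), # poll for health
        ("POST", f"{addr}/v1/deployment/promote/{deployment_id}"),  # promote
    ]
```

The function only builds the request list, so it can be checked without a running cluster.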

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Job file (if appropriate)

job "core-api-unstable" {
  datacenters = ["dc1"]

  type = "service"

  group "core-api-unstable" {

    count = 2

    update {
      # https://www.nomadproject.io/docs/job-specification/update.html
      max_parallel     = 1
      canary           = 2
      min_healthy_time = "10s"
      healthy_deadline = "1m"
      auto_revert      = true
      health_check     = "checks"
      progress_deadline = "2m"
    }

    task "core-api-unstable" {

      template {
        data = <<EOH

...REDACTED

EOH
        destination = "secrets/file.env"
        env = true
      }

      logs {
        max_files = 10
        max_file_size = 10
      }
      driver = "docker"
      config {
        image = "$[[NOMAD_META_DOCKER_IMAGE_URI]]"
        network_mode = "host"
      }
      resources {
        cpu    = 500 # MHz
        memory = 500 # Megabytes
        network {
          mbits = 1
          port "http" {}
        }
      }
      # tells nomad how to register service with consul
      service {
        name = "core-api-unstable"

        # this tag is used by fabio to know that any requests with header Host: 'unstable-core-api.joe.coffee'
        # should target this job group
        tags = ["REDACTED/"]

        canary_tags = ["REDACTED"]
        port = "http"
        check {
          path = "/api/v1"
          name     = "unstable-core-api-alive"
          type     = "http"
          interval = "5s"
          timeout  = "2s"
        }
      }

      # TODO - extract these from vault
      env {
        "PG_DB_HOST" = "REDACTED"
        "PG_DB_PORT" = "5432"
        # dynamic ports automatically exposed via this env
        # format NOMAD_PORT_${portname}
        "PORT" = "${NOMAD_PORT_http}"
        "NODE_ENV" = "unstable"
        "BUILD_ID" = "$[[NOMAD_META_BUILD_ID]]"
      }
    }
  }
}

endocrimes commented 5 years ago

@sloan-dog That error is returned when the deployment status is not Running or Paused.

Could you try logging the data around the status of the deployment so we can get a clearer understanding of what is happening? - Thanks!
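As a rough illustration of that condition, a deploy script could check the status itself before calling promote. This is a minimal sketch, not Nomad's own code: the name `is_promotable` is hypothetical, and the lowercase status strings are assumed from the deployment JSON posted later in this thread.

```python
# Statuses in which a promote call is accepted, per the comment above.
# The lowercase spelling is an assumption based on the API output in this thread.
ACTIVE_STATUSES = {"running", "paused"}


def is_promotable(deployment: dict) -> bool:
    """Return True if the deployment's status still allows promotion."""
    return deployment.get("Status", "").lower() in ACTIVE_STATUSES
```

Any other status, such as "successful", would be treated as terminal by this check.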

sloan-dog commented 5 years ago

I'll try and capture that, thanks!

sloan-dog commented 5 years ago

@endocrimes Back from the great beyond...

For context of this output...

I'm now on Nomad 0.9.3 or thereabouts, and I use a bash script to update jobs. Essentially, we take the updated job HCL (changes range from a new container image to an increase or decrease in desired allocations), convert it to JSON, and submit it via the HTTP API.

We extract the eval ID from the job update response and poll the eval status endpoint until the deployment ID is available. Then we poll the deployment status endpoint until all allocs are healthy or any alloc is unhealthy, as determined by health checks. If that poll exits with healthy allocs, we call the promote endpoint with the deployment ID.
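The polling logic described above might look something like the following sketch. It is not the actual bash script: `poll_deployment`, its return values, and the injected `fetch` callable are hypothetical, and treating `DesiredCanaries` (falling back to `DesiredTotal`) as the healthy target is an assumption. The field names are those in the deployment JSON shown in this comment. Unlike the flow described, it also stops when the status is no longer running or paused.

```python
import time
from typing import Callable


def poll_deployment(fetch: Callable[[], dict], interval: float = 5.0,
                    max_polls: int = 60, sleep=time.sleep) -> str:
    """Poll a deployment until a decision can be made.

    fetch returns the parsed deployment JSON on each call.
    Returns "promote", "unhealthy", "terminal", or "timeout".
    """
    for _ in range(max_polls):
        dep = fetch()
        # A deployment that is neither running nor paused cannot be promoted.
        if dep["Status"] not in ("running", "paused"):
            return "terminal"
        groups = list(dep["TaskGroups"].values())
        if any(g["UnhealthyAllocs"] > 0 for g in groups):
            return "unhealthy"
        # Assumed target: canary count if canaries are desired, else total.
        if all(g["HealthyAllocs"] >= (g["DesiredCanaries"] or g["DesiredTotal"])
               for g in groups):
            return "promote"
        sleep(interval)
    return "timeout"
```

Passing `fetch` in as a callable keeps the decision logic separate from the HTTP call, so it can be exercised against recorded states like the ones below.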

Based on the output below (the only change tested here was the alloc count), it appears that either a new deployment is not actually being created or the existing deployment is being mutated. I do not know which.

Apologies for not pretty printing this output

Job spec json: 
{ "Job": { "Stop": null, "Region": null, "Namespace": null, "ID": "unstable-migration-runner", "ParentID": null, "Name": "unstable-migration-runner", "Type": "service", "Priority": null, "AllAtOnce": null, "Datacenters": [ "dc1" ], "Constraints": null, "Affinities": null, "TaskGroups": [ { "Name": "unstable-migration-runner", "Count": 2, "Constraints": null, "Affinities": null, "Tasks": [ { "Name": "unstable-migration-runner", "Driver": "docker", "User": "", "Config": { "args": [ "***REDACTED***" ], "command": "***REDACTED***", "image": "***REDACTED***" }, "Constraints": null, "Affinities": null, "Env": { ***REDACTED*** }, "Services": [ { "Id": "", "Name": "unstable-migration-runner", "Tags": [ "***REDACTED***" ], "CanaryTags": null, "PortLabel": "http", "AddressMode": "", "Checks": [ { "Id": "", "Name": "prod-migration-runner-alive", "Type": "script", "Command": "**REDACTED***", "Args": null, "Path": "", "Protocol": "", "PortLabel": "", "AddressMode": "", "Interval": 1000000000, "Timeout": 10000000000, "InitialStatus": "", "TLSSkipVerify": false, "Header": null, "Method": "", "CheckRestart": null, "GRPCService": "", "GRPCUseTLS": false } ], "CheckRestart": null } ], "Resources": { "CPU": 256, "MemoryMB": 252, "DiskMB": null, "Networks": [ { "Device": "", "CIDR": "", "IP": "", "MBits": null, "ReservedPorts": null, "DynamicPorts": [ { "Label": "http", "Value": 0 } ] } ], "Devices": null, "IOPS": null }, "Meta": null, "KillTimeout": null, "LogConfig": null, "Artifacts": null, "Vault": null, "Templates": [ { "SourcePath": null, "DestPath": "***REDACTED***", "EmbeddedTmpl": "***REDACTED***", "ChangeMode": "restart", "ChangeSignal": null, "Splay": 5000000000, "Perms": "0644", "LeftDelim": null, "RightDelim": null, "Envvars": true, "VaultGrace": null } ], "DispatchPayload": null, "Leader": false, "ShutdownDelay": 0, "KillSignal": "" } ], "Spreads": null, "RestartPolicy": { "Interval": null, "Attempts": null, "Delay": null, "Mode": "delay" }, "ReschedulePolicy": null, 
"EphemeralDisk": null, "Update": { "Stagger": null, "MaxParallel": 1, "HealthCheck": "checks", "MinHealthyTime": 5000000000, "HealthyDeadline": 60000000000, "ProgressDeadline": 120000000000, "Canary": 1, "AutoRevert": true, "AutoPromote": null }, "Migrate": null, "Meta": null } ], "Update": null, "Spreads": null, "Periodic": null, "ParameterizedJob": null, "Dispatched": false, "Payload": null, "Reschedule": null, "Migrate": null, "Meta": null, "VaultToken": null, "Status": null, "StatusDescription": null, "Stable": null, "Version": null, "SubmitTime": null, "CreateIndex": null, "ModifyIndex": null, "JobModifyIndex": null } }

Job update output: 
{"EvalID":"3662930c-5231-32a2-6ad5-43553f36049a","EvalCreateIndex":386446,"JobModifyIndex":386445,"Warnings":"","Index":386446,"LastContact":0,"KnownLeader":false}

Eval:
{"ID":"3662930c-5231-32a2-6ad5-43553f36049a","Namespace":"default","Priority":50,"Type":"service","TriggeredBy":"job-register","JobID":"unstable-migration-runner","JobModifyIndex":386445,"DeploymentID":"fcc90ed5-9e95-9d47-cbf8-cca40ba81137","Status":"complete","WaitUntil":"0001-01-01T00:00:00Z","QueuedAllocations":{"unstable-migration-runner":0},"SnapshotIndex":386446,"CreateIndex":386446,"ModifyIndex":386448}

deployment state: 
{ "ID": "fcc90ed5-9e95-9d47-cbf8-cca40ba81137", "Namespace": "default", "JobID": "unstable-migration-runner", "JobVersion": 283, "JobModifyIndex": 386445, "JobSpecModifyIndex": 386445, "JobCreateIndex": 111, "TaskGroups": { "unstable-migration-runner": { "AutoRevert": true, "AutoPromote": false, "ProgressDeadline": 120000000000, "RequireProgressBy": "2019-07-30T20:33:48.477419326Z", "Promoted": false, "PlacedCanaries": null, "DesiredCanaries": 0, "DesiredTotal": 2, "PlacedAllocs": 2, "HealthyAllocs": 0, "UnhealthyAllocs": 0 } }, "Status": "running", "StatusDescription": "Deployment is running", "CreateIndex": 386447, "ModifyIndex": 386447 }

...5 seconds later
deployment state: 
{ "ID": "fcc90ed5-9e95-9d47-cbf8-cca40ba81137", "Namespace": "default", "JobID": "unstable-migration-runner", "JobVersion": 283, "JobModifyIndex": 386445, "JobSpecModifyIndex": 386445, "JobCreateIndex": 111, "TaskGroups": { "unstable-migration-runner": { "AutoRevert": true, "AutoPromote": false, "ProgressDeadline": 120000000000, "RequireProgressBy": "2019-07-30T20:33:53.993524754Z", "Promoted": false, "PlacedCanaries": null, "DesiredCanaries": 0, "DesiredTotal": 2, "PlacedAllocs": 2, "HealthyAllocs": 1, "UnhealthyAllocs": 0 } }, "Status": "running", "StatusDescription": "Deployment is running", "CreateIndex": 386447, "ModifyIndex": 386451 }

...5 seconds later
deployment state: 
{ "ID": "fcc90ed5-9e95-9d47-cbf8-cca40ba81137", "Namespace": "default", "JobID": "unstable-migration-runner", "JobVersion": 283, "JobModifyIndex": 386445, "JobSpecModifyIndex": 386445, "JobCreateIndex": 111, "TaskGroups": { "unstable-migration-runner": { "AutoRevert": true, "AutoPromote": false, "ProgressDeadline": 120000000000, "RequireProgressBy": "2019-07-30T20:33:53.993524754Z", "Promoted": false, "PlacedCanaries": null, "DesiredCanaries": 0, "DesiredTotal": 2, "PlacedAllocs": 2, "HealthyAllocs": 1, "UnhealthyAllocs": 0 } }, "Status": "running", "StatusDescription": "Deployment is running", "CreateIndex": 386447, "ModifyIndex": 386451 }

...5 seconds later
deployment state: 
{ "ID": "fcc90ed5-9e95-9d47-cbf8-cca40ba81137", "Namespace": "default", "JobID": "unstable-migration-runner", "JobVersion": 283, "JobModifyIndex": 386445, "JobSpecModifyIndex": 386445, "JobCreateIndex": 111, "TaskGroups": { "unstable-migration-runner": { "AutoRevert": true, "AutoPromote": false, "ProgressDeadline": 120000000000, "RequireProgressBy": "2019-07-30T20:33:53.993524754Z", "Promoted": false, "PlacedCanaries": null, "DesiredCanaries": 0, "DesiredTotal": 2, "PlacedAllocs": 2, "HealthyAllocs": 1, "UnhealthyAllocs": 0 } }, "Status": "running", "StatusDescription": "Deployment is running", "CreateIndex": 386447, "ModifyIndex": 386451 }

...5 seconds later
deployment state:
{ "ID": "fcc90ed5-9e95-9d47-cbf8-cca40ba81137", "Namespace": "default", "JobID": "unstable-migration-runner", "JobVersion": 283, "JobModifyIndex": 386445, "JobSpecModifyIndex": 386445, "JobCreateIndex": 111, "TaskGroups": { "unstable-migration-runner": { "AutoRevert": true, "AutoPromote": false, "ProgressDeadline": 120000000000, "RequireProgressBy": "2019-07-30T20:33:53.993524754Z", "Promoted": false, "PlacedCanaries": null, "DesiredCanaries": 0, "DesiredTotal": 2, "PlacedAllocs": 2, "HealthyAllocs": 1, "UnhealthyAllocs": 0 } }, "Status": "running", "StatusDescription": "Deployment is running", "CreateIndex": 386447, "ModifyIndex": 386451 }

...5 seconds later :)
deployment state:
{ "ID": "fcc90ed5-9e95-9d47-cbf8-cca40ba81137", "Namespace": "default", "JobID": "unstable-migration-runner", "JobVersion": 283, "JobModifyIndex": 386445, "JobSpecModifyIndex": 386445, "JobCreateIndex": 111, "TaskGroups": { "unstable-migration-runner": { "AutoRevert": true, "AutoPromote": false, "ProgressDeadline": 120000000000, "RequireProgressBy": "2019-07-30T20:34:14.493525392Z", "Promoted": false, "PlacedCanaries": null, "DesiredCanaries": 0, "DesiredTotal": 2, "PlacedAllocs": 2, "HealthyAllocs": 2, "UnhealthyAllocs": 0 } }, "Status": "successful", "StatusDescription": "Deployment completed successfully", "CreateIndex": 386447, "ModifyIndex": 386457 }

All allocs healthy. Promoting deployment: fcc90ed5-9e95-9d47-cbf8-cca40ba81137

Promote payload
{ "DeploymentID": "fcc90ed5-9e95-9d47-cbf8-cca40ba81137", "All": true }

Promote result:
rpc error: can't promote terminal deployment
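Every captured state above reports "DesiredCanaries": 0, and the last one already has "Status": "successful" before promote is called. One possible reading, not confirmed in this thread, is that this update placed no canaries and so completed on its own. A script could guard against that case with a check like this sketch, where `has_canaries_to_promote` is a hypothetical helper and only fields visible in the output above are used.

```python
def has_canaries_to_promote(deployment: dict) -> bool:
    """Return True if any task group wants canaries and is not yet promoted."""
    groups = deployment.get("TaskGroups", {}).values()
    return any(
        g.get("DesiredCanaries", 0) > 0 and not g.get("Promoted", False)
        for g in groups
    )


# Trimmed copy of the final deployment state captured above.
final_state = {
    "Status": "successful",
    "TaskGroups": {
        "unstable-migration-runner": {
            "Promoted": False,
            "DesiredCanaries": 0,
            "DesiredTotal": 2,
            "HealthyAllocs": 2,
        }
    },
}
skip_promote = not has_canaries_to_promote(final_state)
```

For the captured state, `skip_promote` is true, so the script would not call the promote endpoint at all.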

PS - Thanks for letting this stay open so long. I always appreciate the HashiCorp community.