hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Replacement allocations during canary deployments are placed in the wrong datacenter #17651

Closed lgfa29 closed 1 year ago

lgfa29 commented 1 year ago

Nomad version

Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af70885c02ab921dedbdf6bc406a1e886866f80

But it likely happens in previous versions as well.

Operating system and Environment details

N/A

Issue

When a job with canary deployments changes its datacenters value and an allocation for the previous job version fails during the deployment, the replacement allocation is placed on a node matching the new datacenters value instead of the original one.

Reproduction steps

  1. Start a Nomad cluster with clients in two datacenters. You can start a nomad agent -dev and run the job below to create some extra clients (example commands follow the job file).
    Nomad clients jobfile
locals {
  # Adjust to the appropriate path.
  nomad_path = "/opt/hashicorp/nomad/1.5.6/nomad"

  client_config = <<EOF
data_dir   = "{{env "NOMAD_TASK_DIR"}}/data"
name       = "%s"
datacenter = "%s"

client {
  enabled = true

  server_join {
    retry_join = ["127.0.0.1"]
  }
}

server {
  enabled = false
}

ports {
  http = "46%d6"
  rpc  = "46%[3]d7"
  serf = "46%[3]d8"
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}
EOF
}

job "nomad" {
  group "clients" {
    task "client-dc2" {
      driver = "raw_exec"

      config {
        command = local.nomad_path
        args    = ["agent", "-config", "local/config.hcl"]
      }

      template {
        data        = format(local.client_config, "client-dc2", "dc2", 5)
        destination = "local/config.hcl"
      }
    }

    task "client-dc3" {
      driver = "raw_exec"

      config {
        command = local.nomad_path
        args    = ["agent", "-config", "local/config.hcl"]
      }

      template {
        data        = format(local.client_config, "client-dc3", "dc3", 6)
        destination = "local/config.hcl"
      }
    }
  }
}
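
For reference, a minimal way to bring this cluster up could look like the following, where clients.nomad.hcl is a hypothetical file name for the job above:

# Start the dev agent, which acts as both server and client in the default datacenter (dc1).
$ nomad agent -dev &

# Register the job above to create the extra clients in dc2 and dc3.
$ nomad job run clients.nomad.hcl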

  2. Run the sample job below. One of the allocations will keep failing.
  3. Update the job's datacenters value.

    job "sleep" {
    - datacenters = ["dc2"]
    + datacenters = ["dc3"]
  4. Run the job again and monitor its allocations (the commands after the output below can confirm which datacenter the replacement landed in).
    $ watch -n1 nomad job allocs sleep
    ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
    5db9c92b  74384522  sleep       0        run      pending  11s ago    11s ago
    191fcde0  74384522  sleep       1        run      running  58s ago    47s ago
    605c2ac1  a8201b46  sleep       0        run      running  1m16s ago  1m5s ago
    beb6b693  a8201b46  sleep       0        stop     failed   1m16s ago  11s ago
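
To confirm which datacenter the replacement was placed in, one option is to cross-reference the allocation's node against the node list, where the DC column shows each client's datacenter (the allocation ID below is taken from the output above):

    # Show which node the replacement allocation was placed on.
    $ nomad alloc status 5db9c92b | grep -i node

    # List all clients with their datacenters (DC column).
    $ nomad node status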

Expected Result

The replacement allocation is created on the same node and in the same datacenter as the original allocation.

Actual Result

The replacement allocation is created on a node in the new datacenter.

Job file (if appropriate)

job "sleep" {
  datacenters = ["dc2"]

  update {
    max_parallel = 1
    canary       = 1
    auto_revert  = true
  }

  group "sleep" {
    count = 2
    task "sleep" {
      driver = "raw_exec"

      config {
        command = "/bin/bash"
        args    = ["${NOMAD_TASK_DIR}/script.sh"]
      }

      template {
        data        = <<EOF
while true; do
  if [ "$NOMAD_ALLOC_INDEX" -gt "0" ]; then
    echo "Boom"
    exit 1
  fi
  sleep 3
done
EOF
        destination = "${NOMAD_TASK_DIR}/script.sh"
      }
    }
  }
}
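
A note on the script template above: it exits with an error whenever NOMAD_ALLOC_INDEX is greater than 0, so the second allocation fails repeatedly while the first keeps running. One way to watch the resulting canary deployment for the sleep job is:

# List deployments for the job; the most recent one shows the canary rollout.
$ nomad job deployments sleep

# Inspect a specific deployment, using an ID taken from the list above.
$ nomad deployment status <deployment-id>
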
tgross commented 1 year ago

@lgfa29 should this be closed by #17598, #17654, #17653, and #17652?

lgfa29 commented 1 year ago

Yes, sorry. I had a typo in the Fixes keyword 😅