hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Drain allocation status is ignored when draining at capacity. #16117

Open stswidwinski opened 1 year ago

stswidwinski commented 1 year ago

Nomad version

Nomad v1.4.2 (039d70eeef5888164cad05e7ddfd8b6f8220923b)

However, this repros on v1.4.3 as well.

Operating system and Environment details

These do not matter. Unix/Linux.

Issue

When a cluster is running at capacity, draining a node that has service allocations running on it creates an evaluation which stays Pending. As soon as more capacity is added, this Pending evaluation is solved immediately, resulting in multiple allocations running for a single job, especially when kill timeouts are large.

Under normal circumstances we expect the allocation being drained to block the creation of any new allocation until it reaches a terminal state.

Reproduction steps

Let us begin with the local setup. We will want two clients and one server. The first server and client are created using the usual, boring setup. Note, however, that we set the max kill timeout to something considerable, such as an hour:

# Install nomad locally...
$ yumdownloader nomad-1.4.2-1
$ rpm2cpio  nomad-1.4.2-1.x86_64.rpm  | cpio -idmv --no-absolute-filenames

# Observe the configuration of the server-client combo
$ cat client_server.conf
{
  "datacenter": "dc1",
  "region": "global",
  "data_dir": "/tmp/nomad/usr/bin/server_client",
  "log_level": "DEBUG",
  "client": {
    "enabled": true,
    "max_kill_timeout": "1h",
  }
}

# Run the server-client
$ ./usr/bin/nomad agent -dev -config ./client_server.conf 

The second client is set up analogously, but we cannot use nomad agent -dev as easily. To avoid port conflicts we do:

# Look at the config of the second client
$ cat client_2.conf
{
  "datacenter": "dc1",
  "region": "global",
  "ports": {
    "http": "9876",
    "rpc": "9875",
    "serf": "9874"
  },
  "data_dir": "/tmp/nomad/usr/bin/client_2",
  "log_level": "DEBUG",
  "log_file": "/tmp/nomad_logs",
  "client": {
    "enabled": true,
    "servers": ["localhost"],
    "max_kill_timeout": "1h",
    "options": {
      "driver.allowlist" : "raw_exec"
    }
  },
  "plugin": {
    "raw_exec": {
      "args": [],
      "config": {
        "enabled": true,
        "no_cgroups": true 
      }
    }
  }
}

# Start the second client
$ ./usr/bin/nomad agent -client -config ./client_2.conf

After this setup we have two nodes with raw-exec enabled. Just as a sanity check:

$ ./usr/bin/nomad node status
ID        DC   Name             Class   Drain  Eligibility  Status
23a04c0f  dc1  <redacted>  <none>  false  eligible     ready
bad543f9  dc1  <redacted> <none>  false  eligible     ready

Then, start a job:

# I use pure Bash here, but the intent is the following:
# 
# A single task group, a single task. The task takes a long time to shut down and we allow it
# this time as legitimate. In this example we ignore SIGINT and SIGTERM so we will simply
# ignore Nomad attempting to stop us.
$ cat job.hcl
job "test_job" {
  region = "global"
  datacenters = ["dc1"]
  type = "service"

  group "one_and_only" {
    # Specify the number of these tasks we want.
    count = 1

    task "test" {
      driver = "raw_exec"

      kill_timeout = "1h"

      config {
        command = "bash"
        args = [ "-c", "trap \"\" INT TERM && /usr/bin/sleep inf"]
      }
    }
  }
}

# Run the job and observe that everything is OK
$ ./nomad job run  ./job.hcl 
==> 2023-02-09T14:20:25-05:00: Monitoring evaluation "b2f734d9"
    2023-02-09T14:20:25-05:00: Evaluation triggered by job "test_job"
    2023-02-09T14:20:25-05:00: Allocation "a6885833" created: node "23a04c0f", group "one_and_only"
    2023-02-09T14:20:26-05:00: Evaluation within deployment: "f4784227"
    2023-02-09T14:20:26-05:00: Allocation "a6885833" status changed: "pending" -> "running" (Tasks are running)
    2023-02-09T14:20:26-05:00: Evaluation status changed: "pending" -> "complete"
==> 2023-02-09T14:20:26-05:00: Evaluation "b2f734d9" finished with status "complete"
==> 2023-02-09T14:20:26-05:00: Monitoring deployment "f4784227"
  ✓ Deployment "f4784227" successful

    2023-02-09T14:20:36-05:00
    ID          = f4784227
    Job ID      = test_job
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group    Desired  Placed  Healthy  Unhealthy  Progress Deadline
    one_and_only  1        1       1        0          2023-02-09T14:30:35-05:00

# Check the status of the job
$ ./nomad job status test_job
ID            = test_job
Name          = test_job
Submit Date   = 2023-02-09T14:20:25-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost  Unknown
one_and_only  0       0         1        0       0         0     0

Latest Deployment
ID          = f4784227
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group    Desired  Placed  Healthy  Unhealthy  Progress Deadline
one_and_only  1        1       1        0          2023-02-09T14:30:35-05:00

Allocations
ID        Node ID   Task Group    Version  Desired  Status   Created  Modified
a6885833  23a04c0f  one_and_only  0        run      running  35s ago  25s ago
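
As an optional sanity check (not part of the original repro), we can confirm that the job was registered with the intended 1h kill timeout; the client-side cap is max_kill_timeout, which we set to 1h above, so the full timeout applies. Assuming jq is available and the job API JSON layout, something along these lines should print the value in nanoseconds (1h = 3600000000000 ns):

# Inspect the registered job and extract the task's kill timeout
$ ./nomad job inspect test_job | jq '.Job.TaskGroups[0].Tasks[0].KillTimeout'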

Now, let us drain the node that has no allocations on it, making it unavailable. We want to simulate the situation in which we are running at full capacity:

$ ./nomad node status
ID        DC   Name             Class   Drain  Eligibility  Status
23a04c0f  dc1  <redacted>  <none>  false  eligible     ready
bad543f9  dc1  <redacted>  <none>  false  eligible     ready

$ ./nomad node drain -force -enable bad543f9
Are you sure you want to enable drain mode for node "bad543f9-4440-5208-48d3-3da8f168c3d7"? [y/N] y
2023-02-09T14:21:55-05:00: Ctrl-C to stop monitoring: will not cancel the node drain
2023-02-09T14:21:55-05:00: Node "bad543f9-4440-5208-48d3-3da8f168c3d7" drain strategy set
2023-02-09T14:21:55-05:00: All allocations on node "bad543f9-4440-5208-48d3-3da8f168c3d7" have stopped
2023-02-09T14:21:55-05:00: Drain complete for node bad543f9-4440-5208-48d3-3da8f168c3d7

The job continued to run just fine. Now, let us drain the node on which the job is currently running and inspect the state of allocations:

# Drain the node on which allocations are running.
# 
# NOTE: This command will hang. That is fine; we expect it to hang until everything has shut
# down, which won't happen in this case. CTRL+c simply gets us the terminal back, but if you
# insist, another terminal can be used for this purpose.
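# (Alternatively, the -detach flag should return immediately instead of monitoring
# the drain; it is not used in this repro.)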
$ ./nomad node drain -force -enable 23a04c0f
Are you sure you want to enable drain mode for node "23a04c0f-0603-4c3c-fe52-804f1f19e73e"? [y/N] y
2023-02-09T14:22:50-05:00: Ctrl-C to stop monitoring: will not cancel the node drain
2023-02-09T14:22:50-05:00: Node "23a04c0f-0603-4c3c-fe52-804f1f19e73e" drain strategy set
2023-02-09T14:22:51-05:00: Alloc "a6885833-97f0-411c-7235-d9f432a5ff42" marked for migration
2023-02-09T14:22:51-05:00: Drain complete for node 23a04c0f-0603-4c3c-fe52-804f1f19e73e
2023-02-09T14:22:51-05:00: Alloc "a6885833-97f0-411c-7235-d9f432a5ff42" draining

# The job continues to run
$ ./nomad job status test_job
ID            = test_job
Name          = test_job
Submit Date   = 2023-02-09T14:20:25-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost  Unknown
one_and_only  1       0         1        0       0         0     0

Placement Failure
Task Group "one_and_only":
  * No nodes were eligible for evaluation
  * No nodes are available in datacenter "dc1"

Latest Deployment
ID          = f4784227
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group    Desired  Placed  Healthy  Unhealthy  Progress Deadline
one_and_only  1        1       1        0          2023-02-09T14:30:35-05:00

Allocations
ID        Node ID   Task Group    Version  Desired  Status   Created    Modified
a6885833  23a04c0f  one_and_only  0        stop     running  4m12s ago  1m45s ago

# There exists an evaluation which is pending.
$ ./nomad job status -evals
ID        Type     Priority  Status   Submit Date
test_job  service  50        pending  2023-02-09T14:20:25-05:00
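
As a side note (not output from the original repro), recent Nomad versions can list the evaluation directly, assuming the eval list subcommand is available:

# The evaluation created by the drain should show up here as pending/blocked
$ ./nomad eval list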

Now, let us make the node that had nothing running on it eligible again.

$ ./nomad node eligibility -enable bad543f9
Node "bad543f9-4440-5208-48d3-3da8f168c3d7" scheduling eligibility set: eligible for scheduling

And to our surprise the job which should have just 1 allocation has... Two! Both running.

$ ./nomad job status -evals test_job
ID            = test_job
Name          = test_job
Submit Date   = 2023-02-09T14:20:25-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost  Unknown
one_and_only  0       0         2        0       0         0     0

Evaluations
ID        Priority  Triggered By        Status    Placement Failures
96fd9ec0  50        queued-allocs       complete  false
9205b53b  50        node-drain          complete  true
4505bdc9  50        queued-allocs       canceled  true
b2a1698b  50        node-drain          complete  true
01f179eb  50        deployment-watcher  complete  false
b2f734d9  50        job-register        complete  false

Latest Deployment
ID          = f4784227
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group    Desired  Placed  Healthy  Unhealthy  Progress Deadline
one_and_only  1        1       1        0          2023-02-09T14:30:35-05:00

Allocations
ID        Node ID   Task Group    Version  Desired  Status   Created    Modified
fa866f32  bad543f9  one_and_only  0        run      running  33s ago    23s ago
a6885833  23a04c0f  one_and_only  0        stop     running  7m20s ago  4m53s ago

Expected Result

The behavior should be consistent with regular drain behavior in which we do not schedule additional allocations until the last allocation is in a terminal state.

Actual Result

We schedule extra allocations and ignore the state of the old ones.

The logs don't contain much insight into what happened.

tgross commented 1 year ago

Hi @stswidwinski!

First some background: note that there's a difference between the DesiredStatus and the ClientStatus (these are the API field names; in the CLI headers they show up as Desired and Status). When the leader accepts a plan from a scheduler worker, it sets the DesiredStatus. The client then pulls allocation updates from the servers and sets the ClientStatus once it has finished applying that status to the allocation.
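
As a concrete illustration (an assumption about how to look this up, not part of the original report), both fields can be read from the allocation read API. With the default HTTP address and the allocation ID from the repro above, something like:

# Show only the two status fields of the allocation
$ curl -s http://localhost:4646/v1/allocation/a6885833-97f0-411c-7235-d9f432a5ff42 | jq '{DesiredStatus, ClientStatus}'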

For the case of shutdown_delay, that should be after the shutdown delay occurs. And this does seem to match up with what you're seeing:

Allocations
ID        Node ID   Task Group    Version  Desired  Status   Created    Modified
fa866f32  bad543f9  one_and_only  0        run      running  33s ago    23s ago
a6885833  23a04c0f  one_and_only  0        stop     running  7m20s ago  4m53s ago

But you'd expect the drainer to respect the ClientStatus. Unfortunately, it does not! This is reflected in known bugs #14293, #12915, and #9902. I have a partial PR already up to fix this in https://github.com/hashicorp/nomad/pull/14348 but never got around to finishing it. Fortunately I've got some time carved out in the next few weeks (probably March) to focus on documenting, diagnosing, and fixing some drain behaviors. I'll pick up this issue as part of that work and #14348 should do the job for you here.

stswidwinski commented 1 year ago

This makes sense. Thank you! Looking forward to the patch :)

tgross commented 1 year ago

This issue is fixed by https://github.com/hashicorp/nomad/pull/14348, which will ship in the next regular patch release of Nomad.

stswidwinski commented 1 year ago

I have just tested this against 1.5.5 and the bug as described still occurs in the same way. The repro remains the same, except now it's against 1.5.5 and not 1.4.3, which makes the logging output a little bit different.

@tgross, I think that your patch correctly changes the handling of stopping allocations in the case of non-blocked evaluations, but leaves the blocked-evaluation case in its old state. Do you mind taking another look?

tgross commented 1 year ago

Re-opening