hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Forcing a placement with failed deployment #3093

Open hynek opened 7 years ago

hynek commented 7 years ago

Nomad version

Nomad v0.6.0

Operating system and Environment details

Ubuntu Xenial running in LXD

Issue

So this is a bit more obscure than I initially thought.

We had a bit of a rough time (hilariously, Consul running in an LXD container hung up a whole bare-metal server) and I had to kill off a node in the middle of a deployment because it just hung while supposedly downloading a Docker container.

At that point it had already placed an alloc on another node; however, I can't get it to retry placing the second one. Both nomad run and nomad plan just pretend everything is fine.

I was able to fill back up the clients that we lost during an outage tonight, so this seems to be specific to the failed deployment?

I could only fix it by forcing a change in the plan (I just rebuilt the container).

Nomad Status

$ nomad status enduser
ID            = enduser
Name          = enduser
Submit Date   = 08/23/17 21:28:51 CEST
Type          = service
Priority      = 50
Datacenters   = scaleup
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
enduser     0       0         1        0       2         1

Latest Deployment
ID          = 6f7b3785
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
enduser     2        2       1        1

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
6c01a752  b77a68ed  enduser     3        stop     lost      08/24/17 10:11:16 CEST
e2693daa  60101861  enduser     3        run      running   08/24/17 10:11:16 CEST
db58bf99  c7caeb6b  enduser     3        run      complete  08/24/17 05:14:15 CEST
2f2c1d45  c7caeb6b  enduser     3        run      complete  08/23/17 21:28:52 CEST

Nomad plan

Job: "enduser"
Task Group: "enduser" (1 ignore)
  Task: "enduser"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 858133
To submit the job with version verification run:

nomad run -check-index 858133 _enduser.hcl

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Nomad run

$ nomad run _enduser.hcl
==> Monitoring evaluation "788e8161"
    Evaluation triggered by job "enduser"
    Evaluation within deployment: "6f7b3785"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "788e8161" finished with status "complete"

Nomad alloc-status for the broken alloc

$  nomad alloc-status 6c01a752
ID                  = 6c01a752
Eval ID             = bf04b0f5
Name                = enduser.enduser[1]
Node ID             = b77a68ed
Job ID              = enduser
Job Version         = 3
Client Status       = failed
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc is lost since its node is down
Created At          = 08/24/17 10:11:16 CEST
Deployment ID       = 6f7b3785
Deployment Health   = unhealthy

Task "enduser" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
500 MHz  128 MiB  300 MiB  0     http: 10.6.32.3:29213

Task Events:
Started At     = N/A
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                    Type        Description
08/24/17 10:56:55 CEST  Killing     Killing task: vault: failed to derive token: Can't request Vault token for terminal allocation
08/24/17 10:16:17 CEST  Driver      Downloading image docker.XXX/enduser:1136
08/24/17 10:11:17 CEST  Task Setup  Building Task Directory
08/24/17 10:11:17 CEST  Received    Task received by client

Nomad eval-status

nomad eval-status 788e8161
ID                 = 788e8161
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = job-register
Job ID             = enduser
Priority           = 50
Placement Failures = false

Job file (if appropriate)

job "enduser" {
  datacenters = ["scaleup"]

  update {
    max_parallel = 1
  }

  group "enduser" {
    count = 2

    task "enduser" {
      driver = "docker"

      config {
        image = "https://docker.XXX/enduser:1136"
        port_map = {
          http = 8000
        }
      }

      env {
        APP_ENV = "prod"
      }

      service {
        name = "enduser"
        port = "http"
        tags = [
          "env-prod",
        ]

        check {
          type     = "http"
          protocol = "https"
          path     = "/mail/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 500
        memory = 128

        network {
          mbits = 10

          port "http" {}
        }
      }

      vault {
        policies = ["enduser-prod"]
      }
    }
  }
}
dadgar commented 7 years ago

@hynek Yeah, this is an interesting one. Since the deployment has failed, the scheduler avoids placing new instances of the job because it assumes they would fail as well; it is trying to protect you from potentially taking out your job. But in this case you really want to tell the scheduler to do it anyway.

We potentially need a nomad run -force command to override this.
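
In the meantime, the only way to get the scheduler moving again is the workaround hynek describes above: force a change to the job spec so that a new job version, and with it a fresh deployment, is created. A minimal sketch against the job file above, using a hypothetical REDEPLOY environment variable purely as a version bump (any destructive change, such as the new image tag hynek pushed, works the same way):

      env {
        APP_ENV  = "prod"
        REDEPLOY = "2017-08-24"  # hypothetical marker: bump this value to force new allocations and a fresh deployment
      }

Re-submitting with nomad run _enduser.hcl then places replacement allocations under the new deployment instead of the failed one.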

urjitbhatia commented 6 years ago

@dadgar we need this too! This is what happened:

xeroc commented 2 years ago

Our task failed due to a broken connection to the underlying database, which left the allocation in a failed state. A nomad job run wouldn't let me bring it back (even after fixing the underlying database issue). I have to stop the job, wait for it to be stopped, and then rerun it.

I'd like to be able to restart the job without killing all tasks.
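
For reference, a sketch of that manual sequence with a placeholder job name; the gap between stopping and re-running is exactly where the downtime comes from:

$ nomad job stop my-service        # stop the job; its allocations begin shutting down
$ nomad job status my-service      # wait until the allocations report a terminal state
$ nomad job run my-service.nomad   # re-register the job, which starts a fresh deployment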

finwo commented 2 years ago

When Docker's storage is on a NAS that happens to freeze during a deployment, the deployment will fail (I wouldn't expect otherwise). After fixing the NAS I'd like to re-deploy without having to alter the job file, which is not possible right now.

In short: being able to restart a job's allocations without killing all tasks would prevent downtime when the issue originates from another source.

gregory112 commented 2 years ago

Is there even a way to do this currently? I tried all of the nomad deployment commands and they all refuse to act on a terminal deployment (can't resume a terminal deployment, etc.).

tgross commented 2 years ago

@gregory112 deployments that are complete won't ever get run again. Depending on your specific circumstance, the nomad alloc stop command may be able to help you out here by forcing a reschedule of a broken allocation.
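
For example, with a placeholder allocation ID (nomad job status lists a job's allocations, so you can find the failed one there):

$ nomad alloc stop <alloc-id>

Stopping the allocation creates a new evaluation, and the scheduler places a replacement without the job file having to change.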

gregory112 commented 2 years ago

I'll give a +1 for nomad job run -force then, as it would really help when numerous allocations fail, especially for jobs with more than one instance. We use a CI server to deploy most of our jobs, so manually interacting with allocations and stopping them is quite a chore.
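
Until something like -force exists, the per-allocation workaround can at least be scripted from CI. A hedged sketch using the HTTP API and jq, assuming NOMAD_ADDR points at a server and using the job name enduser purely for illustration:

# Find this job's allocations whose client status is "failed"
# and stop each one so the scheduler places a replacement.
for alloc in $(curl -s "$NOMAD_ADDR/v1/job/enduser/allocations" \
    | jq -r '.[] | select(.ClientStatus == "failed") | .ID'); do
  nomad alloc stop "$alloc"
done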

lattwood commented 1 year ago

If my understanding is correct, a deployment with auto_revert disabled, on a job spec that only reschedules (and doesn't restart), will on a long enough timeline end up with zero running tasks.

@dadgar -

Since the deployment has failed, the scheduler avoids placing new instances of the job because it assumes they would fail as well; it is trying to protect you from potentially taking out your job. But in this case you really want to tell the scheduler to do it anyway.

Is this called out in the docs anywhere? I just found out that this behaviour is the source of some long-running problems I'm experiencing, and I don't want anyone else to have the same issues.
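
For context, a sketch of the kind of group configuration being described: local restarts disabled so every failure turns into a reschedule, and auto_revert left off so a failed deployment is never rolled back (all values are illustrative, not taken from a real job):

  group "app" {
    restart {
      attempts = 0       # never restart in place
      mode     = "fail"  # fail the allocation so it is rescheduled instead
    }

    reschedule {
      unlimited      = true
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "1h"
    }

    update {
      max_parallel = 1
      auto_revert  = false  # leave a failed deployment in place rather than rolling back
    }
  }

With that combination, every failure becomes a reschedule attempt, and once the deployment itself is marked failed the scheduler stops placing replacements (per the explanation earlier in this thread), which is how the running count can drift toward zero.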

tgross commented 1 year ago

@lattwood as it turns out, we were just talking about that internally, and we definitely want to put together a doc that ties together deployments and the reschedule, restart, and update blocks.