nvx opened this issue 4 years ago
Hi @nvx! Just for some additional information, by "unhealthy" do you mean the task had failed and was not rescheduled, or do you mean "is not passing health checks"?
Not passing health checks
Was there any other information you needed @tgross? I notice it still has the waiting-reply label.
> Was there any other information you needed @tgross?
Nope, that helps!
So it looks like the behavior you're seeing here isn't totally unexpected, but I can see how it can be painful when jobs take a long time to become healthy.
> In my instance the healthy alloc was old while the unhealthy alloc was fairly new; I'm not sure whether this was a coincidence or whether the logic prefers to kill older allocs.
There's a bunch of logic we do around allocations that might be part of a running deploy, but the bit that probably happened here is in scheduler/reconcile.go#L815-L831: we pick the "highest index" allocation to stop first. That's the last allocation of a given deploy, but we reuse the indexes on rescheduling so that NOMAD_ALLOC_INDEX can be interpolated. So in this situation the rescheduled (unhealthy) allocation had reused the lower index, leaving the older healthy allocation with the highest index, which is why it was the one stopped.
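To make the index-reuse behavior concrete, here is a minimal sketch (not Nomad's actual code; the `alloc` struct and `pickStop` function are simplified stand-ins) of the "highest index stops first" rule described above:

```go
package main

import (
	"fmt"
	"sort"
)

// alloc is a toy stand-in for Nomad's allocation; only the fields
// relevant to this discussion are modeled (hypothetical names).
type alloc struct {
	Index   int
	Healthy bool
}

// pickStop mirrors the behavior described above: when the group count
// shrinks, the allocation with the highest index is chosen to stop,
// regardless of health.
func pickStop(allocs []alloc) alloc {
	sort.Slice(allocs, func(i, j int) bool {
		return allocs[i].Index > allocs[j].Index
	})
	return allocs[0]
}

func main() {
	// The replacement (unhealthy) alloc reused index 0 of the lost alloc,
	// so the healthy alloc at index 1 is the one picked to stop.
	allocs := []alloc{{Index: 0, Healthy: false}, {Index: 1, Healthy: true}}
	fmt.Println(pickStop(allocs).Index) // prints 1
}
```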
You may be able to work around this kind of issue by using an update block with a canary. That would make sure you add an allocation for the new version of the job before shutting down the old one. You'll also want to set a high healthy_deadline to give deployments time to finish.
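A sketch of what that workaround might look like in the job spec (the group name and durations here are placeholders; note that healthy_deadline must stay below progress_deadline):

```hcl
group "example" {
  count = 2

  update {
    # start a canary alloc for the new version before stopping old ones
    canary            = 1
    max_parallel      = 1
    # allow slow-starting allocs (tens of minutes) to pass health checks
    healthy_deadline  = "30m"
    progress_deadline = "1h"
  }
}
```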
> I notice it still has the waiting-reply label.
Sorry, we removed the automation that would remove the waiting-reply label because it was overly eager about closing issues. 😀
The job actually did have an update block with canary specified, but because it was a lost alloc, not an update, it didn't come into play.
Oh right, if only the count was changed, canaries wouldn't come into play at all.
Ok. I'm going to mark this as an enhancement for future discussion, and I've changed the title to match.
### Nomad version
Nomad v0.11.1 (b43457070037800fcc8442c8ff095ff4005dab33)
### Operating system and Environment details
RHEL 7 x64
### Issue
If a running job with count = 2 specified on a group has one healthy alloc and one unhealthy alloc, changing count to 1 and resubmitting the job killed the one healthy alloc and left the unhealthy one. In my instance the healthy alloc was old while the unhealthy alloc was fairly new; I'm not sure whether this was a coincidence or whether the logic prefers to kill older allocs.
In my mind, it would make sense when reducing count to prefer to kill unhealthy allocs first.
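To illustrate the suggested enhancement, here is a minimal sketch (again not Nomad's actual code; `alloc` and `pickStopPreferUnhealthy` are hypothetical names) of stopping unhealthy allocations first when reducing count, falling back to the existing highest-index rule as a tie-breaker:

```go
package main

import (
	"fmt"
	"sort"
)

// alloc is a toy stand-in for Nomad's allocation; only the fields
// relevant to this discussion are modeled (hypothetical names).
type alloc struct {
	Index   int
	Healthy bool
}

// pickStopPreferUnhealthy sketches the proposed behavior: when scaling
// down, stop unhealthy allocations before healthy ones, and only use
// the highest index to break ties among allocs with equal health.
func pickStopPreferUnhealthy(allocs []alloc) alloc {
	sort.Slice(allocs, func(i, j int) bool {
		if allocs[i].Healthy != allocs[j].Healthy {
			return !allocs[i].Healthy // unhealthy allocs sort first
		}
		return allocs[i].Index > allocs[j].Index
	})
	return allocs[0]
}

func main() {
	// Same scenario as the issue: the newer, unhealthy alloc at index 0
	// is now the one chosen to stop, preserving the healthy alloc.
	allocs := []alloc{{Index: 0, Healthy: false}, {Index: 1, Healthy: true}}
	fmt.Println(pickStopPreferUnhealthy(allocs).Index) // prints 0
}
```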
### Reproduction steps
1. Start a job with count = 2 on a group.
2. Contrive to make one of the allocs be marked as unhealthy. In my case, the allocs are expected to take a while to start up (tens of minutes), and an old alloc was lost due to the failure of a nomad agent, resulting in a replacement alloc being scheduled that had not yet successfully started.
3. Reduce the count on the group to 1.
4. Observe that the healthy alloc may be the one killed instead of the unhealthy alloc.