hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

stop unhealthy allocs first when reducing count #8586

Open · nvx opened this issue 4 years ago

nvx commented 4 years ago

Nomad version

Nomad v0.11.1 (b43457070037800fcc8442c8ff095ff4005dab33)

Operating system and Environment details

RHEL 7 x64

Issue

If a running job with count = 2 on a group has one healthy alloc and one unhealthy alloc, changing count to 1 and resubmitting the job killed the healthy alloc and left the unhealthy one. In my instance the healthy alloc was old while the unhealthy alloc was fairly new; I'm not sure if this was coincidence or if the logic prefers to kill older allocs.

In my mind, it would make sense to prefer killing unhealthy allocs first when reducing count.

Reproduction steps

1. Start a job with count = 2 on a group.
2. Contrive to make one of the allocs be marked as unhealthy. In my case, the allocs are expected to take a while to start up (tens of minutes); an old alloc was lost due to the failure of a Nomad agent, and the replacement alloc that was scheduled had not yet successfully started.
3. Reduce the count on the group to 1.
4. Observe that the healthy alloc may be the one killed instead of the unhealthy alloc.
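A minimal job along these lines might look like the sketch below. The job, task, and check details are hypothetical and not taken from the original report; the only parts that matter are the group count and a health check that takes a long time to start passing.

```hcl
# Hypothetical minimal job: a group with count = 2 whose task registers a
# service health check that can take a long time to start passing.
job "slow-service" {
  datacenters = ["dc1"]

  group "app" {
    count = 2

    task "server" {
      driver = "exec"

      config {
        command = "/usr/local/bin/slow-server"
      }

      service {
        name = "slow-service"
        port = "http"

        check {
          type     = "http"
          path     = "/health"
          interval = "30s"
          timeout  = "5s"
        }
      }

      resources {
        network {
          port "http" {}
        }
      }
    }
  }
}
```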

tgross commented 4 years ago

Hi @nvx! Just for some additional information, by "unhealthy" do you mean the task had failed and was not rescheduled, or do you mean "is not passing health checks"?

nvx commented 4 years ago

Not passing health checks

nvx commented 4 years ago

Was there any other information you needed @tgross? I notice it still has the waiting-reply label.

tgross commented 4 years ago

> Was there any other information you needed @tgross?

Nope, that helps!

So the behavior you're seeing here isn't totally unexpected, but I can see how it can be painful when jobs take a long time to become healthy.

> In my instance the healthy alloc was old while the unhealthy alloc was fairly new; I'm not sure if this was coincidence or if the logic prefers to kill older allocs.

There's a bunch of logic we do around allocations that might be part of a running deployment, but the bit that probably happened here is that in scheduler/reconcile.go#L815-L831 we pick the "highest index" allocation to stop first. That's normally the last allocation of a given deploy, but we reuse the indexes on rescheduling so that NOMAD_ALLOC_INDEX can be interpolated. So in this situation the old healthy alloc happened to carry the highest index, and it was the one picked to stop.
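For context on why the indexes are reused rather than renumbered: job specs can interpolate the allocation index, so a rescheduled replacement keeps the index of the alloc it replaces. A hypothetical snippet of that interpolation (not from the job in this issue):

```hcl
# Hypothetical task snippet: a rescheduled replacement keeps the same
# NOMAD_ALLOC_INDEX as the allocation it replaces, so anything interpolated
# from it stays stable across reschedules.
task "server" {
  driver = "exec"

  config {
    command = "/usr/local/bin/slow-server"
  }

  env {
    # give each member of the group a stable identity
    SHARD_ID = "${NOMAD_ALLOC_INDEX}"
  }
}
```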

You may be able to work around this kind of issue by using an update block with a canary. That makes sure you add an allocation for the new version of the job before shutting down the old one. You'll also want to set a high healthy_deadline to give deployments time to finish.
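A sketch of that workaround, building on the earlier hypothetical job; the durations are placeholders to be tuned to the actual startup time, not recommended values:

```hcl
# Sketch of the suggested workaround: deploy a canary for the new job
# version before stopping old allocations, and allow plenty of time for
# allocations to become healthy. Durations are placeholder values.
job "slow-service" {
  datacenters = ["dc1"]

  update {
    canary            = 1
    max_parallel      = 1
    healthy_deadline  = "45m"
    # progress_deadline must be longer than healthy_deadline
    progress_deadline = "1h"
  }

  group "app" {
    count = 2
    # ... task definition as in the earlier sketch
  }
}
```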

> I notice it still has the waiting-reply label.

Sorry, we removed the automation that would remove the waiting-reply label because it was overly eager about closing issues. 😀

nvx commented 4 years ago

The job actually did have an update block with a canary specified, but because this was a lost alloc rather than a job update, it didn't come into play.

tgross commented 4 years ago

Oh right, if only the count was changed, canaries wouldn't come into play at all.

Ok. I'm going to mark this as an enhancement for future discussion, and I've changed the title to match.