Autoscaler doesn't account for starting instances while scaling and places new allocations even on pending deployment

Rutori commented 2 years ago

While scaling, the Autoscaler keeps placing allocations even after it reports an error, and it doesn't account for instances that are already starting. This can overflow clients with pending instances.

Reproduction steps

Create a service group that takes a long time to start
Make it scale faster than it starts
Set delivery_limit = 1 in the policy_eval config

Expected Result

Nomad accounts for allocations that have already been placed, including those that are still starting, and does not scale while there's already a deployment in progress.

Actual Result

Even though the Autoscaler correctly reports an error saying that there's already a deployment, the new allocations are still placed. The count returned by the strategy plugin is compared only to started allocations, so the Autoscaler may place more allocations than the configured maximum.
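
To make the expected accounting concrete, here is a minimal Go sketch of the check described above. It is only an illustration and not the Autoscaler's or Nomad's actual code: the `groupState` type, its fields, and the `placementsNeeded` function are all hypothetical names. It assumes the desired count should be compared against running plus pending allocations, capped at the configured maximum, and that nothing should be placed while a deployment is active. Scaling down is left out for brevity.

```go
package main

import "fmt"

// groupState is a hypothetical snapshot of a task group. The field names are
// illustrative and are not taken from the Nomad or Autoscaler code bases.
type groupState struct {
	Running          int  // allocations that have finished starting
	Pending          int  // allocations that are placed but still starting
	Max              int  // maximum count configured in the scaling policy
	DeploymentActive bool // true while a deployment is still in progress
}

// placementsNeeded returns how many new allocations should be placed for the
// count requested by a strategy. It counts pending allocations as well as
// running ones, and it places nothing while a deployment is active.
func placementsNeeded(s groupState, desired int) int {
	if s.DeploymentActive {
		return 0
	}
	// Never aim above the configured maximum.
	if desired > s.Max {
		desired = s.Max
	}
	// Allocations that are still starting already count toward the target.
	inFlight := s.Running + s.Pending
	if desired <= inFlight {
		return 0
	}
	return desired - inFlight
}

func main() {
	s := groupState{Running: 2, Pending: 3, Max: 6}
	fmt.Println(placementsNeeded(s, 8)) // desired is capped at 6, 5 are in flight: prints 1

	s.DeploymentActive = true
	fmt.Println(placementsNeeded(s, 8)) // deployment in progress: prints 0
}
```

Comparing only against `Running` in this sketch would give 4 new placements in the first case, for 9 allocations in total against a maximum of 6, which matches the over-placement described in this report.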

Rutori commented 2 years ago

It seems more like a Nomad problem (https://github.com/hashicorp/nomad/issues/11530), but it could also be happening because the Autoscaler applies several scaling actions at once: I wasn't able to reproduce the bug by calling the scaling API endpoint manually. I should also mention that my setup has some issues with its connection to the Nomad API. I see in the logs that sometimes a scaling action gets nak'd and is retried again and again, and when it finally goes through, that's usually when allocations start to duplicate.

lgfa29 commented 2 years ago

Hi @Rutori 👋

The Autoscaler would just call the Nomad API, so it's kind of strange that you're seeing different behaviours. I haven't been able to reproduce it locally, so do you have any logs from the Autoscaler that you could share?

hashicorp / nomad-autoscaler

Autoscaler doesn't account for starting instances while scaling and places new allocations even on pending deployment #540