hashicorp / nomad-autoscaler

Nomad Autoscaler brings autoscaling to your Nomad workloads.
Mozilla Public License 2.0

Autoscaler doesn't account for starting instances while scaling and places new allocations even on pending deployment #540

Open Rutori opened 2 years ago

Rutori commented 2 years ago

While scaling, the autoscaler keeps placing allocations even after it reports an error, and it doesn't account for instances that are already starting. This can overflow clients with pending instances.

Reproduction steps

  1. Create a service group that takes a long time to start
  2. Make it scale faster than it starts
  3. Set delivery_limit = 1 in the policy_eval config

Expected Result

Nomad accounts for allocations that have already been placed, including those that have already started, and does not scale when there is already a deployment.

Actual Result

Even though the autoscaler correctly reports an error that there is already a deployment, new allocations are still placed. The count returned by the strategy plugin is compared only to started allocations, so the autoscaler may place more allocations than the configured maximum.
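To make the expected behaviour concrete, here is a minimal sketch of the check I have in mind. It is not the autoscaler's actual code: `AllocCounts`, `InFlight`, and `NextCount` are hypothetical names made up for illustration, and the sketch assumes the running and pending allocation counts are already known.

```go
package main

// AllocCounts is a hypothetical summary of a task group's allocations.
// It is not a type from the autoscaler or the Nomad API.
type AllocCounts struct {
	Running int // allocations that have already started
	Pending int // allocations that are placed but still starting
}

// InFlight returns every allocation that already occupies client capacity,
// whether it has started or is still starting.
func (c AllocCounts) InFlight() int {
	return c.Running + c.Pending
}

// NextCount is a hypothetical helper. It returns the count that should be
// submitted and whether a scaling action should be issued at all.
func NextCount(desired, min, max int, current AllocCounts, deploymentActive bool) (int, bool) {
	// Never scale while a deployment is still in progress.
	if deploymentActive {
		return current.InFlight(), false
	}

	// Clamp the strategy's result to the policy bounds.
	if desired < min {
		desired = min
	}
	if desired > max {
		desired = max
	}

	// Compare against running plus pending allocations, not only running ones.
	if desired == current.InFlight() {
		return desired, false
	}
	return desired, true
}
```

With a maximum of 5, a strategy result of 8, 2 running and 3 pending allocations, `NextCount` returns 5 and `false`, so nothing new would be placed.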

Rutori commented 2 years ago

This looks more like a Nomad problem (https://github.com/hashicorp/nomad/issues/11530), but it could also be happening because the autoscaler applies several scaling actions at once: I wasn't able to reproduce the bug by calling the scaling API endpoint manually. I should also mention that my setup has some connection issues with the Nomad API. I see in the logs that a scaling action sometimes gets nak'd and is retried again and again, and when it finally goes through, that's usually when allocations start to duplicate.

lgfa29 commented 2 years ago

Hi @Rutori 👋

The Autoscaler would just call the Nomad API, so it's kind of strange that you're seeing different behaviours. I haven't been able to reproduce it locally, so do you have any logs from the Autoscaler that you could share?