hashicorp / nomad-autoscaler

Nomad Autoscaler brings autoscaling to your Nomad workloads.
Mozilla Public License 2.0

Autoscaler forever waiting for gce-mig "target is not ready" when cpu quota is exceeded #839

Open sam-sla opened 10 months ago

sam-sla commented 10 months ago

Hello! This is an edge case we got stuck in very recently. During some increased testing we apparently reached the default vCPU quota in our GCP project, but because we didn't have alerts configured yet it went unnoticed for a while, until we realized that the Nomad Autoscaler wasn't scaling down nodes in the cluster.

The scenario:

  1. The gce-mig target tries to scale up, but the vCPU quota in the project has been reached, so the instance group reports errors of type QUOTA_EXCEEDED: dev-nomad-client-zpgq europe-west1-b Creating Jan 25, 2024, 3:38:35 PM UTC+01:00 Instance 'dev-nomad-client-zpgq' creation failed: Quota 'N2D_CPUS' exceeded. Limit: 500.0 in region europe-west1.
  2. The Nomad Autoscaler periodically checks the MIG but doesn't go any further, because the MIG is not ready (stuck trying to scale up): [TRACE] policy_manager.policy_handler: target is not ready: policy_id=c00c0934-a44e-c3eb-5200-e71e44255633

In our case we requested a vCPU quota increase and that solved the problem: the MIG finished the scale-up it had started many hours earlier, the Nomad Autoscaler then saw the MIG as ready and started functioning properly again (in our case scaling down, as the load had decreased a lot).

I would like to know whether this could have been handled better; maybe the autoscaler could check the MIG errors, or force a scaling event (not sure if that's possible).

lgfa29 commented 9 months ago

Hi @sam-sla 👋

Do you know what would happen if you tried to scale a MIG in that state? I would guess a scale-down might work, and it would probably be desirable if possible 🤔

But I'm also not sure how to detect this state. Currently we check the isStable flag to determine whether the target is ready for scaling: https://github.com/hashicorp/nomad-autoscaler/blob/a4e70bcdba4995e4edc24628f6a9806bc33b4270/plugins/builtin/target/gce-mig/plugin/mig.go#L75
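Roughly, that check boils down to something like this (a simplified sketch using the compute/v1 Go client, not the exact plugin code; project, zone, and name are placeholders):

```go
package migstatus

import (
	"context"
	"fmt"

	"google.golang.org/api/compute/v1"
)

// isMigStable reports whether GCE considers the managed instance group
// stable. While the MIG has pending create actions (for example ones stuck
// on QUOTA_EXCEEDED), Status.IsStable stays false, and the autoscaler
// treats the target as not ready.
func isMigStable(ctx context.Context, svc *compute.Service, project, zone, name string) (bool, error) {
	mig, err := svc.InstanceGroupManagers.Get(project, zone, name).Context(ctx).Do()
	if err != nil {
		return false, fmt.Errorf("failed to get MIG %q: %w", name, err)
	}
	return mig.Status != nil && mig.Status.IsStable, nil
}
```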

Do you know if there's a field we can check for an error like this?

sam-sla commented 9 months ago

Hi @lgfa29,

No, unfortunately I don't. I should have tried making a new scale call to the MIG myself to see what would happen, but I didn't think of that :/ Possibly an explicit call to scale down could have overridden the scale-up it was stuck on.

What I was thinking is whether the autoscaler, when it detects that the MIG hasn't been stable/ready for a while, could check for MIG errors and at least log some information. There is a listErrors endpoint: https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/listErrors
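A minimal sketch of how that endpoint could be used from Go, assuming the compute/v1 client's InstanceGroupManagers.ListErrors call and its error fields (this is not code from the autoscaler; project, zone, and name are placeholders, and only the first page of errors is read):

```go
package migstatus

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/api/compute/v1"
)

// logMigErrors queries the instanceGroupManagers.listErrors API and logs
// any recent errors (for example QUOTA_EXCEEDED) so a stuck MIG is at
// least visible in the logs.
func logMigErrors(ctx context.Context, svc *compute.Service, project, zone, name string) error {
	resp, err := svc.InstanceGroupManagers.ListErrors(project, zone, name).Context(ctx).Do()
	if err != nil {
		return fmt.Errorf("failed to list errors for MIG %q: %w", name, err)
	}
	for _, item := range resp.Items {
		if item.Error != nil {
			log.Printf("MIG %s error: code=%s message=%s", name, item.Error.Code, item.Error.Message)
		}
	}
	return nil
}
```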

sam-sla commented 7 months ago

Hello again. We hit this same problem today, and this time I tried changing the MIG size directly in GCP, and it worked. After lowering the size of the group, the MIG got into a ready state again and the Nomad Autoscaler recovered.
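For reference, a programmatic equivalent of that manual resize would be something like this sketch using the compute/v1 Go client's Resize call (a hypothetical helper with placeholder arguments, not code from the autoscaler):

```go
package migstatus

import (
	"context"
	"fmt"

	"google.golang.org/api/compute/v1"
)

// resizeMig explicitly sets the MIG's target size, which is roughly what
// lowering the group size in the GCP console does.
func resizeMig(ctx context.Context, svc *compute.Service, project, zone, name string, size int64) (*compute.Operation, error) {
	op, err := svc.InstanceGroupManagers.Resize(project, zone, name, size).Context(ctx).Do()
	if err != nil {
		return nil, fmt.Errorf("failed to resize MIG %q: %w", name, err)
	}
	// The returned operation can be polled until the resize completes.
	return op, nil
}
```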

So GCP is also being sneaky here, staying stuck constantly retrying the scale-up even when the CPU quota has been hit.