sam-sla opened this issue 10 months ago (status: Open)
Hi @sam-sla 👋
Do you know what would happen if you tried to scale a MIG in that state? My guess is that a scale-down might still work, and it would probably be desirable if possible 🤔
But I'm also not sure how to detect this state. Currently we check the `isStable` flag to determine if the target is ready for scaling:
https://github.com/hashicorp/nomad-autoscaler/blob/a4e70bcdba4995e4edc24628f6a9806bc33b4270/plugins/builtin/target/gce-mig/plugin/mig.go#L75
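For context, a minimal standalone sketch of that readiness check, assuming the google.golang.org/api/compute/v1 client and a zonal MIG (the project, zone, and group names below are placeholders, not plugin code):

```go
// Minimal sketch of the MIG stability check, outside the autoscaler plugin.
// Uses Application Default Credentials; identifiers below are placeholders.
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()

	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("failed to create compute service: %v", err)
	}

	// Hypothetical identifiers used for illustration only.
	project, zone, mig := "my-project", "europe-west1-b", "dev-nomad-client"

	igm, err := svc.InstanceGroupManagers.Get(project, zone, mig).Context(ctx).Do()
	if err != nil {
		log.Fatalf("failed to get MIG: %v", err)
	}

	// This is the readiness signal discussed above; when a scale-up is stuck
	// on a quota error it appears to stay false indefinitely.
	if igm.Status != nil {
		fmt.Printf("MIG %s isStable=%v\n", mig, igm.Status.IsStable)
	}
}
```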
Do you know if there's a field we can check for an error like this?
Hi @lgfa29,
No, unfortunately I don't. I should have tried making a new scale call to the MIG myself to see what would happen, but I didn't think of that :/ Possibly an explicit scale-down call could have overridden the scale-up it was stuck on.
What I was thinking is that the autoscaler could check for MIG errors when it detects that the MIG hasn't become stable/ready after a while, to at least log some information. There is a listErrors endpoint: https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/listErrors
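To make the idea more concrete, here is a rough sketch (not actual autoscaler code) of logging those errors with the google.golang.org/api/compute/v1 Go client for a zonal MIG; I believe regional MIGs have an equivalent RegionInstanceGroupManagers.ListErrors call. The helper name and identifiers are just illustrative:

```go
// Sketch: surface a MIG's recent errors via instanceGroupManagers.listErrors
// when the group never becomes stable. LogMIGErrors is a hypothetical helper.
package migerrors

import (
	"context"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func LogMIGErrors(ctx context.Context, svc *compute.Service, project, zone, mig string) error {
	resp, err := svc.InstanceGroupManagers.ListErrors(project, zone, mig).Context(ctx).Do()
	if err != nil {
		return err
	}

	for _, item := range resp.Items {
		if item.Error == nil {
			continue
		}
		// In the scenario from this issue this should surface something like
		// code=QUOTA_EXCEEDED with the "Quota 'N2D_CPUS' exceeded" message.
		log.Printf("MIG %s error: code=%s message=%s timestamp=%s",
			mig, item.Error.Code, item.Error.Message, item.Timestamp)
	}
	return nil
}
```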
Hello again. We hit this same problem today, and this time I tried changing the MIG size directly in GCP and it worked. After lowering the size of the group, the MIG got into a ready state again and the Nomad Autoscaler recovered.
So GCP is also being sneaky here: the MIG gets stuck constantly retrying the scale-up even when the CPU quota has been hit.
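For completeness, the manual workaround above (changing the group size by hand in the console) could presumably also be done through the same API. Continuing the first sketch (same svc, ctx, project, zone, and mig variables), something like the instanceGroupManagers.resize call below should be the scripted equivalent, though I haven't tried it this way:

```go
// Sketch of the manual workaround: explicitly resize the stuck MIG. Per the
// report above, lowering the size let the group become stable again.
// The target size of 3 is an arbitrary placeholder.
op, err := svc.InstanceGroupManagers.Resize(project, zone, mig, 3).Context(ctx).Do()
if err != nil {
	log.Fatalf("failed to resize MIG: %v", err)
}
log.Printf("resize operation %s status=%s", op.Name, op.Status)
```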
Hello. This is an edge case we got stuck in very recently. During some increased testing we apparently reached the default vCPU quota in our GCP project, but because we didn't have alerts configured yet it went unnoticed for a while, until we realized that the Nomad Autoscaler wasn't scaling down nodes from the cluster.
The scenario:
GCP reported the instance creation failing on quota:

```
QUOTA_EXCEEDED dev-nomad-client-zpgq europe-west1-b Creating Jan 25, 2024, 3:38:35 PM UTC+01:00
Instance 'dev-nomad-client-zpgq' creation failed: Quota 'N2D_CPUS' exceeded. Limit: 500.0 in region europe-west1.
```

while the autoscaler kept logging:

```
[TRACE] policy_manager.policy_handler: target is not ready: policy_id=c00c0934-a44e-c3eb-5200-e71e44255633
```
In our case we requested a vCPU quota increase and that solved the problem: the MIG finished the scale-up event it had started many hours earlier, the Nomad Autoscaler then saw the MIG as ready and started functioning properly again (in our case scaling down, as the load had decreased a lot).
I would like to know if this could have been handled better. Maybe the autoscaler could check the MIG errors, or force a scaling event (not sure if that's possible).