Open jokramer123 opened 3 years ago
Thanks, @jokramer123, we're looking at this and we'll get back to you.
Hi @jokramer123,
Thanks for the detailed report, and I'm sorry to see the Autoscaler get blocked. It seems like it restarted during a cluster scale-in event and left some nodes marked as ineligible without removing them from the ASG.
If there are nodes being modified (e.g. being drained or ineligible for scheduling), the Autoscaler will not modify the cluster, to avoid conflicting with other changes that could be in progress (like some other external automation, manual maintenance, etc.). This can be observed in the log message that you highlighted:
node pool status readiness check failed: error="node e02cca8a-6b21-3b1b-3e40-445f0edb6dd3 is ineligible"
https://github.com/hashicorp/nomad-autoscaler/issues/191 is the item we're using to track work for reconciling scaling actions after the Autoscaler is restarted.
To get you unblocked I think you can do these steps:
Your job uses `template` and `vault` blocks. I would recommend setting their `change_mode` [0][1] to `signal` and their `change_signal` [0][1] to `SIGHUP`; this way the Autoscaler will reload its configuration when their values change. See if these changes help you.
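For reference, here's a rough sketch of what that could look like in the Autoscaler task (the driver, Vault policy name, secret path, and destination below are only examples, not values taken from your job):

```hcl
task "autoscaler" {
  driver = "docker" # example driver

  # Example vault block: signal the task on token changes instead of restarting it.
  vault {
    policies      = ["nomad-autoscaler"] # example policy name
    change_mode   = "signal"
    change_signal = "SIGHUP"
  }

  # Example template: renders the Nomad ACL token from Vault and signals the
  # Autoscaler to reload when the rendered value changes.
  template {
    data = <<EOF
NOMAD_TOKEN={{ with secret "nomad/creds/autoscaler" }}{{ .Data.secret_id }}{{ end }}
EOF
    destination   = "secrets/nomad.env"
    env           = true
    change_mode   = "signal"
    change_signal = "SIGHUP"
  }
}
```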
Hello @lgfa29 and @cgbaker,
Thank you for this information. #191 has been open for some time now; this is definitely a must-have.
Some time after the autoscaler respawned, it performed a scale-in action against the target autoscaling group, which had an active Nomad allocation. This resulted in the allocation becoming 'lost'. I just want to point this out in relation to #191 since this can be pretty disruptive.
Unfortunately the logs I had are gone but I managed to grab a few screenshots:
Thanks for the extra info!
> Thank you for this information. #191 has been open for some time now; this is definitely a must-have.
Yes, #191 is a bit old and stale because it's something we flagged early as a TODO. It also requires work in Nomad itself that is currently underway.
> Some time after the autoscaler respawned, it performed a scale-in action against the target autoscaling group, which had an active Nomad allocation. This resulted in the allocation becoming 'lost'. I just want to point this out in relation to #191 since this can be pretty disruptive.
If I understand what you described correctly, the allocation being lost is expected. The Autoscaler will pick a random client to be removed and will then drain it and deregister it from the ASG. We are thinking of ways to allow better control over which clients get removed on scale in, but in general it's expected that some allocations might have to be rescheduled if cluster scaling is being used.
We should make this more clear in our documentation.
To clarify, those allocations are 'lost' because the node was terminated before the drain (1 hr) could complete. To me this is unexpected.
Thanks for the clarification, sorry for missing this point in your last message.
The 1h force drain is a Nomad default, but for the Autoscaler it's set to 15min. You can increase and customize this value per policy.
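For example, the drain deadline can be set on the policy's target block; a rough sketch of a cluster scaling policy (the ASG name, node class, query, and values are placeholders):

```hcl
scaling "cluster_policy" {
  enabled = true
  min     = 1
  max     = 10

  policy {
    cooldown            = "5m"
    evaluation_interval = "1m"

    # Placeholder check; use whatever Datadog query currently drives scaling.
    check "queued_work" {
      source = "datadog"
      query  = "avg:example.queued_jobs{*}" # placeholder query

      strategy "target-value" {
        target = 5
      }
    }

    target "aws-asg" {
      aws_asg_name        = "example-nomad-clients" # placeholder
      node_class          = "worker"                # placeholder
      node_drain_deadline = "1h"                    # raise from the Autoscaler's 15m default
    }
  }
}
```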
We will clarify this distinction in our docs, as it can be surprising and unexpected.
Hello,
We are currently using the autoscaler as a Nomad job that runs inside a namespace and has a scaling policy that references an AWS autoscaling group for a target. The autoscaler job is using an IAM role and retrieves a Nomad ACL token from Vault. Eventually, it seems like the nomad-autoscaler stops communicating with the metric source (Datadog).
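For reference, the Autoscaler agent configuration behind a setup like this looks roughly like the sketch below (addresses, keys, region, and paths are placeholders, and the exact plugin config keys may vary between Autoscaler versions):

```hcl
# Sketch of a nomad-autoscaler agent configuration (all values are placeholders).
nomad {
  address = "http://127.0.0.1:4646"
  token   = "<Nomad ACL token rendered from Vault>"
}

# Datadog APM plugin used as the metric source.
apm "datadog" {
  driver = "datadog"
  config = {
    dd_api_key = "<datadog api key>"
    dd_app_key = "<datadog app key>"
  }
}

# AWS ASG target; credentials come from the task's IAM role.
target "aws-asg" {
  driver = "aws-asg"
  config = {
    aws_region = "us-east-1" # placeholder
  }
}

policy {
  dir = "/etc/nomad-autoscaler/policies" # placeholder
}
```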
Nomad Enterprise v1.0.2, Autoscaler v0.3.0.
I notice this happens after the autoscaler has performed a number of scaling activities on the autoscaling group; eventually there are a number of Nomad nodes that are ineligible (since their work was done and the autoscaler drained and terminated them).
In the log the autoscaler seems to be stuck evaluating these nodes and never begins querying for the metric to scale up again:
Over the last 30 minutes the autoscaler has effectively failed to scale up even though there are valid jobs queued and unable to be allocated:
Even if I stop the autoscaler and respawn it, it re-enters this state and cannot conduct scaling activities.
Any help would be appreciated.
Here is my Nomad job which runs the autoscaler:
Here is a full log of the autoscaler just getting stuck and not doing anything:
Thank you