janory opened this issue 1 year ago
I was thinking about something like this: https://github.com/hashicorp/nomad-autoscaler/pull/679
Although this alone probably won't be enough, because even if this part passes, the processLastActivity call would set the Ready flag to false.
Hi @janory!
We would also like to better understand what the risks are of scaling a cluster which has draining nodes, and why such a cluster is considered unstable.
I think the major challenge here is that nodes might be draining for reasons outside the control of the autoscaler. Maybe you've run nomad node drain -enable <node_id> out of band so that software on the host can be upgraded, and the plan is to return it to work immediately afterwards. Or maybe the host is having unrecoverable problems unrelated to scale in/out, and you've drained it so that you can decommission it afterwards. Either way, the autoscaler would need to know whether to count that node in the total capacity or not.
If we do decide to ignore this check, then we need to adjust our expectations of what plugins return as node count. For example, if there are 5 instances in an ASG but 2 are draining, maybe the policy calculation should only count 3 nodes, to account for either of those two situations?
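To make the "count only 3 of 5 nodes" idea concrete, here is a minimal Go sketch of counting only usable capacity. The Node struct and usableCapacity function are illustrative assumptions, not the autoscaler's real types:

```go
package main

import "fmt"

// Node is a trimmed, hypothetical stand-in for the fields the
// autoscaler inspects on a Nomad client node.
type Node struct {
	ID    string
	Ready bool
	Drain bool
}

// usableCapacity counts only nodes that are ready and not draining,
// so draining ASG instances are excluded from the total capacity.
func usableCapacity(nodes []Node) int {
	n := 0
	for _, node := range nodes {
		if node.Ready && !node.Drain {
			n++
		}
	}
	return n
}

func main() {
	// 5 instances in the ASG, 2 of them draining.
	nodes := []Node{
		{ID: "a", Ready: true},
		{ID: "b", Ready: true},
		{ID: "c", Ready: true},
		{ID: "d", Ready: true, Drain: true},
		{ID: "e", Ready: true, Drain: true},
	}
	fmt.Println(usableCapacity(nodes)) // prints 3
}
```

With this approach the policy calculation sees 3 nodes, so a scale-out decision is based on the capacity actually serving work rather than the raw instance count.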
Hi @tgross , we've run into this issue/constraint as well. After moving to AWS spot instances which can receive interruption notices at any moment (and nearly continuously if you've got a large enough mixed cluster), our autoscaler would stop scaling for up to a half hour at a time (due to any node in the cluster being draining/initializing/other-than-ready) and we'd totally blow our SLA.
For us, not scaling quickly is a bigger sin than not scaling exactly. We don't mind underestimating capacity, so we've customized the aws_asg and nomad_apm plugins so that FilterNodes no longer errors on non-ready nodes (it excludes them instead).
It might be nice to be able to provide something like strictness=ignore_unstable to the autoscaler plugins, to selectively override certain cautious behaviors built into the autoscaler. But part of the problem is that this check happens in nearly every plugin (both the APM and target plugins in our case), and my Golang experience is minimal at best.
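The "exclude instead of error" customization described above could be sketched like this in Go. The function name, signature, and the ignoreUnstable flag are assumptions for illustration, not the real plugin API:

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical, trimmed stand-in for a Nomad client node.
type Node struct {
	ID    string
	Drain bool
}

// filterNodes mirrors the idea of the plugins' node-filtering step.
// With ignoreUnstable=false it errors on a draining node, matching the
// cautious built-in behavior; with ignoreUnstable=true it simply drops
// draining nodes and lets scaling continue.
func filterNodes(nodes []Node, ignoreUnstable bool) ([]Node, error) {
	out := make([]Node, 0, len(nodes))
	for _, n := range nodes {
		if n.Drain {
			if !ignoreUnstable {
				return nil, errors.New("node " + n.ID + " is draining; pool not ready")
			}
			continue // strictness=ignore_unstable: skip it instead of failing
		}
		out = append(out, n)
	}
	return out, nil
}

func main() {
	nodes := []Node{{ID: "a"}, {ID: "b", Drain: true}}
	if _, err := filterNodes(nodes, false); err != nil {
		fmt.Println("strict:", err)
	}
	ready, _ := filterNodes(nodes, true)
	fmt.Println("lenient:", len(ready), "usable node(s)")
}
```

The trade-off is the one raised earlier in the thread: the lenient mode keeps scaling responsive, but the autoscaler loses its signal that the pool is mid-change.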
Thank you for the extra input, @douglaje.
I've experimented with bypassing these checks, but I'm still unsure about their impact. The biggest blocker here is that a policy is not allowed to be evaluated in parallel, meaning that only a single scaling action is allowed to happen at a time. But if you have multiple policies targeting the same set of nodes, or if the scaling action takes so long that the evaluation times out, then this can be bypassed as well.
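The "single scaling action at a time" guarantee amounts to serializing evaluations per policy. A minimal Go sketch of that idea, assuming a per-policy in-flight flag (the real autoscaler's locking is more involved; all names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu       sync.Mutex
	inFlight = map[string]bool{}
)

// tryEvaluate runs action only if no evaluation for the same policy is
// already in flight; otherwise it returns false instead of running a
// second evaluation in parallel.
func tryEvaluate(policyID string, action func()) bool {
	mu.Lock()
	if inFlight[policyID] {
		mu.Unlock()
		return false
	}
	inFlight[policyID] = true
	mu.Unlock()

	action()

	mu.Lock()
	inFlight[policyID] = false
	mu.Unlock()
	return true
}

func main() {
	ran := tryEvaluate("cluster_up", func() {
		// Pretend this is a long scaling action; a second tick arriving
		// now is rejected rather than run concurrently.
		if !tryEvaluate("cluster_up", func() {}) {
			fmt.Println("second evaluation rejected while the first is running")
		}
	})
	fmt.Println("first evaluation ran:", ran)
}
```

Splitting scale-up and scale-down into two policies, as in the config below, gives each its own lock, which is exactly how two actions can, in theory, run at the same time.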
I've opened #811 to start some discussion around this. As I mentioned, I'm still unsure about it, so I'm at least marking these new configuration options as experimental, and we will probably not document them for now. If you would be willing to try them, we could perhaps consider merging them.
For reference, this is the policy file I used for testing. I split scaling up and down into two different policies so the actions could, in theory, happen at the same time. Another important detail about the AWS ASG target plugin is that ASG events also affect its cooldown, so you need different values there as well.
```hcl
scaling "cluster_up" {
  enabled = true
  min     = 1
  max     = 4

  policy {
    cooldown            = "3s"
    evaluation_interval = "10s"

    check "up" {
      source = "prometheus"
      query  = "sum(nomad_client_allocations_running)/count(nomad_client_allocations_running)"

      strategy "threshold" {
        lower_bound = 3.9
        delta       = 1
      }
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "hashistack-nomad_client"
      node_class          = "hashistack"
      node_drain_deadline = "10m"

      # EXPERIMENTAL.
      node_filter_ignore_drain = true
      ignore_asg_events        = true
    }
  }
}

scaling "cluster_down" {
  enabled = true
  min     = 1
  max     = 4

  policy {
    cooldown            = "10s"
    evaluation_interval = "10s"

    check "down" {
      source = "prometheus"
      query  = "sum(nomad_client_allocations_running)/count(nomad_client_allocations_running)"

      strategy "threshold" {
        upper_bound = 3.1
        delta       = -1
      }
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "hashistack-nomad_client"
      node_class          = "hashistack"
      node_drain_deadline = "10m"

      # EXPERIMENTAL.
      node_filter_ignore_drain = true
      ignore_asg_events        = true
    }
  }
}
```
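The two threshold checks in this policy effectively map the average number of running allocations per node to a scaling delta. A Go sketch of that decision, simplified from the threshold strategy plugin (the exact bound semantics here are an assumption, not the plugin's real code):

```go
package main

import "fmt"

// desiredDelta sketches how the two checks above combine: the "up"
// check fires when average allocations per node reach the lower_bound
// of 3.9, the "down" check fires when they fall under the upper_bound
// of 3.1, and values in between trigger no action.
func desiredDelta(avgAllocsPerNode float64) int {
	switch {
	case avgAllocsPerNode >= 3.9:
		return 1 // "cluster_up" fires: add a node
	case avgAllocsPerNode < 3.1:
		return -1 // "cluster_down" fires: remove a node
	default:
		return 0 // deadband between the bounds: no action
	}
}

func main() {
	for _, v := range []float64{4.0, 3.5, 2.0} {
		fmt.Printf("avg=%.1f -> delta=%d\n", v, desiredDelta(v))
	}
}
```

The gap between 3.1 and 3.9 acts as a deadband, so the two policies don't fight each other by scaling up and down on small metric fluctuations.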
Hi! 👋
We recently started to use the Nomad Autoscaler agent and we really like it. 🚀 We are using the Autoscaler with the Nomad APM, aws-asg target, and target-value strategy plugins. We have multiple long-running (1-45 minutes) batch jobs on our nodes, and when a scale-in action happens the drain event won't finish until the last batch job completes on the node.
This leads to constant warning messages like this:
because the Autoscaler implicitly checks the ASG target's status on each tick (handleTick -> generateEvaluation -> Status -> IsPoolReady -> FilterNodes -> if node.Drain). Based on the comment here, and also based on what we are experiencing, the Autoscaler stops any further scaling actions until all draining activities are completed.
This is an issue for us, because in the worst-case scenario the long-running batch jobs will prevent us from scaling for 45 minutes.
Would it be possible to add a config option for the idFn function to filter out draining nodes and keep scaling? We would also like to better understand what the risks are of scaling a cluster which has draining nodes, and why such a cluster is considered unstable.
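The call chain above boils down to a pool-readiness check where a single draining node stalls everything. A minimal Go sketch of that behavior, assuming simplified types (the real logic lives in the target plugins and is more involved):

```go
package main

import "fmt"

// Node is a hypothetical, trimmed stand-in for a Nomad client node.
type Node struct {
	ID    string
	Drain bool
}

// isPoolReady sketches the IsPoolReady -> FilterNodes -> if node.Drain
// chain: one draining node makes the whole pool "not ready", which
// blocks every further scaling action until the drain completes.
func isPoolReady(nodes []Node) bool {
	for _, n := range nodes {
		if n.Drain {
			return false // a single draining node stalls the whole pool
		}
	}
	return true
}

func main() {
	pool := []Node{{ID: "a"}, {ID: "b", Drain: true}}
	fmt.Println(isPoolReady(pool)) // prints false: scaling is blocked
}
```

This is why a 45-minute batch job on one draining node can hold every other scaling decision hostage for the full drain duration.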