hashicorp / nomad-autoscaler

Nomad Autoscaler brings autoscaling to your Nomad workloads.

keep scaling when nodes are draining #672

Open janory opened 1 year ago

janory commented 1 year ago

Hi! 👋

We recently started to use the Nomad Autoscaler agent and we really like it. 🚀 We are using the Autoscaler with the Nomad APM, aws-asg target and target-value strategy plugins.

We have multiple long-running (1-45 minute) batch jobs on our nodes, and when a scale-in action happens the drain event won't finish until the last batch job completes on the node.

This leads to constant warning messages like this:

2023-07-18T13:17:01.646Z [TRACE] policy_manager.policy_handler: target is not ready: policy_id=4a1d5af4-323a-d939-d208-18672288565c
2023-07-18T13:17:01.646Z [WARN] internal_plugin.aws-asg: node pool status readiness check failed: error="node 872ae150-f1a2-12b1-2197-cd32a3b49546 is draining"
2023-07-18T13:17:01.642Z [TRACE] policy_manager.policy_handler: getting target status: policy_id=4a1d5af4-323a-d939-d208-18672288565c
2023-07-18T13:17:01.642Z [TRACE] policy_manager.policy_handler: tick: policy_id=4a1d5af4-323a-d939-d208-18672288565c

because the Autoscaler implicitly checks the ASG target's status for each tick (handleTick -> generateEvaluation -> Status -> IsPoolReady -> FilterNodes -> if node.Drain).
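For anyone following along, that readiness check behaves roughly like the sketch below. This is illustrative only, based on the call chain above; names and signatures are simplified and do not match the real source exactly.

// Illustrative sketch of the per-tick readiness check described above
// (handleTick -> generateEvaluation -> Status -> IsPoolReady -> FilterNodes).
// Simplified; not the actual autoscaler source.
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

// filterNodes fails as soon as any node in the pool is draining, which is
// what produces the "node ... is draining" warning and reports the target
// as not ready, pausing any further scaling actions.
func filterNodes(nodes []*api.NodeListStub) ([]*api.NodeListStub, error) {
	out := make([]*api.NodeListStub, 0, len(nodes))
	for _, n := range nodes {
		if n.Drain {
			return nil, fmt.Errorf("node %s is draining", n.ID)
		}
		out = append(out, n)
	}
	return out, nil
}

func main() {
	pool := []*api.NodeListStub{
		{ID: "node-a"},
		{ID: "node-b", Drain: true}, // a single draining node blocks the whole pool
	}
	if _, err := filterNodes(pool); err != nil {
		fmt.Println("target not ready:", err)
	}
}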

Based on the comment here, and also on what we are experiencing, the Autoscaler stops any further scaling actions until all draining activities are completed.

This is an issue for us, because in the worst-case scenario the long-running batch jobs will prevent us from scaling for 45 minutes.

Would it be possible to add a config for the idFn function to filter out draining nodes and keep scaling?

We would also like to better understand what the risks are of scaling a cluster that has draining nodes, and why such a cluster is considered unstable.

janory commented 1 year ago

I was thinking about something like this: https://github.com/hashicorp/nomad-autoscaler/pull/679. This alone probably won't be enough, though, because even if this part passes, the processLastActivity call would set the Ready flag to false.

tgross commented 1 year ago

Hi @janory!

We would also like to better understand what the risks are of scaling a cluster that has draining nodes, and why such a cluster is considered unstable.

I think the major challenge here is that the nodes might be draining for reasons outside the control of the autoscaler. Maybe you've run nomad node drain -enable :node_id out of band so that software on the host can be upgraded, and the plan is to return it to work immediately afterwards. Or maybe the host is having unrecoverable problems unrelated to scale in/out and you've drained it so that you can decommission it afterwards. Either way, the autoscaler would need to know whether to count that node in the total capacity or not.

If we do decide to ignore this check, then we need to adjust our expectations of what plugins return as node count. For example, if there are 5 instances in an ASG, but 2 are draining, maybe the policy calculation should only count 3 nodes to account for either of those two situations?
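In other words, something like this hypothetical count (a sketch of the idea, not existing plugin behavior):

// Hypothetical sketch: report only non-draining instances as the node count
// handed to the policy calculation, so 5 ASG instances with 2 draining
// would be counted as 3. Not existing plugin behavior.
package main

import "fmt"

type instance struct {
	id       string
	draining bool
}

func countableNodes(pool []instance) int {
	count := 0
	for _, i := range pool {
		if !i.draining {
			count++
		}
	}
	return count
}

func main() {
	pool := []instance{
		{"i-1", false}, {"i-2", false}, {"i-3", false},
		{"i-4", true}, {"i-5", true},
	}
	fmt.Println(countableNodes(pool)) // 3
}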

douglaje commented 1 year ago

Hi @tgross, we've run into this issue/constraint as well. After moving to AWS spot instances, which can receive interruption notices at any moment (and nearly continuously if you've got a large enough mixed cluster), our autoscaler would stop scaling for up to half an hour at a time (due to any node in the cluster being draining/initializing/other-than-ready) and we'd totally blow our SLA.

For us, the bigger sin than not scaling precisely is not scaling quickly. We don't mind underestimating capacity, so we've customized the aws_asg and nomad_apm plugins so that FilterNodes no longer errors on non-ready nodes (it excludes them instead).
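Conceptually the change is along these lines (an illustrative sketch of the exclude-instead-of-error behavior, not the actual patch):

// Illustrative sketch only, not the actual patch: non-ready or draining
// nodes are dropped from the pool instead of failing the readiness check.
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func filterReadyNodes(nodes []*api.NodeListStub) []*api.NodeListStub {
	ready := make([]*api.NodeListStub, 0, len(nodes))
	for _, n := range nodes {
		if n.Drain || n.Status != "ready" {
			continue // exclude the node rather than returning an error
		}
		ready = append(ready, n)
	}
	return ready
}

func main() {
	pool := []*api.NodeListStub{
		{ID: "node-a", Status: "ready"},
		{ID: "node-b", Status: "ready", Drain: true},
		{ID: "node-c", Status: "initializing"},
	}
	fmt.Printf("%d of %d nodes counted\n", len(filterReadyNodes(pool)), len(pool))
}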

It might be nice to be able to provide a strictness=ignore_unstable option to the autoscaler plugins to selectively override certain cautious behaviors built into the autoscaler, but part of the problem is that this check happens in nearly every plugin (both the apm and target plugins, in our case), and my Golang experience is minimal at best.

lgfa29 commented 9 months ago

Thank you for the extra input, @douglaje.

I've experimented with bypassing these checks, but I'm still unsure about their impact. The biggest blocker here is that a policy is not allowed to be evaluated in parallel, meaning that only a single scaling action is allowed to happen at a time. But if you have multiple policies targeting the same set of nodes, or if the scaling action takes so long that the evaluation times out, then this can be bypassed as well.

I've opened #811 to start some discussion around this. As I mentioned, I'm still unsure about it, so I'm at least marking these new configuration options as experimental and we will probably not document them for now. If you would be willing to try them, we could perhaps consider merging it.

For reference, this is the policy file I used for testing. I split scaling up and scaling down into two different policies so the actions could, in theory, happen at the same time. Another important thing about the AWS ASG target plugin is that ASG events also affect the cooldown, so you also need different values there.

scaling "cluster_up" {
  enabled = true
  min     = 1
  max     = 4

  policy {
    cooldown            = "3s"
    evaluation_interval = "10s"

    check "up" {
      source = "prometheus"
      query  = "sum(nomad_client_allocations_running)/count(nomad_client_allocations_running)"

      strategy "threshold" {
        lower_bound = 3.9
        delta       = 1
      }
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "hashistack-nomad_client"
      node_class          = "hashistack"
      node_drain_deadline = "10m"

      # EXPERIMENTAL.
      node_filter_ignore_drain = true
      ignore_asg_events        = true
    }
  }
}

scaling "cluster_down" {
  enabled = true
  min     = 1
  max     = 4

  policy {
    cooldown            = "10s"
    evaluation_interval = "10s"

    check "down" {
      source = "prometheus"
      query  = "sum(nomad_client_allocations_running)/count(nomad_client_allocations_running)"

      strategy "threshold" {
        upper_bound = 3.1
        delta       = -1
      }
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "hashistack-nomad_client"
      node_class          = "hashistack"
      node_drain_deadline = "10m"

      # EXPERIMENTAL.
      node_filter_ignore_drain = true
      ignore_asg_events        = true
    }
  }
}