hashicorp / nomad-autoscaler

Nomad Autoscaler brings autoscaling to your Nomad workloads.
Mozilla Public License 2.0

`node_class` is required for target checks even when `datacenter` is set #592

Open protochron opened 2 years ago

protochron commented 2 years ago

It seems that even though the code doesn't require it, the autoscaler will not select any nodes unless `node_class` is set in a target. The check errors out with the message `failed to query source: no nodes identified within pool`. I had hoped to let the autoscaler manage all nodes in a single datacenter, since both the docs and the code imply that this is supported.

Example config:

scaling "aws_cluster_policy" {
  enabled = true
  min     = 3
  max     = 20

  policy {
    cooldown            = "2m"
    evaluation_interval = "30s"

    check "cluster_memory" {
      source = "nomad-apm"
      query  = "percentage-allocated_memory"

      strategy "threshold" {
        lower_bound = 90
        delta       = 1
      }
    }

    target "constellations" {
      aws_asg_name = "workers"
      datacenter   = "workers"
      node_purge   = true
      node_class   = "test"
      dry_run      = true
    }
  }
}

Output:

2022-07-06T15:20:58.985Z [DEBUG] policy_eval.worker: fetching current count: id=2efc8c99-1fd5-771b-34ab-05f19e528be4 policy_id=a671710f-0d33-0cd4-9bf8-36cfcf17459e queue=cluster target=workers
2022-07-06T15:20:59.229Z [DEBUG] policy_eval.worker.check_handler: received policy check for evaluation: check=cluster_memory id=2efc8c99-1fd5-771b-34ab-05f19e528be4 policy_id=a671710f-0d33-0cd4-9bf8-36cfcf17459e queue=cluster source=nomad-apm strategy=threshold target=workers
2022-07-06T15:20:59.229Z [DEBUG] policy_eval.worker.check_handler: querying source: check=cluster_memory id=2efc8c99-1fd5-771b-34ab-05f19e528be4 policy_id=a671710f-0d33-0cd4-9bf8-36cfcf17459e queue=cluster source=nomad-apm strategy=threshold target=workers query=node_percentage-allocated_memory//class source=nomad-apm
2022-07-06T15:20:59.229Z [DEBUG] internal_plugin.nomad-apm: performing node pool APM query: query=node_percentage-allocated_memory//class
2022-07-06T15:20:59.233Z [WARN]  policy_eval.worker: failed to run check: id=2efc8c99-1fd5-771b-34ab-05f19e528be4 policy_id=a671710f-0d33-0cd4-9bf8-36cfcf17459e queue=cluster target=workers check=cluster_memory on_error="" on_check_error="" error="failed to query source: no nodes identified within pool"
2022-07-06T15:20:59.233Z [DEBUG] policy_eval.worker: no checks need to be executed: id=2efc8c99-1fd5-771b-34ab-05f19e528be4 policy_id=a671710f-0d33-0cd4-9bf8-36cfcf17459e queue=cluster target=workers
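
The query in the log lines above ends in `//class`, which suggests that the pool identifier value was empty when the check ran. Here is a minimal Go sketch for pulling that query apart, assuming it has the shape `node_<metric>/<value>/<key>` as it appears in the log. The `poolQuery` type and `splitNodePoolQuery` function are hypothetical names for illustration and are not part of the autoscaler's API.

```go
package main

import (
	"fmt"
	"strings"
)

// poolQuery is a hypothetical type holding the three parts of a node pool query.
type poolQuery struct {
	Metric string // e.g. "percentage-allocated_memory"
	Value  string // pool identifier value, e.g. a node class name
	Key    string // pool identifier key, e.g. "class"
}

// splitNodePoolQuery splits a query of the assumed form
// "node_<metric>/<value>/<key>". The format is inferred from the log
// output in this issue, not taken from the autoscaler source.
func splitNodePoolQuery(q string) (poolQuery, error) {
	parts := strings.Split(q, "/")
	if len(parts) != 3 {
		return poolQuery{}, fmt.Errorf("expected 3 slash-separated parts, got %d", len(parts))
	}
	return poolQuery{
		Metric: strings.TrimPrefix(parts[0], "node_"),
		Value:  parts[1],
		Key:    parts[2],
	}, nil
}

func main() {
	q, err := splitNodePoolQuery("node_percentage-allocated_memory//class")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// An empty Value means no class was passed along with the query.
	fmt.Printf("metric=%q value=%q key=%q\n", q.Metric, q.Value, q.Key)
}
```

Run against the query from the log, the value part comes back empty while the key is `class`, which is consistent with the "no nodes identified within pool" error that follows.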

Setting the `node_class` field fixes the error, and the autoscaler is then able to identify the nodes to manage with the policy. The fix in my case is pretty straightforward: just set the same `node_class` value on every node in the datacenter. But is this behavior intentional? It doesn't seem so, if the code in https://github.com/hashicorp/nomad-autoscaler/blob/main/sdk/helper/scaleutils/nodepool/nodepool.go#L35-L47 is anything to go by.
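
To make the expected behavior concrete, here is a rough Go sketch of node selection in which `datacenter` and `node_class` are both optional filters. It is an illustration of what this issue asks for, not the autoscaler's actual implementation. The `node` type and `filterNodes` function are hypothetical, and the only names taken from the issue are the `datacenter` and `node_class` target keys and the error message.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical, simplified view of a Nomad client node.
type node struct {
	ID         string
	Datacenter string
	Class      string
}

// filterNodes keeps the nodes that match every identifier present in the
// target config. A missing or empty key is simply not used as a filter,
// so a target with only "datacenter" selects all nodes in that datacenter.
func filterNodes(nodes []node, target map[string]string) ([]node, error) {
	dc, class := target["datacenter"], target["node_class"]
	if dc == "" && class == "" {
		return nil, errors.New("target needs datacenter or node_class")
	}
	var out []node
	for _, n := range nodes {
		if dc != "" && n.Datacenter != dc {
			continue
		}
		if class != "" && n.Class != class {
			continue
		}
		out = append(out, n)
	}
	if len(out) == 0 {
		return nil, errors.New("no nodes identified within pool")
	}
	return out, nil
}

func main() {
	nodes := []node{
		{ID: "a", Datacenter: "workers"},
		{ID: "b", Datacenter: "workers"},
		{ID: "c", Datacenter: "other"},
	}
	// Only datacenter is set, mirroring the use case in this issue.
	pool, err := filterNodes(nodes, map[string]string{"datacenter": "workers"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("selected nodes:", len(pool))
}
```

With this shape, a target that sets only `datacenter = "workers"` selects both nodes in that datacenter without requiring a class on any of them.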

protochron commented 2 years ago

https://github.com/hashicorp/nomad-autoscaler/issues/255 seems related, but it's pretty old and I think predates being able to filter nodes by datacenter.

lgfa29 commented 1 year ago

Thanks for the report @protochron, and apologies for taking this long to get back to you.

I will need some time to investigate this further, but yeah, I think node_class is only one of the possible node selection options, so it should be optional.

And thanks for pointing out #255, I will try to tackle that documentation gap as well 🙂