peter-lockhart-pub opened this issue 5 months ago
Could there be something to do with reusing the targets
which act upon different ASGs? E.g. the target
```hcl
target "aws-asg-us-east-1" {
  driver = "aws-asg"
  config = {
    aws_region = "us-east-1"
  }
}
```
will be used by different policies:
```hcl
target "aws-asg-us-east-1" {
  dry-run                = "false"
  aws_asg_name           = "myclass1-asg"
  node_class             = "myclass1"
  datacenter             = "us-east-1"
  node_drain_deadline    = "10h"
  node_selector_strategy = "empty_ignore_system"
}

target "aws-asg-us-east-1" {
  dry-run                = "false"
  aws_asg_name           = "myclass2-asg"
  node_class             = "myclass2"
  datacenter             = "us-east-1"
  node_drain_deadline    = "10h"
  node_selector_strategy = "empty_ignore_system"
}

target "aws-asg-us-east-1" {
  dry-run                = "false"
  aws_asg_name           = "myclass3-asg"
  node_class             = "myclass3"
  datacenter             = "us-east-1"
  node_drain_deadline    = "36h"
  node_selector_strategy = "empty_ignore_system"
}
```
I think this is a red herring. I found more reproductions today where a policy is not evaluated again for 8 minutes (instead of 1 min 40 s), and in between there are no other scalings of other policies. The log lines are the same as those provided above.
Hi @peter-lockhart-pub and thanks for raising this issue, and apologies for the delayed response. My first thought was that it's possible all the workers are busy doing other scaling work, as described in https://github.com/hashicorp/nomad-autoscaler/issues/348, but your comment "in between there are no other scalings of other policies" suggests this might not be the case.
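If worker saturation were the cause, the number of evaluation workers can be raised in the agent configuration. A hedged sketch only: the `policy_eval` block and its `workers` map are assumed here from the agent configuration docs, so check the documentation for the exact keys and defaults:

```hcl
# Agent configuration sketch (assumption: policy_eval.workers
# controls the number of concurrent evaluation workers per
# policy type, as described in the agent docs).
policy_eval {
  workers = {
    cluster    = 10
    horizontal = 10
  }
}
```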
It would be useful if we could get some pprof data when you experience this issue, particularly the goroutine dump for the nomad-autoscaler application. This API page has information on the available endpoints and required configuration. This can be pasted into this ticket, sent to our
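As a rough sketch of capturing that data, assuming the agent exposes Go's standard `net/http/pprof` endpoints when debug is enabled (both the `enable_debug` flag and the endpoint path are assumptions here; confirm them against the agent API docs the comment above refers to):

```hcl
# Agent configuration sketch. Assumption: an enable_debug flag,
# similar to Nomad's agent, guards the pprof endpoints.
enable_debug = true

http {
  bind_address = "127.0.0.1"
  bind_port    = 8080
}
```

With that in place, something like `curl "http://127.0.0.1:8080/debug/pprof/goroutine?debug=2"` (the path used by Go's standard pprof handler) would dump all goroutine stacks at the moment the delay is observed.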
I wonder if a followup and slightly related feature would be to add scaling policy priorities, similar to how Nomad evaluations and job priorities work when queueing in the broker.
Hey @jrasell, I am returning from a long holiday and so will catch up on this ASAP. To complete the loop, there is an internal ticket for this as well: 145463.
Thanks @jrasell, I have caught the reproduction, and within 10 minutes I captured some of the debug profiles and attached them to the ticket mentioned above.
Given an autoscaler using v0.4.3, with the following setup:
and 6 scaling policies following this template:
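(The reporter's actual template is not included in this capture. As a hypothetical illustration only, based on the target blocks quoted earlier, a file-based cluster scaling policy of this shape might look like the following; the check query and thresholds are placeholders, not the reporter's values:)

```hcl
# Hypothetical example only; not the reporter's actual template.
scaling "cluster_policy_myclass1" {
  enabled = true
  min     = 1
  max     = 20

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "high_cpu" {
      source = "prometheus"
      query  = "..." # placeholder: per-ASG CPU utilisation query

      strategy "target-value" {
        target = 70
      }
    }

    target "aws-asg-us-east-1" {
      dry-run                = "false"
      aws_asg_name           = "myclass1-asg"
      node_class             = "myclass1"
      datacenter             = "us-east-1"
      node_drain_deadline    = "10h"
      node_selector_strategy = "empty_ignore_system"
    }
  }
}
```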
We sometimes see that policies set to evaluate every 1-2 minutes are not evaluated that often. The observable behaviour is that our graphs show the ASG should be scaling out because it is running at high CPU (as per some of the checks in the policy), but it isn't until much later (e.g. 5-15 minutes) that the autoscaler evaluates the policy and discovers it needs to scale out. It's hard to tell how often this happens, as the only time we notice is when an alert fires because Nomad jobs fail to place on full nodes; it may be happening frequently but with fewer consequences than the rare times our nodes fill up.
We have 3 ASGs in us-east-1, and 3 ASGs in us-west-2. So our autoscaler has 6 policies, one for each ASG.
Our policies are set to evaluate every 1-2 minutes, but sometimes we observe that it does not evaluate as frequently as that. Each policy directly maps to 1 ASG. After finding a past reproduction of the issue, I filtered for the problematic policy ID and observed the following logs:
Observe that it is 10 minutes again before this is re-evaluated. Also note the large gap in time between the following two logs, with no other logs for that policy ID in between:
Given there are more workers than policies, what else could be stopping this policy from being evaluated as frequently as it should? What is the delay between the policy being placed in cooldown and it being queued again? I have the full log lines and can share them directly if you think that would help. From a cursory glance through the many log lines, there continue to be many policy checks for other policies, and 3 scale-ins on other policies scattered between 14:31 and 14:41.
Many thanks.