peter-lockhart-pub opened this issue 5 months ago
Could there be something to do with reusing the targets
which act upon different ASGs? E.g. the target
```hcl
target "aws-asg-us-east-1" {
  driver = "aws-asg"
  config = {
    aws_region = "us-east-1"
  }
}
```
will be used by different policies:
```hcl
target "aws-asg-us-east-1" {
  dry-run                = "false"
  aws_asg_name           = "myclass1-asg"
  node_class             = "myclass1"
  datacenter             = "us-east-1"
  node_drain_deadline    = "10h"
  node_selector_strategy = "empty_ignore_system"
}

target "aws-asg-us-east-1" {
  dry-run                = "false"
  aws_asg_name           = "myclass2-asg"
  node_class             = "myclass2"
  datacenter             = "us-east-1"
  node_drain_deadline    = "10h"
  node_selector_strategy = "empty_ignore_system"
}

target "aws-asg-us-east-1" {
  dry-run                = "false"
  aws_asg_name           = "myclass3-asg"
  node_class             = "myclass3"
  datacenter             = "us-east-1"
  node_drain_deadline    = "36h"
  node_selector_strategy = "empty_ignore_system"
}
```
I think this is a red herring. I found more reproductions today where a policy is not evaluated again for 8 minutes (instead of 1 min 40 s), and in between there are no other scalings of other policies. The log lines are the same as those provided above.
Hi @peter-lockhart-pub and thanks for raising this issue, and apologies for the delayed response. My first thought was that it's possible all the workers are busy doing other scaling work, as described in https://github.com/hashicorp/nomad-autoscaler/issues/348, but your comment "in between there are no other scalings of other policies" suggests this might not be the case.
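If worker saturation were the cause, the number of evaluation workers can be raised in the agent configuration. A hedged sketch only: the `policy_eval` block and its `workers` map are assumed here from the agent configuration docs, so check the documentation for the exact keys and defaults:

```hcl
# Agent configuration sketch (assumption: policy_eval.workers
# controls the number of concurrent evaluation workers per
# policy type, as described in the agent docs).
policy_eval {
  workers = {
    cluster    = 10
    horizontal = 10
  }
}
```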
It would be useful if we could get some pprof data when you experience this issue, particularly the goroutine dump for the nomad-autoscaler application. This API page has information on the available endpoints and required configuration. This can be pasted into this ticket, sent to our
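As a rough sketch of capturing that data, assuming the agent exposes Go's standard `net/http/pprof` endpoints when debug is enabled (both the `enable_debug` flag and the endpoint path are assumptions here; confirm them against the agent API docs the comment above refers to):

```hcl
# Agent configuration sketch. Assumption: an enable_debug flag,
# similar to Nomad's agent, guards the pprof endpoints.
enable_debug = true

http {
  bind_address = "127.0.0.1"
  bind_port    = 8080
}
```

With that in place, something like `curl "http://127.0.0.1:8080/debug/pprof/goroutine?debug=2"` (the path used by Go's standard pprof handler) would dump all goroutine stacks at the moment the delay is observed.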
I wonder if a followup and slightly related feature would be to add scaling policy priorities, similar to how Nomad evaluations and job priorities work when queueing in the broker.
Hey @jrasell, I am returning from a long holiday and so will catch up on this ASAP. To complete the loop, there is an internal ticket for this as well: 145463.
Thanks @jrasell, I have caught the reproduction, and within 10 minutes I captured some of the debug profiles and attached them to the ticket mentioned above.
Given an autoscaler using v0.4.3, with the following setup:
and 6 scaling policies following this template:
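(The reporter's actual template is not included in this capture. As a hypothetical illustration only, based on the target blocks quoted earlier, a file-based cluster scaling policy of this shape might look like the following; the check query and thresholds are placeholders, not the reporter's values:)

```hcl
# Hypothetical example only; not the reporter's actual template.
scaling "cluster_policy_myclass1" {
  enabled = true
  min     = 1
  max     = 20

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "high_cpu" {
      source = "prometheus"
      query  = "..." # placeholder: per-ASG CPU utilisation query

      strategy "target-value" {
        target = 70
      }
    }

    target "aws-asg-us-east-1" {
      dry-run                = "false"
      aws_asg_name           = "myclass1-asg"
      node_class             = "myclass1"
      datacenter             = "us-east-1"
      node_drain_deadline    = "10h"
      node_selector_strategy = "empty_ignore_system"
    }
  }
}
```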
We sometimes see that policies set to evaluate every 1-2 minutes are not evaluated that often. The observable behaviour is that our graphs show the ASG should be scaling out because it is running at high CPU (as per some of the checks in the policy), but it isn't until much later (e.g. 5-15 minutes) that the autoscaler evaluates the policy and discovers it needs to scale out. It's hard to tell how often this happens, as the only time we notice is when an alert fires because Nomad jobs fail to place on full nodes; it may be happening frequently but with fewer consequences than the rare times our nodes fill up.
We have 3 ASGs in us-east-1, and 3 ASGs in us-west-2. So our autoscaler has 6 policies, one for each ASG.
Our policies are set to evaluate every 1-2 minutes, but sometimes we observe that it does not evaluate as frequently as that. Each policy directly maps to 1 ASG. After finding a past reproduction of the issue, I filtered for the problematic policy ID and observed the following logs:
Observe that it is 10 minutes again before this is re-evaluated. Also note the large gap in time between the following two logs, with no other logs for that policy ID in between:
Given there are more workers than policies, what else could be stopping this policy from being evaluated as frequently as it should? What is the delay between the policy being placed in cooldown and it being queued again? I have the full log lines and can share them directly if you think that would help. From a cursory glance through the many log lines, there continue to be many policy checks for other policies, and 3 scale-ins on other policies scattered between 14:31 and 14:41.
Many thanks.