hashicorp / nomad-autoscaler

Nomad Autoscaler brings autoscaling to your Nomad workloads.

How does target-strategy pick a count? #413

Open · mrkurt opened this issue 3 years ago

mrkurt commented 3 years ago

We're scaling based on concurrent connections, and it seems like the autoscaler adds a lot of allocations for even a minimal change in our metric.

The scaling policy looks like this:

{
  "cooldown": "2m",
  "check": [
    {
      "tcp-8080": [
        {
          "source": "prometheus",
          "query": "max_over_time(\n  (sum(fly_proxy_service_egress_load{app_id=\"3759\",service=\"app-3759-tcp-8080\"}) / \n  count(count by (alloc, app_id) (nomad_firecracker_vm_cpu{app_id=\"3759\"})))[5m])\n",
          "strategy": [
            {
              "target-value": [
                {
                  "target": 4.0
                }
              ]
            }
          ]
        }
      ]
    }
  ],
  "evaluation_interval": "15s"
}

Here's a graph over time; note that the highlighted "load" went from 4.0 to 4.83. I would have liked this to increase the alloc count by 25%, but it seems to have doubled it instead.

[screenshot: chart of the load metric and alloc count over time]

Is there a reasonable strategy for the target count? I could fashion the query in such a way that it's 1 to 1 for desired allocs with a target of 0, if that helps. I'd really like to incrementally add allocs very quickly but not burst quite so intensely with small changes in the metric count.

lgfa29 commented 3 years ago

Hi @mrkurt,

You are right. Looking at the chart you sent, the expected count should've been:

(4.83/4.0) * 16 = 19.32 ~= 20
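
(For reference, a minimal Go sketch of that arithmetic, assuming the target-value strategy multiplies the current count by the metric/target ratio and rounds up; this is not the plugin's actual source, just the formula implied by the expected count above:)

package main

import (
    "fmt"
    "math"
)

// targetValueCount sketches the assumed target-value arithmetic:
// scale the current count by metric/target and round up.
func targetValueCount(metric, target float64, current int64) int64 {
    factor := metric / target
    return int64(math.Ceil(float64(current) * factor))
}

func main() {
    // Numbers from the chart: metric 4.83, target 4.0, 16 running allocs.
    fmt.Println(targetValueCount(4.83, 4.0, 16)) // prints 20
}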

Would you happen to have the output logs of when the scaling took place? Also, would it be possible to tell us which metrics are being used in the plot?

Thank you.

mrkurt commented 3 years ago

Here are some logs from a similar event (chart screenshot omitted):

2021-03-05T23:00:04.462Z [INFO]  policy_eval.worker: scaling target: id=9deb6911-4b84-503f-eb80-e0872719587d policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target from=12 to=13 reason="scaling up because factor is 1.083333" meta=map[nomad_policy_id:d535164b-9710-e968-0fd6-8ffd39bf2781]
2021-03-05T23:00:04.502Z [INFO]  policy_eval.worker: successfully submitted scaling action to target: id=9deb6911-4b84-503f-eb80-e0872719587d policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target desired_count=13
2021-03-05T23:00:04.502Z [INFO]  policy_eval.worker: policy evaluation complete: id=9deb6911-4b84-503f-eb80-e0872719587d policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target
2021-03-05T23:02:04.546Z [INFO]  policy_eval.worker: scaling target: id=e45aba8b-d34f-0b41-f709-837f3854dff0 policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target from=13 to=15 reason="scaling up because factor is 1.083333" meta=map[nomad_policy_id:d535164b-9710-e968-0fd6-8ffd39bf2781]
2021-03-05T23:02:04.587Z [INFO]  policy_eval.worker: successfully submitted scaling action to target: id=e45aba8b-d34f-0b41-f709-837f3854dff0 policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target desired_count=15
2021-03-05T23:02:04.587Z [INFO]  policy_eval.worker: policy evaluation complete: id=e45aba8b-d34f-0b41-f709-837f3854dff0 policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target
2021-03-05T23:04:04.632Z [INFO]  policy_eval.worker: scaling target: id=d9735a5b-50a6-3389-3398-864243ceb5d4 policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target from=15 to=18 reason="scaling up because factor is 1.160714" meta=map[nomad_policy_id:d535164b-9710-e968-0fd6-8ffd39bf2781]
2021-03-05T23:04:04.739Z [INFO]  policy_eval.worker: successfully submitted scaling action to target: id=d9735a5b-50a6-3389-3398-864243ceb5d4 policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target desired_count=18
2021-03-05T23:04:04.739Z [INFO]  policy_eval.worker: policy evaluation complete: id=d9735a5b-50a6-3389-3398-864243ceb5d4 policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target
2021-03-05T23:04:34.444Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=e45aba8b-d34f-0b41-f709-837f3854dff0 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:04:49.445Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=d2c8739c-417b-309f-003d-e691c3a0ba4d policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:05:04.445Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=424f4fc6-1c4f-827e-1110-a27976b01919 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:05:19.443Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=d9735a5b-50a6-3389-3398-864243ceb5d4 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:05:34.445Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=853800e8-4304-ccf8-9424-41d4c09adbe9 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:05:49.445Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=e45aba8b-d34f-0b41-f709-837f3854dff0 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:06:04.443Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=a5df9d3c-fa57-0b41-01e5-111bb8ce5c8f policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:06:04.785Z [INFO]  policy_eval.worker: scaling target: id=424f4fc6-1c4f-827e-1110-a27976b01919 policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target from=18 to=21 reason="scaling up because factor is 1.160714" meta=map[nomad_policy_id:d535164b-9710-e968-0fd6-8ffd39bf2781]
2021-03-05T23:06:04.817Z [INFO]  policy_eval.worker: successfully submitted scaling action to target: id=424f4fc6-1c4f-827e-1110-a27976b01919 policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target desired_count=21
2021-03-05T23:06:04.817Z [INFO]  policy_eval.worker: policy evaluation complete: id=424f4fc6-1c4f-827e-1110-a27976b01919 policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target
2021-03-05T23:06:19.444Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=11dff9fd-1a32-96df-61fb-5b8dd3e6ce32 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:06:34.443Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=f8679744-af28-573e-7ba5-b2504d57e061 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:06:49.444Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=b8905de0-7169-c3dd-28c6-614a49e75382 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:07:04.444Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=a5df9d3c-fa57-0b41-01e5-111bb8ce5c8f policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:07:19.445Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=424f4fc6-1c4f-827e-1110-a27976b01919 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:07:34.455Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=d9735a5b-50a6-3389-3398-864243ceb5d4 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:07:49.445Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=853800e8-4304-ccf8-9424-41d4c09adbe9 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:08:04.444Z [WARN]  policy_eval.worker.check_handler: no metrics available: check=tcp-8080 id=b8905de0-7169-c3dd-28c6-614a49e75382 policy_id=0f8ffb8d-2320-30ca-4c44-547c4ea61013 queue=horizontal source=prometheus strategy=target-value target=nomad-target
2021-03-05T23:08:04.861Z [INFO]  policy_eval.worker: scaling target: id=a5df9d3c-fa57-0b41-01e5-111bb8ce5c8f policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target from=21 to=25 reason="scaling up because factor is 1.160714" meta=map[nomad_policy_id:d535164b-9710-e968-0fd6-8ffd39bf2781]
2021-03-05T23:08:04.908Z [INFO]  policy_eval.worker: successfully submitted scaling action to target: id=a5df9d3c-fa57-0b41-01e5-111bb8ce5c8f policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target desired_count=25
2021-03-05T23:08:04.908Z [INFO]  policy_eval.worker: policy evaluation complete: id=a5df9d3c-fa57-0b41-01e5-111bb8ce5c8f policy_id=d535164b-9710-e968-0fd6-8ffd39bf2781 queue=horizontal target=nomad-target

The queries in the screenshot are:

count(count by (alloc, app_id) (nomad_firecracker_vm_cpu{app_id="$app_id"})) by (app_id)

and

max_over_time(  (sum(fly_proxy_service_egress_load{app_id="3759",service="app-3759-tcp-8080"}) /   count(count by (alloc, app_id) (nomad_firecracker_vm_cpu{app_id="3759"})))[5m])

mrkurt commented 3 years ago

That stairstep is interesting. It seems like maybe the count metric isn't updating before the next evaluation, so the autoscaler keeps adding instances?

lgfa29 commented 3 years ago

Thanks for the extra info @mrkurt.

From the chart it seems like there's a significant delay between the task being scaled up and those extra instances taking effect on the metric value.

The staircase pattern occurs when the metric being tracked doesn't change in reaction to the scaling action itself, so the Autoscaler will keep trying to reduce it by increasing count at each iteration.

This is similar to a thermostat system where the thermometer is stuck, so the AC keeps turning on trying to bring the temperature down.

Since you are using a max_over_time query over a 5m window, I think it would take at least 5min for your metric value to go down, so I would suggest either reducing the query's rolling window size or increasing your policy's cooldown value and seeing if that helps.
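
(For illustration only: with everything else unchanged, the policy above would look roughly like this with both knobs adjusted. The 1m window and 6m cooldown are arbitrary example values, not recommendations.)

{
  "cooldown": "6m",
  "check": [
    {
      "tcp-8080": [
        {
          "source": "prometheus",
          "query": "max_over_time(\n  (sum(fly_proxy_service_egress_load{app_id=\"3759\",service=\"app-3759-tcp-8080\"}) / \n  count(count by (alloc, app_id) (nomad_firecracker_vm_cpu{app_id=\"3759\"})))[1m])\n",
          "strategy": [
            {
              "target-value": [
                {
                  "target": 4.0
                }
              ]
            }
          ]
        }
      ]
    }
  ],
  "evaluation_interval": "15s"
}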

Another interesting thing we observed in the chart is that it actually takes more than 5min for the metric to react to a scaling event. For example, the first series of increases (starting at 16:20) took about 15min to actually bring the metric down.

Would you happen to have any intuition as to why this would happen?

mrkurt commented 3 years ago

@lgfa29 that makes some sense. Is there a way to configure it to just run the exact number of allocs a metric returns? There's going to be a delay in our metrics "seeing" new allocs, but I can always compute exactly how many we need total.

I can make target-strategy work with a cooldown period, but I'd prefer to add new allocs as fast as humanly possible. Our metrics get scraped every 15s, a new alloc boots in however much time that takes, and then it takes another 15s (possibly) to "see" it in metrics.

I'm guessing this would be a different strategy than target-strategy?

lgfa29 commented 3 years ago

Yes, that would have to be a new strategy. We are currently collecting use-cases and feedback and will start implementing new plugins soon, so I will make sure this case is in the list (it's a pretty simple one).

In the meantime, if you don't mind doing some "math gymnastics", I think you can set your target to 1 and adjust your metric query so it's divided by the current count. The logic behind it is this:

next_count = (metric / target) * current_count
next_count = ((metric / current_count) / 1) * current_count
next_count = (metric / current_count) * current_count
next_count = metric
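
(Concretely, and assuming 4 connections per alloc is still the desired density, the adjusted query might look roughly like this, paired with "target": 1 in the policy; the max_over_time wrapper from the original query could be kept around the whole expression:)

(sum(fly_proxy_service_egress_load{app_id="3759",service="app-3759-tcp-8080"}) / 4)
/
count(count by (alloc, app_id) (nomad_firecracker_vm_cpu{app_id="3759"}))

With a target of 1, the current count cancels out and the next count works out to total load divided by 4, as in the derivation above.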

I know this is not a good solution, but we should have the new strategies out soon!

mrkurt commented 3 years ago

@lgfa29 I'm not completely following that one! current_count is going to lag just because metrics collection takes a while.

I'll give it a try either way! I actually had a "desired change" version of this with a target of 0. I'll do some testing. :D

mrkurt commented 3 years ago

@lgfa29 just to keep you updated, the core problem we have here is that observing allocation counts from metrics is delayed. If we could use a more accurate count, I think this would work really well for us. This may not require a whole new strategy: if we could specify that target-strategy is per alloc instead of a total, that should do what we need.

lgfa29 commented 3 years ago

Hi @mrkurt, thanks for the update!

the core problem we have here is that observing allocation counts from metrics is delayed

Just to make sure I am following along correctly, this delay is caused by the need to have an evaluation_interval. If the Autoscaler had some kind of "push" flow (like described in #405), this wouldn't be a problem. Is this right?

If we could specify that target-strategy is per alloc instead of a total that should do what we need.

I'm sorry if I'm missing something here, but the target inside the target-value strategy is not actually relative to anything; it only depends on the query you are using.

Your (simplified) query is

max_over_time(fly_proxy_service_egress_load) / count(nomad_firecracker_vm_cpu)

so your target is 4 "egress loads" per "firecracker VM CPU". If you make your query return a value per alloc, your target would be per alloc as well.

I think what I'm missing is what fly_proxy_service_egress_load and nomad_firecracker_vm_cpu actually measure, and how (or if) they relate to allocs.

Would you be able to provide a quick summary on what these metrics represent?

mrkurt commented 3 years ago

A push flow would let us work around this, yes!

I'm using count(nomad_firecracker_vm_cpu) by (alloc_id) as a quick way to count allocations, it could be anything though. The problem is, it takes some time for that to be reflected in the metrics. When new allocs get added, they have to boot and then our prometheus scraper needs to "see" the metrics for them.

So what happens is:

  1. The value goes from 4 to 6
  2. Autoscaler goes from 10 to 15
  3. Allocs start to boot, meanwhile the value is still 6 because the nomad_firecracker_vm_cpu hasn't picked up new allocs yet
  4. Autoscaler increases alloc count from 15 to 22
  5. nomad_firecracker_vm_cpu gets new allocs, now count returns 15 (or maybe even 22)
  6. Value drops to 2
  7. Autoscaler decreases count to ~10
  8. Allocs stop very fast, so value increases to 6 again
  9. Repeat

What would work better for us is if we could give you the total target value and then have the autoscaler divide by the task group count. You always have an accurate task group count. If we can use that instead of relying on our lagging metrics, we can avoid the rubber-band effect.
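
(A minimal Go sketch of the arithmetic being requested here; this strategy doesn't exist in the Autoscaler, it's just an illustration. Dividing the total metric by the task group count and comparing against a per-alloc target is equivalent to dividing the total by the target directly, so the count cancels out:)

package main

import (
    "fmt"
    "math"
)

// perAllocTarget sketches the requested behavior: the query returns the
// *total* metric value, the policy declares a per-alloc target, and the
// desired count falls out directly, with no metric-derived alloc count
// (and therefore no scrape lag) involved.
func perAllocTarget(totalMetric, targetPerAlloc float64) int64 {
    return int64(math.Ceil(totalMetric / targetPerAlloc))
}

func main() {
    // e.g. 77.3 total concurrent connections at 4 per alloc -> 20 allocs.
    fmt.Println(perAllocTarget(77.3, 4.0)) // prints 20
}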

lgfa29 commented 3 years ago

Thanks for the breakdown @mrkurt, and apologies for the delay in getting back to this.

It seems like the push flow would be the biggest win in terms of reacting quickly. I've tentatively scheduled it for our next release, though we might not make it since it still needs quite a bit of investigation. Fingers crossed 🤞

What would work better for us is if we could give you the total target value, and then have the autoscaler divide by taskgroup count.

I think that's the part that still confuses me a bit (sorry 😅). If I understood this right, you could divide whatever metric you are using by nomad.client.allocations.running (or maybe a sum of running and start) to get what you described. Or is the concern that this metric has a bit of lag until it's scraped by your APM?

If the above is enough for your use case, you can then use the newly released pass-through strategy to send the result directly to your target from the Autoscaler.
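
(A rough sketch of what that could look like, reusing the policy shape from the top of the thread. The query simply turns total load into a desired alloc count at 4 connections per alloc, and the exact JSON shape of the pass-through block is assumed here for illustration rather than taken from the docs:)

{
  "cooldown": "2m",
  "check": [
    {
      "tcp-8080": [
        {
          "source": "prometheus",
          "query": "ceil(sum(fly_proxy_service_egress_load{app_id=\"3759\",service=\"app-3759-tcp-8080\"}) / 4)",
          "strategy": [
            {
              "pass-through": [
                {}
              ]
            }
          ]
        }
      ]
    }
  ],
  "evaluation_interval": "15s"
}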

mrkurt commented 3 years ago

Oh nice, the pass-through strategy should do what I need. ;)

lgfa29 commented 3 years ago

Great! Let us know how it goes 🙂