Open dpogorzelski opened 3 years ago
There are some finer details, of course, which can make all of this a bit harder than it appears. For example, two servers of the same size, say n1-standard-1, could still have different CPU platforms (Skylake vs Ivy Bridge, etc.), which results in different CPU frequencies and therefore different CPU capacity. Users could, however, pin the instance template to specific CPU models or define each server's capacity via a manual override in the target{} block.
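To make the frequency point concrete: Nomad derives a node's CPU capacity as roughly cores × frequency, so the same machine type can end up with different capacities. A tiny Go sketch with made-up numbers (the frequencies below are illustrative, not real GCP specs):

```go
package main

import "fmt"

// Illustrative base frequencies in MHz; real values depend on the CPU
// platform GCP happens to assign, and these numbers are examples only.
var platformMHz = map[string]int{
	"Ivy Bridge": 2500,
	"Skylake":    2000,
}

func main() {
	const vCPUs = 1 // n1-standard-1
	for platform, mhz := range platformMHz {
		// CPU capacity in the cores-times-frequency sense Nomad uses.
		fmt.Printf("%s: %d MHz total compute\n", platform, vCPUs*mhz)
	}
}
```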
Hi @dpogorzelski 👋
Apologies for the delay on getting back to you on this.
From my understanding, this is already possible using, for example, the nomad.client.allocs.cpu.allocated and nomad.nomad.blocked_evals.job.cpu metrics (or their memory equivalents; in Prometheus the dots become underscores, as in the query below). I will call these allocated and blocked from now on to keep things simple.
As you mentioned, you will also need to provide the "size" of each host, and this shouldn't be a problem since hosts are usually created from a template. The way you pass this information to the Autoscaler, though, is through your query 🙂
In this scenario, the number of machines you will need will always be equal to:

```
(blocked + allocated) / host_size
```

In other words, the number of machines you need must be equal to the amount of resources you are using plus the amount of resources currently blocked. Since we need a number of hosts, we divide this value by the size of each host. And this is your final policy query 🙂
Let's run through some examples to make this formula clearer.
Imagine first this scenario:
```
┌─────┐ ┌─────┐ ┌─────┐ ┌ ┐ ┌ ┐
│ ☐ ☐ │ │ ☐ ☐ │ │ ☐ ☐ │ ☐ ☐ ☐
│ ☐ ☐ │ │ ☐ ☐ │ │ ☐ ☐ │ ☐ ☐ ☐
└─────┘ └─────┘ └─────┘ └ ┘ └ ┘
```
You have 3 hosts that can hold 4 allocations each, so in total you have 12 allocations running. But you also have 6 allocations blocked pending resources. Intuitively we can see that we will need 2 extra hosts, for a total of 5.
Plopping these values into our query we get:
```
(blocked + allocated) / host_size = (6 + 12) / 4 = 4.5
```
Taking the ceiling of this result we end up at our desired 5 instances 🎉
Now let's look at an example of scaling down our cluster. Imagine this scenario:
```
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ ☐ ☐ │ │ ☐ │ │ │ │ │ │ │
│ ☐ ☐ │ │ ☐ │ │ │ │ │ │ │
└─────┘ └─────┘ └─────┘ └─────┘ └─────┘
```
We have 6 allocations running and 0 blocked, with 5 hosts provisioned. Running our query again we get:
```
(blocked + allocated) / host_size = (0 + 6) / 4 = 1.5
```
Taking the ceiling again we have our desired 2 instances 🎉
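For anyone who prefers the arithmetic in executable form, here is a minimal Go sketch of the same formula, checked against the two scenarios above (the hostsNeeded name is mine, not part of the Autoscaler, which does this through the policy query instead):

```go
package main

import (
	"fmt"
	"math"
)

// hostsNeeded implements ceil((blocked + allocated) / host_size).
// The name is illustrative; the Autoscaler computes this value via
// the policy query rather than code like this.
func hostsNeeded(blocked, allocated, hostSize float64) int {
	return int(math.Ceil((blocked + allocated) / hostSize))
}

func main() {
	// Scale up: 12 allocations running, 6 blocked, 4 per host.
	fmt.Println(hostsNeeded(6, 12, 4)) // 5

	// Scale down: 6 running, 0 blocked, 4 per host.
	fmt.Println(hostsNeeded(0, 6, 4)) // 2
}
```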
Since we can tell directly from our query result how many instances we need, we can simply use the pass-through strategy.
Translating all of this into an actual scaling policy would give us something like this (mostly untested, but it's based on our On-demand Batch Job Cluster Autoscaling tutorial):
scaling "my_policy" {
min = 0
max = 5
policy {
check "cluster_size" {
source = "prometheus"
# Using 2000 Mhz per instance as an example.
query = "ceil(sum(nomad_client_allocs_cpu_allocated + nomad_nomad_blocked_evals_job_cpu)/2000)"
strategy "pass-through" {}
}
target "aws-asg" {
# ...
}
}
}
I hope this explanation made sense and answered your question, but feel free to ask anything, or let me know if I misunderstood what you are looking for.
Hello, based on the information I could find it seems like the autoscaler doesn't account for the capacity each new server will have, and is therefore unable to scale up quickly when a large number of pending tasks appears.
In a scenario with X pending tasks where each requires Y amount of resources, if the autoscaler "knew" how much "room" each server will have, it could make optimal scaling decisions by immediately allocating enough servers to satisfy the current resource need. This would reduce the scale-up time to O(1).
At least in GCP the scaling is done on top of MIGs, where each MIG uses an instance template which in turn carries the instance size, so in theory it's possible to detect how much "room" each server has and make a scaling "guess" accordingly. In situations where automated size detection is not available, it might be possible to define it manually in the target{} block, as sketched below.
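A rough Go sketch of that O(1) guess, with entirely hypothetical names (instanceCapacityMHz and serversNeeded don't exist in the Autoscaler; the capacities could come from the instance template or a manual override):

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical capacities per machine type, e.g. read from the MIG's
// instance template or set via a manual override in the target{} block.
// Neither this map nor serversNeeded exists in the Autoscaler today.
var instanceCapacityMHz = map[string]float64{
	"n1-standard-1": 2000,
	"n1-standard-4": 8000,
}

// serversNeeded makes the O(1) scaling "guess": total pending demand
// divided by per-server capacity, rounded up.
func serversNeeded(pendingMHz float64, machineType string) int {
	return int(math.Ceil(pendingMHz / instanceCapacityMHz[machineType]))
}

func main() {
	// e.g. 10 pending tasks, each requesting 1500 MHz of CPU.
	fmt.Println(serversNeeded(10*1500, "n1-standard-4")) // 2
}
```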
my 2 cents :)