Nomad autoscaling

manveru commented 3 years ago

To make better use of our new clusters, we should take advantage of autoscaling the AWS autoscaling groups. Right now they are hardcoded using the terraform core workspaces.

My proposed approach is to run an instance of the nomad autoscaling daemon on the monitoring server, since it already has access to metrics across the cluster. I haven't had time to look into this yet, so I can't say what kind of configuration this will require, but the core instances will most likely need additional IAM privileges for controlling the number of instances in the groups. Those are still in the per-cluster iam.nix file (which we should probably pull into bitte proper).

[x] Update IAM permissions for core group with whatever is needed for autoscaling
[x] Move iam.nix to bitte repo and import from clusters.
[ ] Ensure the new ZFS-based ASG AMIs are available in all regions (they can only be copied in the development AWS org).
[x] Update the package for nomad-autoscaler in this repo.
[x] Write a NixOS module for it (probably heavily based on https://github.com/hashicorp/nomad-autoscaler/blob/master/demo/remote/aws_autoscaler.nomad and maybe running a consul-template for the config).
[ ] Determine what metrics other than CPU/Memory we need to decide on the required number of instances.
[ ] Bonus points for boosting the startup performance of the ASG AMIs by unifying the caching and utilizing seaweedfs.
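
For the IAM item above, here is a rough sketch of what an additional policy for the core group might look like, written as Python that prints the JSON document. The action list is an assumption based on what an ASG-controlling process typically needs; it is not taken from iam.nix or confirmed against the nomad-autoscaler docs, and `asg_scaling_policy` is a hypothetical helper name.

```python
import json

# Assumed set of actions for resizing an ASG from the autoscaler.
# This list is illustrative only; verify against the plugin documentation.
ASG_ACTIONS = [
    "autoscaling:DescribeAutoScalingGroups",
    "autoscaling:DescribeScalingActivities",
    "autoscaling:UpdateAutoScalingGroup",
    "autoscaling:TerminateInstanceInAutoScalingGroup",
    "autoscaling:CreateOrUpdateTags",
    "ec2:DescribeInstances",
]


def asg_scaling_policy(actions=ASG_ACTIONS, resource="*"):
    """Build an IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # "*" is a placeholder; scoping to the ASG ARNs would be tighter.
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(asg_scaling_policy(), indent=2))
```

In practice the resource should probably be narrowed to the client ASGs of each cluster rather than left as a wildcard.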

See https://github.com/hashicorp/nomad-autoscaler

jonringer commented 3 years ago

I think most of this has been satisfied

cc @nrdxp

nrdxp commented 3 years ago

Looks like a lot of these old issues are pretty stale :sweat_smile:

I checked off what should already be done. Moving the instance to monitoring may actually be a good idea since it can pull the data locally instead of over the network, but I'm not sure if that would actually change anything substantially. I dunno if anyone has gotten around to fixing the AMI since the recent ZFS breaks though.

Right now we are only using CPU and memory to determine scaling actions so there is still potential to improve this aspect. And I would probably add another point:

[ ] setup for dynamically scaling Nomad task groups
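
To make the CPU/memory point concrete, here is a minimal sketch of a target-value style calculation: scale the instance count in proportion to how far each metric is from its target, then take the larger of the two results. The function names, the 70% target, and the min/max bounds are all assumptions for illustration, not our actual policy values.

```python
import math


def desired_count(current, metric, target, min_count, max_count):
    """Scale `current` by metric/target, round up, and clamp to the bounds."""
    if target <= 0:
        raise ValueError("target must be positive")
    raw = math.ceil(current * metric / target)
    return max(min_count, min(max_count, raw))


def desired_for_cluster(current, cpu_pct, mem_pct,
                        target_pct=70.0, min_count=1, max_count=10):
    """Evaluate CPU and memory separately and keep the larger count.

    Picking the larger result is an assumption here: it errs on the side
    of having enough capacity for whichever resource is tighter.
    """
    by_cpu = desired_count(current, cpu_pct, target_pct, min_count, max_count)
    by_mem = desired_count(current, mem_pct, target_pct, min_count, max_count)
    return max(by_cpu, by_mem)


if __name__ == "__main__":
    # 4 instances, CPU at 35%, memory at 90%, target 70%
    print(desired_for_cluster(4, cpu_pct=35.0, mem_pct=90.0))
```

Any extra metric we settle on later would slot in as one more `desired_count` call feeding the same `max`.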

input-output-hk / bitte

Nomad autoscaling #6