superfly / fly-autoscaler

A metrics-based autoscaler for Fly.io
Apache License 2.0

NATS Metric Collector #26

Open benbjohnson opened 5 months ago

benbjohnson commented 5 months ago

As requested by @gedw99, we should add an integration for NATS so we can autoscale based on its stats.
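For reference, here is a rough sketch of what such a collector could look like, assuming we poll the standard NATS HTTP monitoring endpoint (`/varz`, usually on port 8222) and surface one of its stats (client connections) as the value to scale on. The `collectNATSConnections` function and the choice of stat are illustrative only, not an existing fly-autoscaler API.

```go
// Sketch of a NATS stats collector: poll the NATS HTTP monitoring
// endpoint (/varz) and return one stat as the scaling metric.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// varz holds the subset of the /varz response we care about.
type varz struct {
	Connections int   `json:"connections"`
	InMsgs      int64 `json:"in_msgs"`
	OutMsgs     int64 `json:"out_msgs"`
}

// collectNATSConnections returns the current client connection count,
// which could be used as the value to autoscale on.
func collectNATSConnections(monitorURL string) (float64, error) {
	resp, err := http.Get(monitorURL + "/varz")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var v varz
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		return 0, err
	}
	return float64(v.Connections), nil
}

func main() {
	n, err := collectNATSConnections("http://localhost:8222")
	if err != nil {
		panic(err)
	}
	fmt.Printf("nats connections: %v\n", n)
}
```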

benbjohnson commented 5 months ago

@gedw99 Can you add some additional context about what you're trying to do and what you want to scale on?

gedw99 commented 4 months ago

Hey @benbjohnson

Here's my rough plan / needs / ideas. If they are too flimsy or not well enough explained, feel free to ask. Autoscaling with NATS is IMHO hugely powerful, because you can use NATS as the provider of the metrics from your app on Fly.

  1. NATS blue/green upgrades using lame duck mode, so we can do NATS upgrades with no downtime on clusters and superclusters. https://docs.nats.io/running-a-nats-service/nats_admin/lame_duck_mode gives the background. Essentially we need to do an orchestrated upgrade of the NATS Docker image and make sure it is done in the right blue/green order. Maybe this can be achieved with some of the existing Fly API, but I'm not sure yet (see the lame duck sketch after this list).

  2. Logs, traces and metrics run through NATS from your apps, so we can (via NATS) determine what we want to autoscale and in what way. Each dev will want different logic, so I assumed (maybe wrongly) that the Fly autoscaler will be importable into other Golang apps, where a dev can implement whatever crazy logic they need and then have the autoscaler act on it. It does beg the question, though: why not just write Golang code that talks to the Fly App, Machine and Volume APIs yourself?
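For (1), here is a minimal sketch of how the lame duck side could be detected, assuming the nats.go client (its `nats.LameDuckModeHandler` option is real; the orchestration hook is a placeholder, not an existing API):

```go
// Register a LameDuckModeHandler so we are notified when the server we
// are connected to enters lame duck mode, and use that as the trigger
// to start the next step of a blue/green upgrade.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://my-cluster.internal:4222",
		nats.LameDuckModeHandler(func(c *nats.Conn) {
			// The connected server has signalled lame duck mode: it will
			// stop accepting new connections and drain existing ones.
			log.Printf("server %s entered lame duck mode", c.ConnectedServerName())
			// Hypothetical hook: bring up the replacement machine
			// (e.g. via the Fly Machines API) before this one goes away.
			// startReplacementMachine()
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Keep the process alive so the handler can fire.
	time.Sleep(time.Hour)
}
```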

Here is a concrete example: https://github.com/choria-io/asyncjobs is a NATS-based job scheduler that needs Fly to run both NATS and the jobs. I have a use case where I need to run long-running jobs on Fly, with the Fly Machines API allowing me to add and remove VMs.

In order to do this I need to detect RAM and CPU usage on each server where the asyncjobs agent runs. Being a NATS freak, it's easy to have this data (RAM and CPU) go to NATS.
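A sketch of what "RAM and CPU go to NATS" could look like, assuming the nats.go client and gopsutil for the system stats (both assumptions, not part of this repo); the subject name and payload shape are just illustrative:

```go
// Each worker periodically publishes its own utilisation to a
// per-server subject that the autoscaler (or any consumer) can read.
package main

import (
	"encoding/json"
	"log"
	"os"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/shirou/gopsutil/v3/cpu"
	"github.com/shirou/gopsutil/v3/mem"
)

type utilisation struct {
	Host string  `json:"host"`
	CPU  float64 `json:"cpu_pct"`
	RAM  float64 `json:"ram_pct"`
}

func main() {
	nc, err := nats.Connect(os.Getenv("NATS_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	host, _ := os.Hostname()
	for {
		cpuPct, err := cpu.Percent(time.Second, false) // aggregate over all cores
		if err != nil {
			log.Fatal(err)
		}
		vm, err := mem.VirtualMemory()
		if err != nil {
			log.Fatal(err)
		}

		payload, _ := json.Marshal(utilisation{Host: host, CPU: cpuPct[0], RAM: vm.UsedPercent})
		// Subject name is illustrative; anything per-server works.
		if err := nc.Publish("metrics.servers."+host, payload); err != nil {
			log.Fatal(err)
		}
		time.Sleep(10 * time.Second)
	}
}
```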

The logic is:

- If a server is below 10% utilisation, put it into "blocked" mode, block new job allocations, and move any long-running jobs off to servers with 50 to 80% utilisation. It's essentially rebalancing, so we can kill servers.
- If a server has more than 80% utilisation, start a new server and let it take jobs off the queue.
- On a new job, find the server with the lowest utilisation that is not "blocked".
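A literal translation of those rules into Go, just to make the thresholds concrete; the `server` type and the block/scale decisions are placeholders, not an existing API:

```go
package main

import "fmt"

type server struct {
	Name        string
	Utilisation float64 // 0-100 percent
	Blocked     bool
}

// rebalance applies the first two rules: block servers under 10%
// utilisation (their long-running jobs would be moved to servers in the
// 50-80% band so they can be killed), and report whether any server is
// over 80% so a new machine should be started.
func rebalance(servers []*server) (needNewServer bool) {
	for _, s := range servers {
		switch {
		case s.Utilisation < 10:
			s.Blocked = true
		case s.Utilisation > 80:
			needNewServer = true
		}
	}
	return needNewServer
}

// pickServer implements "on new job, find the server with the lowest
// utilisation that is not blocked".
func pickServer(servers []*server) *server {
	var best *server
	for _, s := range servers {
		if s.Blocked {
			continue
		}
		if best == nil || s.Utilisation < best.Utilisation {
			best = s
		}
	}
	return best
}

func main() {
	servers := []*server{
		{Name: "a", Utilisation: 5},
		{Name: "b", Utilisation: 65},
		{Name: "c", Utilisation: 90},
	}
	fmt.Println("need new server:", rebalance(servers))
	fmt.Println("next job goes to:", pickServer(servers).Name)
}
```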