Open stgraber opened 8 months ago
One approach would be to only perform a single migration per cycle, while also preventing an instance from being moved again for a number of cycles since its last move.
What does "cycle" refer to in this context?
A cycle would be a scheduled task inside Incus basically.
The admin would basically instruct Incus to consider automatic balancing every 15min or every hour or every 3 hours, whatever makes sense based on their environment.
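For illustration, here is a minimal sketch (hypothetical names, not actual Incus code) of what such a "cycle" amounts to: a recurring task whose interval is re-read from configuration on every tick, so a change from 15 minutes to 3 hours takes effect without a restart.

```go
package rebalance

import (
	"context"
	"log"
	"time"
)

// getRebalanceInterval stands in for reading the (hypothetical) cluster
// configuration key that controls how often balancing is considered.
func getRebalanceInterval() time.Duration {
	return 15 * time.Minute
}

// runRebalanceCycles runs balanceOnce on every cycle until the context
// is cancelled.
func runRebalanceCycles(ctx context.Context, balanceOnce func(context.Context)) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(getRebalanceInterval()):
			log.Println("Considering automatic re-balancing")
			balanceOnce(ctx)
		}
	}
}
```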
Hello!
My partners and I are currently studying virtualization at UT Austin and would like to gain more experience contributing to open-source repos like this one. We are wondering if this issue is still open for someone to work on? If so, we would love to take a chance and work on it, as it seems very interesting to us! No worries if not!
Thank you so much!
Hello @stgraber!
My team just had a discussion about how to approach this ticket and we are wondering if you would mind providing tips, suggestions, and/or clarifications on our approach?
Here is our general logic: we plan to call `instancePlacementRun` from the `loop` function in `task.go`, as we saw that this function matches your definition of "cycle" above in your discussion with dontlaugh (but let us know if there is some misunderstanding). First, we determine whether the instance is easily migratable (as defined in your specs: no local storage devices, etc.; if there are any stricter requirements on "easily migratable", we would love to take those into account as well), and if it is, we call `instancePlacementRun`. We are assuming there is also a function that can give us the local storage / local device information for a particular instance. We are also thinking of adding a bool field to the `Task` struct to give the user the power to turn live migration on/off, as well as an integer value called `coolDown` to mitigate flip-flopping between servers (a rough sketch of that idea follows below). Every time `loop` is called, we decrement `coolDown` by 1; when it reaches 0 and the user has enabled automatic live migration, we execute the above logic. Does this sound like we are on the right track?
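As a rough sketch of the cooldown idea described above (hypothetical types, not tied to Incus's actual task code), the flip-flop protection boils down to a per-instance counter decremented once per cycle:

```go
package rebalance

// cooldownTracker is a hypothetical helper illustrating the per-instance
// cooldown described above: an instance becomes eligible for migration
// again only after a number of cycles have elapsed since its last move.
type cooldownTracker struct {
	remaining map[string]int // instance name -> cycles left
	cooldown  int            // cycles to wait after a move
}

func newCooldownTracker(cooldown int) *cooldownTracker {
	return &cooldownTracker{remaining: map[string]int{}, cooldown: cooldown}
}

// tick is called once per cycle and decrements every counter.
func (c *cooldownTracker) tick() {
	for name, left := range c.remaining {
		if left <= 1 {
			delete(c.remaining, name)
		} else {
			c.remaining[name] = left - 1
		}
	}
}

// eligible reports whether an instance may be moved this cycle.
func (c *cooldownTracker) eligible(name string) bool {
	_, waiting := c.remaining[name]
	return !waiting
}

// moved records that an instance was just migrated.
func (c *cooldownTracker) moved(name string) {
	c.remaining[name] = c.cooldown
}
```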
Thank you so much for your help!
Hey @stgraber! We just wanted to follow up to see if you had any questions or feedback on our approach before we started implementing it.
I don't think it really makes sense for this to be a scriptlet. Scriptlets are there for cases where we want the administrator to have a way to customize our internal logic, like is done with the placement logic.
Instead here, what we need really is:
- `cluster.rebalance.frequency` => How often to consider re-balancing things
- `cluster.rebalance.threshold` => Load difference between most and least busy server needed to trigger a migration (default 20%)
- `cluster.rebalance.cooldown` => Amount of time during which an instance will not be moved again (default 1h)
- `cluster.rebalance.batch` => Maximum number of instances to move during one re-balancing run

I'd recommend you start by doing the paperwork stuff, so basically:
- `api: cluster_rebalance` (changes to doc/api-extensions.md and internal/version/api.go)
- `incusd/cluster/config: Add cluster re-balance configuration keys` (changes to internal/server/cluster/config/config.go)
- `doc: Update configs` (result of `make update-metadata`)
- `incusd: Add cluster rebalance task` (look at `autoHealClusterTask` for an example)

That last commit is going to be the big one as far as logic goes, but you can slowly grow it as you go, starting with just logging something to say the task would run, then growing that to include details about all servers and load, then their instances, and eventually what moves would happen.
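A possible starting point for that last commit could look like the sketch below. It is purely illustrative: the member/load gathering, the threshold, and the batch size are hypothetical placeholders for the `cluster.rebalance.*` keys above, and the real task would be registered alongside `autoHealClusterTask`.

```go
package rebalance

import (
	"context"
	"log"
	"sort"
)

// memberLoad is a hypothetical summary of one cluster member's load.
type memberLoad struct {
	Name       string
	MemoryUsed float64 // fraction of total memory in use, 0..1
}

// rebalanceOnce sketches a single re-balancing run: gather per-member
// load, check the threshold, then propose at most `batch` moves from the
// busiest to the least busy member.
func rebalanceOnce(ctx context.Context, members []memberLoad, threshold float64, batch int) {
	if len(members) < 2 {
		return
	}

	// Sort members from least to most loaded.
	sort.Slice(members, func(i, j int) bool {
		return members[i].MemoryUsed < members[j].MemoryUsed
	})

	least, most := members[0], members[len(members)-1]
	if most.MemoryUsed-least.MemoryUsed < threshold {
		log.Println("Load difference below threshold, nothing to do")
		return
	}

	// A real implementation would now select eligible instances on the
	// busiest member (live-migratable, off cooldown) and move up to
	// `batch` of them towards the least busy member.
	log.Printf("Would move up to %d instances from %s to %s", batch, most.Name, least.Name)
}
```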
There are a few things to be careful about:
Hello @stgraber!
My group has been working on this issue and we have gotten to the part where we are deciding which instances to migrate and actually migrating them. First, we check whether an instance from the dbCluster instances is migratable by checking `canMigrate() == "live-migrate"`; if so, we append it to a list of type `instance.Instance`. However, we are having a bit of trouble seeing how to get the resources the instance is consuming with this type: we see it originating from the `instance_interface.go` file and there weren't relevant fields in the `Instance` struct there. We do notice that `api.InstanceState` has what we are looking for, so we are wondering if there is any way to use `instance.Instance` to get `api.InstanceState`?
Another question: do you think our current way of calculating the effective score per server is good for the score balancing calculation we will do per instance? We are currently computing memory usage as (totalRAM - freeRAM) / totalRAM and CPU usage as numProc / numCores, and we fear this might be overcomplicating things a bit.
Lastly, for the migration itself we are planning to use logic similar to `migrateFunc` from `clusterStateNodePost`. Do you think that is sufficient to deal with the scheduler validation issue you mentioned, or do we need to handle it explicitly? We have looked in `instance_placement.go` and `scheduler.go` for this but we don't think they are relevant here.
We tried to implement this in the forked repo https://github.com/sophiezhangg/cs360v-incus/tree/cs360v-automigration, in `func autoClusterRebalanceTask` in `api_cluster.go`, and are wondering if you have any suggestions or advice for us? We would greatly appreciate it!
Thank you so much for your help!
> My group has been working on this issue and we have gotten to the part where we are deciding which instances to migrate and actually migrating them. First, we check whether an instance from the dbCluster instances is migratable by checking `canMigrate() == "live-migrate"`; if so, we append it to a list of type `instance.Instance`. However, we are having a bit of trouble seeing how to get the resources the instance is consuming with this type: we see it originating from the `instance_interface.go` file and there weren't relevant fields in the `Instance` struct there. We do notice that `api.InstanceState` has what we are looking for, so we are wondering if there is any way to use `instance.Instance` to get `api.InstanceState`?
You should be able to parse `limits.cpu` for the number of CPUs used by the VM and `limits.memory` for the amount of memory it's allowed to consume (use `ParseByteSizeString`).
That's going to be fine for now and will avoid migrating away an instance which is just booting up but will soon consume a lot more CPU/memory.
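As a sketch of that parsing step, assuming the limits come from the instance's expanded config map and that `ParseByteSizeString` refers to the byte-size parser in Incus's shared `units` package (import path from memory, so treat it as an assumption):

```go
package rebalance

import (
	"strconv"

	"github.com/lxc/incus/v6/shared/units"
)

// instanceLimits extracts the configured CPU and memory limits from an
// instance's expanded config. Sketch only: for VMs limits.cpu is a plain
// count, and percentage-based limits.memory values are not handled here.
func instanceLimits(config map[string]string) (cpus int, memoryBytes int64, err error) {
	if v := config["limits.cpu"]; v != "" {
		cpus, err = strconv.Atoi(v)
		if err != nil {
			return 0, 0, err
		}
	}

	if v := config["limits.memory"]; v != "" {
		memoryBytes, err = units.ParseByteSizeString(v)
		if err != nil {
			return 0, 0, err
		}
	}

	return cpus, memoryBytes, nil
}
```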
> Another question: do you think our current way of calculating the effective score per server is good for the score balancing calculation we will do per instance? We are currently computing memory usage as (totalRAM - freeRAM) / totalRAM and CPU usage as numProc / numCores, and we fear this might be overcomplicating things a bit.
Should be fine for the memory percentage. For CPU, I'd do load-average / total CPU count, but we don't currently have the load-average information exposed so that's going to be a bit difficult. I guess for now you can proceed with just looking at memory and I'll be opening another issue to track adding system load to the resources API so we can make use of that.
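A minimal version of that per-server score, memory only for now as suggested (parameter names are hypothetical):

```go
package rebalance

// memoryScore returns the fraction of memory in use on a server,
// following the (totalRAM - freeRAM) / totalRAM formula discussed above.
// It returns 0 for a server reporting no memory to avoid dividing by zero.
//
// Once load-average is exposed through the resources API, a CPU score of
// loadAverage / float64(cpuCount) could be blended in here.
func memoryScore(totalBytes, freeBytes uint64) float64 {
	if totalBytes == 0 {
		return 0
	}

	return float64(totalBytes-freeBytes) / float64(totalBytes)
}
```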
> Lastly, for the migration itself we are planning to use logic similar to `migrateFunc` from `clusterStateNodePost`. Do you think that is sufficient to deal with the scheduler validation issue you mentioned, or do we need to handle it explicitly? We have looked in `instance_placement.go` and `scheduler.go` for this but we don't think they are relevant here.
That should be fine. As we only do live-migration of VMs here, the logic should be a bit simpler than all the cases we have with evacuation.
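For reference, the client-side shape of such a live move between cluster members looks roughly like the sketch below, using the public Incus Go client. This is only an assumption-laden illustration; inside the daemon the existing `migrateFunc` path from `clusterStateNodePost` would be reused instead.

```go
package rebalance

import (
	incus "github.com/lxc/incus/v6/client"
	"github.com/lxc/incus/v6/shared/api"
)

// liveMoveInstance sketches a live migration of a VM to another cluster
// member via the Go client: the request is targeted at the destination
// member and the migration is flagged as live.
func liveMoveInstance(c incus.InstanceServer, name string, targetMember string) error {
	// Direct the request at the destination cluster member.
	dest := c.UseTarget(targetMember)

	op, err := dest.MigrateInstance(name, api.InstancePost{
		Name:      name,
		Migration: true,
		Live:      true,
	})
	if err != nil {
		return err
	}

	return op.Wait()
}
```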
> We tried to implement this in the forked repo https://github.com/sophiezhangg/cs360v-incus/tree/cs360v-automigration, in `func autoClusterRebalanceTask` in `api_cluster.go`, and are wondering if you have any suggestions or advice for us? We would greatly appreciate it!
General structure looks good.
Thanks for the response! I just wanted to follow up on our first question, where we were wondering how we should best get the `api.Instance` structs for retrieving the instance resources. Currently, we are getting a list of `dbCluster.Instance` from `dbCluster.GetInstances` and converting it into a list of `instance.Instance` using `LoadByProjectAndName`, but the `ParseByteSizeString` method that you are suggesting requires an `api.InstanceFull` as part of the argument.
We are unsure of how to convert from an `instance.Instance` to an `api.InstanceFull`, and we were wondering if you had any suggestions. Perhaps we should be retrieving the instances in a different way?
You should be able to do `inst.ExpandedConfig()["limits.memory"]` to get the memory limit from your `instance.Instance`.
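Putting that together with the earlier parsing suggestion, a sketch of summing configured memory across candidate instances might look like the following (the small interface is hypothetical and only captures the one method this sketch relies on; the real `instance.Instance` interface is much larger):

```go
package rebalance

import (
	"github.com/lxc/incus/v6/shared/units"
)

// expandedConfigGetter is the small slice of instance.Instance this
// sketch relies on.
type expandedConfigGetter interface {
	ExpandedConfig() map[string]string
}

// totalConfiguredMemory sums the limits.memory values of the given
// instances, skipping any instance without a parseable limit.
func totalConfiguredMemory(instances []expandedConfigGetter) int64 {
	var total int64

	for _, inst := range instances {
		limit := inst.ExpandedConfig()["limits.memory"]
		if limit == "" {
			continue
		}

		bytes, err := units.ParseByteSizeString(limit)
		if err != nil {
			continue // e.g. percentage-based limits are ignored in this sketch
		}

		total += bytes
	}

	return total
}
```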
When running a cluster with one or more virtual machines that are capable of being live-migrated across the cluster, we should be able to use that to better spread the load by evaluating server load across the cluster and deciding whether to automatically move some workloads to re-balance things.
We already have a lot of the right pieces in place:
We will need to think through ways to avoid instances flip-flopping between servers as well as ways to mitigate the migration itself causing significant load difference on both the source and target server.
One approach would be to only perform a single migration per cycle, while also preventing an instance from being moved again for a number of cycles since its last move.
Ideally we'd be leveraging calls to our existing scheduler to find new locations for existing instances, only considering instances that can be easily live-migrated (no local storage, no local devices, ...).
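To make the "easily live-migrated" criterion concrete, here is a sketch of such a filter. The types and the `isLocalDevice` predicate are hypothetical; the real checks would mirror whatever the scheduler and evacuation code already treat as migration blockers.

```go
package rebalance

// candidateInstance is a hypothetical view of an instance carrying just
// the information this filter needs.
type candidateInstance struct {
	Name            string
	IsVM            bool
	LiveMigratable  bool // e.g. derived from the instance's CanMigrate() result
	ExpandedDevices map[string]map[string]string
}

// isEasilyMigratable applies the criteria from the issue description:
// only virtual machines that support live migration and have no local
// storage or other local devices attached.
func isEasilyMigratable(inst candidateInstance, isLocalDevice func(map[string]string) bool) bool {
	if !inst.IsVM || !inst.LiveMigratable {
		return false
	}

	for _, dev := range inst.ExpandedDevices {
		if isLocalDevice(dev) {
			return false
		}
	}

	return true
}
```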