Open stgraber opened 8 months ago
One approach would be to only perform a single migration per cycle, while also preventing an instance from being moved again for a number of cycles since its last move.
What does "cycle" refer to in this context?
A cycle would be a scheduled task inside Incus basically.
The admin would basically instruct Incus to consider automatic balancing every 15min or every hour or every 3 hours, whatever makes sense based on their environment.
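For illustration, here is a minimal sketch (hypothetical names, not actual Incus code) of what such a "cycle" amounts to: a recurring task whose interval is re-read from configuration on every tick, so a change from 15 minutes to 3 hours takes effect without a restart.

```go
package rebalance

import (
	"context"
	"log"
	"time"
)

// getRebalanceInterval stands in for reading the (hypothetical) cluster
// configuration key that controls how often balancing is considered.
func getRebalanceInterval() time.Duration {
	return 15 * time.Minute
}

// runRebalanceCycles runs balanceOnce on every cycle until the context
// is cancelled.
func runRebalanceCycles(ctx context.Context, balanceOnce func(context.Context)) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(getRebalanceInterval()):
			log.Println("Considering automatic re-balancing")
			balanceOnce(ctx)
		}
	}
}
```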
Hello!
My partners and I are currently studying virtualization at UT Austin and would like to gain more experience contributing to open-source repos like this one. We are wondering if this issue is still open for someone to work on? If so, we would love to take a chance and work on it, as it seems very interesting to us! No worries if not!
Thank you so much!
Hello @stgraber!
My team just had a discussion about how to approach this ticket and we are wondering if you would mind providing tips, suggestions, and/or clarifications on our approach?
Here is our general logic: we plan to call `instancePlacementRun` from the `loop` function in `task.go`, as we saw that this function matches your definition of "cycle" above in your discussion with dontlaugh (but let us know if there is some misunderstanding). First, we determine whether the instance is easily migratable (as defined in your specs: no local storage devices, etc.; if there are any stricter requirements on "easily migratable", we would love to take those into account as well), and if it is, we call `instancePlacementRun`. We are assuming there is also a function that can give us the local storage / local device information for a particular instance. We are also thinking of adding a bool field to the `Task` struct to give the user the power to turn live migration on/off, as well as an integer value called `coolDown` to mitigate flip-flopping between servers (a rough sketch of that idea follows below). Every time `loop` is called, we decrement `coolDown` by 1; when it reaches 0 and the user has enabled automatic live migration, we execute the above logic. Does this sound like we are on the right track?
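As a rough sketch of the cooldown idea described above (hypothetical types, not tied to Incus's actual task code), the flip-flop protection boils down to a per-instance counter decremented once per cycle:

```go
package rebalance

// cooldownTracker is a hypothetical helper illustrating the per-instance
// cooldown described above: an instance becomes eligible for migration
// again only after a number of cycles have elapsed since its last move.
type cooldownTracker struct {
	remaining map[string]int // instance name -> cycles left
	cooldown  int            // cycles to wait after a move
}

func newCooldownTracker(cooldown int) *cooldownTracker {
	return &cooldownTracker{remaining: map[string]int{}, cooldown: cooldown}
}

// tick is called once per cycle and decrements every counter.
func (c *cooldownTracker) tick() {
	for name, left := range c.remaining {
		if left <= 1 {
			delete(c.remaining, name)
		} else {
			c.remaining[name] = left - 1
		}
	}
}

// eligible reports whether an instance may be moved this cycle.
func (c *cooldownTracker) eligible(name string) bool {
	_, waiting := c.remaining[name]
	return !waiting
}

// moved records that an instance was just migrated.
func (c *cooldownTracker) moved(name string) {
	c.remaining[name] = c.cooldown
}
```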
Thank you so much for your help!
Hey @stgraber! We just wanted to follow up to see if you had any questions or feedback on our approach before we started implementing it.
I don't think it really makes sense for this to be a scriptlet. Scriptlets are there for cases where we want the administrator to have a way to customize our internal logic, like is done with the placement logic.
Instead here, what we need really is:
- `cluster.rebalance.frequency` => How often to consider re-balancing things
- `cluster.rebalance.threshold` => Load difference between most and least busy server needed to trigger a migration (default 20%)
- `cluster.rebalance.cooldown` => Amount of time during which an instance will not be moved again (default 1h)
- `cluster.rebalance.batch` => Maximum number of instances to move during one re-balancing run

I'd recommend you start by doing the paperwork stuff, so basically:
- `api: cluster_rebalance` (changes to doc/api-extensions.md and internal/version/api.go)
- `incusd/cluster/config: Add cluster re-balance configuration keys` (changes to internal/server/cluster/config/config.go)
- `doc: Update configs` (result of `make update-metadata`)
- `incusd: Add cluster rebalance task` (look at `autoHealClusterTask` for an example)

That last commit is going to be the big one as far as logic goes, but you can slowly grow it as you go, starting with just logging something to say the task would run, then growing that to include details about all servers and load, then their instances, and eventually what moves would happen.
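A possible starting point for that last commit could look like the sketch below. It is purely illustrative: the member/load gathering, the threshold, and the batch size are hypothetical placeholders for the `cluster.rebalance.*` keys above, and the real task would be registered alongside `autoHealClusterTask`.

```go
package rebalance

import (
	"context"
	"log"
	"sort"
)

// memberLoad is a hypothetical summary of one cluster member's load.
type memberLoad struct {
	Name       string
	MemoryUsed float64 // fraction of total memory in use, 0..1
}

// rebalanceOnce sketches a single re-balancing run: gather per-member
// load, check the threshold, then propose at most `batch` moves from the
// busiest to the least busy member.
func rebalanceOnce(ctx context.Context, members []memberLoad, threshold float64, batch int) {
	if len(members) < 2 {
		return
	}

	// Sort members from least to most loaded.
	sort.Slice(members, func(i, j int) bool {
		return members[i].MemoryUsed < members[j].MemoryUsed
	})

	least, most := members[0], members[len(members)-1]
	if most.MemoryUsed-least.MemoryUsed < threshold {
		log.Println("Load difference below threshold, nothing to do")
		return
	}

	// A real implementation would now select eligible instances on the
	// busiest member (live-migratable, off cooldown) and move up to
	// `batch` of them towards the least busy member.
	log.Printf("Would move up to %d instances from %s to %s", batch, most.Name, least.Name)
}
```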
There are a few things to be careful about:
Hello @stgraber!
My group has been working on this issue and we have gotten to the part where we are deciding which instances to migrate and actually migrating them. First, we check whether an instance from the dbCluster instances is migratable by checking `canMigrate() == "live-migrate"`; if so, we append it to a list of type `instance.Instance`. However, we are having a bit of trouble seeing how to get the resources the instance is consuming with this type: we see it originating from the `instance_interface.go` file and there weren't relevant fields in the `Instance` struct there. We do notice that `api.InstanceState` has what we are looking for, so we are wondering if there is any way to use `instance.Instance` to get `api.InstanceState`?
Another question: do you think our current way of calculating the effective score per server is good for the score balancing calculation we will do per instance? We are currently computing memory usage as (totalRAM - freeRAM) / totalRAM and CPU usage as numProc / numCores, and we fear this might be overcomplicating things a bit.
Lastly, for the migration itself we are planning to use logic similar to `migrateFunc` from `clusterStateNodePost`. Do you think that is sufficient to deal with the scheduler validation issue you mentioned, or do we need to handle it explicitly? We have looked in `instance_placement.go` and `scheduler.go` for this but we don't think they are relevant here.
We tried to implement this in the forked repo https://github.com/sophiezhangg/cs360v-incus/tree/cs360v-automigration, in `func autoClusterRebalanceTask` in `api_cluster.go`, and are wondering if you have any suggestions or advice for us? We would greatly appreciate it!
Thank you so much for your help!
> My group has been working on this issue and we have gotten to the part where we are deciding which instances to migrate and actually migrating them. First, we check whether an instance from the dbCluster instances is migratable by checking `canMigrate() == "live-migrate"`; if so, we append it to a list of type `instance.Instance`. However, we are having a bit of trouble seeing how to get the resources the instance is consuming with this type: we see it originating from the `instance_interface.go` file and there weren't relevant fields in the `Instance` struct there. We do notice that `api.InstanceState` has what we are looking for, so we are wondering if there is any way to use `instance.Instance` to get `api.InstanceState`?
You should be able to parse `limits.cpu` for the number of CPUs used by the VM and `limits.memory` for the amount of memory it's allowed to consume (use `ParseByteSizeString`).
That's going to be fine for now and will avoid migrating away an instance which is just booting up but will soon consume a lot more CPU/memory.
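As a sketch of that parsing step, assuming the limits come from the instance's expanded config map and that `ParseByteSizeString` refers to the byte-size parser in Incus's shared `units` package (import path from memory, so treat it as an assumption):

```go
package rebalance

import (
	"strconv"

	"github.com/lxc/incus/v6/shared/units"
)

// instanceLimits extracts the configured CPU and memory limits from an
// instance's expanded config. Sketch only: for VMs limits.cpu is a plain
// count, and percentage-based limits.memory values are not handled here.
func instanceLimits(config map[string]string) (cpus int, memoryBytes int64, err error) {
	if v := config["limits.cpu"]; v != "" {
		cpus, err = strconv.Atoi(v)
		if err != nil {
			return 0, 0, err
		}
	}

	if v := config["limits.memory"]; v != "" {
		memoryBytes, err = units.ParseByteSizeString(v)
		if err != nil {
			return 0, 0, err
		}
	}

	return cpus, memoryBytes, nil
}
```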
> Another question: do you think our current way of calculating the effective score per server is good for the score balancing calculation we will do per instance? We are currently computing memory usage as (totalRAM - freeRAM) / totalRAM and CPU usage as numProc / numCores, and we fear this might be overcomplicating things a bit.
Should be fine for the memory percentage. For CPU, I'd do load-average / total CPU count, but we don't currently have the load-average information exposed so that's going to be a bit difficult. I guess for now you can proceed with just looking at memory and I'll be opening another issue to track adding system load to the resources API so we can make use of that.
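A minimal version of that per-server score, memory only for now as suggested (parameter names are hypothetical):

```go
package rebalance

// memoryScore returns the fraction of memory in use on a server,
// following the (totalRAM - freeRAM) / totalRAM formula discussed above.
// It returns 0 for a server reporting no memory to avoid dividing by zero.
//
// Once load-average is exposed through the resources API, a CPU score of
// loadAverage / float64(cpuCount) could be blended in here.
func memoryScore(totalBytes, freeBytes uint64) float64 {
	if totalBytes == 0 {
		return 0
	}

	return float64(totalBytes-freeBytes) / float64(totalBytes)
}
```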
> Lastly, for the migration itself we are planning to use logic similar to `migrateFunc` from `clusterStateNodePost`. Do you think that is sufficient to deal with the scheduler validation issue you mentioned, or do we need to handle it explicitly? We have looked in `instance_placement.go` and `scheduler.go` for this but we don't think they are relevant here.
That should be fine. As we only do live-migration of VMs here, the logic should be a bit simpler than all the cases we have with evacuation.
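For reference, the client-side shape of such a live move between cluster members looks roughly like the sketch below, using the public Incus Go client. This is only an assumption-laden illustration; inside the daemon the existing `migrateFunc` path from `clusterStateNodePost` would be reused instead.

```go
package rebalance

import (
	incus "github.com/lxc/incus/v6/client"
	"github.com/lxc/incus/v6/shared/api"
)

// liveMoveInstance sketches a live migration of a VM to another cluster
// member via the Go client: the request is targeted at the destination
// member and the migration is flagged as live.
func liveMoveInstance(c incus.InstanceServer, name string, targetMember string) error {
	// Direct the request at the destination cluster member.
	dest := c.UseTarget(targetMember)

	op, err := dest.MigrateInstance(name, api.InstancePost{
		Name:      name,
		Migration: true,
		Live:      true,
	})
	if err != nil {
		return err
	}

	return op.Wait()
}
```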
> We tried to implement this in the forked repo https://github.com/sophiezhangg/cs360v-incus/tree/cs360v-automigration, in `func autoClusterRebalanceTask` in `api_cluster.go`, and are wondering if you have any suggestions or advice for us? We would greatly appreciate it!
General structure looks good.
Thanks for the response! I just wanted to follow up on our first question, where we were wondering how we should best get the `api.Instance` structs for retrieving the instance resources. Currently, we are getting a list of `dbCluster.Instance` from `dbCluster.GetInstances` and converting it into a list of `instance.Instance` using `LoadByProjectAndName`, but the `ParseByteSizeString` method that you are suggesting requires an `api.InstanceFull` as part of the argument.
We are unsure of how to convert from an `instance.Instance` to an `api.InstanceFull`, and we were wondering if you had any suggestions. Perhaps we should be retrieving the instances in a different way?
You should be able to do `inst.ExpandedConfig()["limits.memory"]` to get the memory limit from your `instance.Instance`.
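Putting that together with the earlier parsing suggestion, a sketch of summing configured memory across candidate instances might look like the following (the small interface is hypothetical and only captures the one method this sketch relies on; the real `instance.Instance` interface is much larger):

```go
package rebalance

import (
	"github.com/lxc/incus/v6/shared/units"
)

// expandedConfigGetter is the small slice of instance.Instance this
// sketch relies on.
type expandedConfigGetter interface {
	ExpandedConfig() map[string]string
}

// totalConfiguredMemory sums the limits.memory values of the given
// instances, skipping any instance without a parseable limit.
func totalConfiguredMemory(instances []expandedConfigGetter) int64 {
	var total int64

	for _, inst := range instances {
		limit := inst.ExpandedConfig()["limits.memory"]
		if limit == "" {
			continue
		}

		bytes, err := units.ParseByteSizeString(limit)
		if err != nil {
			continue // e.g. percentage-based limits are ignored in this sketch
		}

		total += bytes
	}

	return total
}
```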
When running a cluster with one or more virtual machines that are capable of being live-migrated across the cluster, we should be able to use that to better spread the load by evaluating server load across the cluster and deciding whether to automatically move some workloads to re-balance things.
We already have a lot of the right pieces in place:
We will need to think through ways to avoid instances flip-flopping between servers as well as ways to mitigate the migration itself causing significant load difference on both the source and target server.
One approach would be to only perform a single migration per cycle, while also preventing an instance from being moved again for a number of cycles since its last move.
Ideally we'd be leveraging calls to our existing scheduler to find new locations for existing instances, only considering instances that can be easily live-migrated (no local storage, no local devices, ...).
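To make the "easily live-migrated" criterion concrete, here is a sketch of such a filter. The types and the `isLocalDevice` predicate are hypothetical; the real checks would mirror whatever the scheduler and evacuation code already treat as migration blockers.

```go
package rebalance

// candidateInstance is a hypothetical view of an instance carrying just
// the information this filter needs.
type candidateInstance struct {
	Name            string
	IsVM            bool
	LiveMigratable  bool // e.g. derived from the instance's CanMigrate() result
	ExpandedDevices map[string]map[string]string
}

// isEasilyMigratable applies the criteria from the issue description:
// only virtual machines that support live migration and have no local
// storage or other local devices attached.
func isEasilyMigratable(inst candidateInstance, isLocalDevice func(map[string]string) bool) bool {
	if !inst.IsVM || !inst.LiveMigratable {
		return false
	}

	for _, dev := range inst.ExpandedDevices {
		if isLocalDevice(dev) {
			return false
		}
	}

	return true
}
```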