Feature: Balance based on assigned resources instead of current usage

daanbosch commented 2 months ago

Overview

For my use case, virtual machines (VMs) often exhibit bursty behavior, and moving them is not always feasible due to business constraints. Therefore, I request the ability to balance load based on the assigned CPU and memory resources instead of the current usage metrics.

Task

Implement functionality in Proxmox that allows load balancing to consider the assigned CPU and memory resources for VMs, rather than relying solely on current usage values.

Modify the load balancing algorithm to incorporate the assigned CPU and memory resources of VMs. Ensure the algorithm can dynamically allocate VMs to hosts based on these assigned resource values.

Configuration Options:
Provide configuration settings to toggle between using current usage and assigned resource values for load balancing.
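
To make the request concrete, here is a minimal sketch of the difference between the two views. It is not ProxLB code: the `VM` class, the `node_memory_load` function, the mode names and the sample values are all hypothetical and only illustrate why bursty VMs look small by usage but large by assignment.

```python
from dataclasses import dataclass


@dataclass
class VM:
    # Hypothetical record, not a ProxLB or Proxmox type.
    name: str
    node: str
    mem_used: float      # GiB currently in use
    mem_assigned: float  # GiB configured for the VM


def node_memory_load(vms, mode="used"):
    """Sum memory per node, either by current usage or by assigned size."""
    if mode not in ("used", "assigned"):
        raise ValueError(f"unknown mode: {mode}")
    totals = {}
    for vm in vms:
        value = vm.mem_used if mode == "used" else vm.mem_assigned
        totals[vm.node] = totals.get(vm.node, 0.0) + value
    return totals


# Illustrative values only: two bursty VMs that are idle right now.
vms = [
    VM("burst-a", "node1", mem_used=1.0, mem_assigned=16.0),
    VM("burst-b", "node1", mem_used=1.5, mem_assigned=16.0),
    VM("steady-c", "node2", mem_used=6.0, mem_assigned=8.0),
]

by_usage = node_memory_load(vms, "used")
by_assignment = node_memory_load(vms, "assigned")
print(by_usage)       # node2 looks busier
print(by_assignment)  # node1 carries far more assigned memory
```

By usage, node2 appears to be the loaded node. By assignment, node1 is the one at risk once both bursty VMs become active.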

gyptazy commented 2 months ago

Hey @daanbosch,

thanks for this feature request. I will check how many changes are required to implement this and whether it is doable for release 1.0.0 or 1.1.0. I will update this request soon with more information.

Thanks, gyptazy

gyptazy commented 2 months ago

Hey @daanbosch

VM Rebalancing by Total Value

The new mode parameter, which is set in the config file, defines whether rebalancing is done by used (default) or total resources.

This is currently available in PR #19 and should be merged soon. It will ship with release 1.0.0.

@daanbosch Can you please give it a try and let me know if I fully understood your request for this feature? Thanks!

Cheers, gyptazy

daanbosch commented 2 months ago

Oh amazing! Going to test this right away!

daanbosch commented 2 months ago

Hmm, the numbers I'm getting are pretty odd:

<6> ProxLB: Info: [logger]: Logger verbosity got updated to: INFO.
<4> ProxLB: Warning: [api-connection]: API connection does not verify SSL certificate.
<6> ProxLB: Info: [api-connection]: API connection succeeded to host: <redacted>.
<6> ProxLB: Info: [node-statistics]: Added node node2.
<6> ProxLB: Info: [node-statistics]: Added node node1.
<6> ProxLB: Info: [node-statistics]: Added node node3.
<6> ProxLB: Info: [node-statistics]: Created node statistics.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM comment from API.
<6> ProxLB: Info: [vm-statistics]: Added vm testproxlb2.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM comment from API.
<6> ProxLB: Info: [vm-statistics]: Added vm testproxlb3.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM comment from API.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM comment from API.
<6> ProxLB: Info: [vm-statistics]: Added vm testproxlb.
<6> ProxLB: Info: [vm-statistics]: Created VM statistics.
<6> ProxLB: Info: [rebalancing-calculator]: Rebalancing will be done for method: memory.
<6> ProxLB: Info: [rebalancing-calculator]: Rebalancing will be done by: total resources.
<6> ProxLB: Info: [rebalancing-calculator]: Balanciness is set to: 1.
<6> ProxLB: Info: [balancing-method-validation]]: Valid balancing method: memory
<6> ProxLB: Info: [balanciness-validation]: Rebalancing is for memory is not needed. Highest usage: 98% | Lowest usage: 98
<6> ProxLB: Info: [rebalancing-calculator]: Balancing calculations done.
<6> ProxLB: Info: [rebalancing-executor]: Starting dry-run to rebalance vms to their new nodes.
<6> ProxLB: Info: [rebalancing-executor]: No rebalancing needed according to the defined balanciness.
No rebalancing needed according to the defined balanciness.
<6> ProxLB: Info: [post-validations]: All post-validations succeeded.
<6> ProxLB: Info: [daemon]: Not running in daemon mode. Quitting.

Settings:

[proxmox]
api_host: <redacted>
api_user: <redacted>
api_pass: <redacted>
verify_ssl: 0
[balancing]
method: memory
ignore_nodes: none
ignore_vms: none
balanciness: 1
mode: total
[service]
daemon: 0
schedule: 24
log_verbosity: INFO
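
For readers who want to check how such a file is interpreted, the following sketch reads the [balancing] section with Python's standard configparser, which accepts the `key: value` form used above. Whether ProxLB itself parses the file this way is an assumption. The fallback of `used` for mode follows the default mentioned earlier in this thread, and the validation of allowed values is hypothetical.

```python
import configparser

# Subset of the settings shown above, embedded so the sketch is self-contained.
CONFIG_TEXT = """
[balancing]
method: memory
ignore_nodes: none
ignore_vms: none
balanciness: 1
mode: total
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
balancing = parser["balancing"]

method = balancing.get("method")
mode = balancing.get("mode", fallback="used")  # "used" is the stated default
balanciness = balancing.getint("balanciness")

# Hypothetical validation: reject anything other than the two documented modes.
if mode not in ("used", "total"):
    raise ValueError(f"unsupported mode: {mode}")

print(method, mode, balanciness)
```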

Also tried it with CPU:

<6> ProxLB: Info: [logger]: Logger verbosity got updated to: INFO.
<4> ProxLB: Warning: [api-connection]: API connection does not verify SSL certificate.
<6> ProxLB: Info: [api-connection]: API connection succeeded to host: <redacted>.
<6> ProxLB: Info: [node-statistics]: Added node node3.
<6> ProxLB: Info: [node-statistics]: Added node node1.
<6> ProxLB: Info: [node-statistics]: Added node node2.
<6> ProxLB: Info: [node-statistics]: Created node statistics.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM comment from API.
<6> ProxLB: Info: [vm-statistics]: Added vm testproxlb.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM comment from API.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM comment from API.
<6> ProxLB: Info: [vm-statistics]: Added vm testproxlb3.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM comment from API.
<6> ProxLB: Info: [vm-statistics]: Added vm testproxlb2.
<6> ProxLB: Info: [vm-statistics]: Created VM statistics.
<6> ProxLB: Info: [rebalancing-calculator]: Rebalancing will be done for method: cpu.
<6> ProxLB: Info: [rebalancing-calculator]: Rebalancing will be done by: total resources.
<6> ProxLB: Info: [rebalancing-calculator]: Balanciness is set to: 1.
<6> ProxLB: Info: [balancing-method-validation]]: Valid balancing method: cpu
<6> ProxLB: Info: [balanciness-validation]: Rebalancing is for cpu is not needed. Highest usage: 100% | Lowest usage: 100
<6> ProxLB: Info: [rebalancing-calculator]: Balancing calculations done.
<6> ProxLB: Info: [rebalancing-executor]: Starting dry-run to rebalance vms to their new nodes.
<6> ProxLB: Info: [rebalancing-executor]: No rebalancing needed according to the defined balanciness.
No rebalancing needed according to the defined balanciness.
<6> ProxLB: Info: [post-validations]: All post-validations succeeded.
<6> ProxLB: Info: [daemon]: Not running in daemon mode. Quitting.

VMs:

+-----------+------+-------------+---------+-------+--------+---------+-------+--------+-----------+------------+------------+----------------+--------------+------------+------+---------+---------+-------------+------+
| id        | type | cgroup-mode | content |   cpu |   disk | hastate | level | maxcpu |   maxdisk |     maxmem |        mem | name           | node         | plugintype | pool | status  | storage |      uptime | vmid |
+===========+======+=============+=========+=======+========+=========+=======+========+===========+============+============+================+==============+============+======+=========+=========+=============+======+
| qemu/100  | qemu |             |         | 0.04% | 0.00 B |         |       |     10 |  2.20 GiB | 195.78 GiB | 819.39 MiB | testproxlb     | node1        |            |      | running |         | 22h 44m 37s |  100 |
+-----------+------+-------------+---------+-------+--------+---------+-------+--------+-----------+------------+------------+----------------+--------------+------------+------+---------+---------+-------------+------+
| qemu/101  | qemu |             |         | 0.12% | 0.00 B |         |       |      5 | 50.00 GiB | 195.78 GiB | 772.17 MiB | testproxlb2    | node1        |            |      | running |         |      4m 38s |  101 |
+-----------+------+-------------+---------+-------+--------+---------+-------+--------+-----------+------------+------------+----------------+--------------+------------+------+---------+---------+-------------+------+
| qemu/102  | qemu |             |         | 0.03% | 0.00 B |         |       |     12 |  2.20 GiB | 195.78 GiB | 836.94 MiB | testproxlb3    | node1        |            |      | running |         | 22h 44m 30s |  102 |
+-----------+------+-------------+---------+-------+--------+---------+-------+--------+-----------+------------+------------+----------------+--------------+------------+------+---------+---------+-------------+------+
| qemu/9000 | qemu |             |         | 0.00% | 0.00 B |         |       |      1 |  2.20 GiB |   2.00 GiB |     0.00 B | focal-template | node1        |            |      | stopped |         |          0s | 9000 |
+-----------+------+-------------+---------+-------+--------+---------+-------+--------+-----------+------------+------------+----------------+--------------+------------+------+---------+---------+-------------+------+
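
As a worked example, the assigned totals implied by this table can be added up per node. The sketch below copies maxcpu and maxmem by hand from the rows above. Skipping the stopped template is an assumption about what a balancer should count, and the dictionary layout and the `assigned_per_node` helper are hypothetical.

```python
# Values copied from the table above (maxmem in GiB).
vms = [
    {"name": "testproxlb", "node": "node1", "maxcpu": 10, "maxmem_gib": 195.78, "status": "running"},
    {"name": "testproxlb2", "node": "node1", "maxcpu": 5, "maxmem_gib": 195.78, "status": "running"},
    {"name": "testproxlb3", "node": "node1", "maxcpu": 12, "maxmem_gib": 195.78, "status": "running"},
    {"name": "focal-template", "node": "node1", "maxcpu": 1, "maxmem_gib": 2.00, "status": "stopped"},
]


def assigned_per_node(vms):
    """Sum assigned CPU cores and memory per node, counting running VMs only."""
    totals = {}
    for vm in vms:
        if vm["status"] != "running":
            continue  # assumption: stopped VMs and templates do not count
        node = totals.setdefault(vm["node"], {"maxcpu": 0, "maxmem_gib": 0.0})
        node["maxcpu"] += vm["maxcpu"]
        node["maxmem_gib"] += vm["maxmem_gib"]
    return totals


totals = assigned_per_node(vms)
for node, values in totals.items():
    print(node, values["maxcpu"], round(values["maxmem_gib"], 2))
```

With all three running VMs on node1, that node carries 27 assigned cores and roughly 587 GiB of assigned memory, while the other nodes carry none, so a balancer working on assigned values would be expected to propose moves here.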

gyptazy commented 2 months ago

Thanks, I just pushed a fix. Can you give it a try, please?

It does not make any sense to validate the current resources for balanciness when using total values: https://github.com/gyptazy/ProxLB/compare/ef60124c286d9e346690b45650700677d79a5b31..f14b94f7584377675022d740a98279a4e777d42f
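
To illustrate the point, here is a rough sketch of a balanciness check. It assumes the check compares the gap between the highest and lowest node percentage against the balanciness value, which is only inferred from the log lines earlier in this thread. The actual logic is in the linked diff. The function name and the assigned percentages are hypothetical.

```python
def needs_rebalancing(node_percentages, balanciness):
    """Return True when the gap between the highest and lowest node exceeds balanciness."""
    highest = max(node_percentages.values())
    lowest = min(node_percentages.values())
    return (highest - lowest) > balanciness


# Figures reported in the memory log above: highest and lowest are both 98.
current_view = {"node1": 98, "node2": 98, "node3": 98}

# Hypothetical percentages derived from assigned resources instead.
assigned_view = {"node1": 90, "node2": 0, "node3": 0}

print(needs_rebalancing(current_view, balanciness=1))   # check on current values
print(needs_rebalancing(assigned_view, balanciness=1))  # check on assigned values
```

Under these assumptions, the check on current values reports that nothing needs to move even though all assigned resources sit on one node, which is why the validation has to use the same kind of value as the selected mode.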

This should work, but it still requires additional changes. The current disadvantage is that it will almost always rebalance the VMs.

I need to adjust the test cluster and integrate further changes.

daanbosch commented 2 months ago

Hmm, now it wants to move every VM to node2 based on CPU in this scenario (testproxlb2 is already on node2).

            VM   Current Node   Rebalanced Node
    testproxlb   node1          node2
   testproxlb3   node1          node2

For the memory run:

            VM   Current Node   Rebalanced Node
   testproxlb2   node2          node1
   testproxlb3   node1          node2
    testproxlb   node1          node3

This would be correct, but it does not really make sense to swap testproxlb2 and testproxlb3.

However, it seems to be going in the right direction! Thanks!
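
One way to avoid pointless swaps, sketched here purely as an illustration and not as ProxLB's algorithm, is to accept a move only when it strictly narrows the gap between the most and least loaded node. The `plan_moves` function and its inputs are hypothetical. The sizes are the maxmem values from the table above, with testproxlb2 already on node2 as in the memory run.

```python
def plan_moves(placement, sizes, nodes):
    """Greedy plan: move a VM from the fullest to the emptiest node only if that narrows the gap.

    placement maps VM name to node, sizes maps VM name to its assigned size.
    Returns a list of (vm, source, destination) tuples.
    """
    placement = dict(placement)  # work on a copy
    moves = []
    while True:
        load = {node: 0 for node in nodes}
        for vm, node in placement.items():
            load[node] += sizes[vm]
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        gap = load[src] - load[dst]
        # Moving a VM of size s changes the gap to |gap - 2s|, so only 0 < s < gap helps.
        candidates = [vm for vm, node in placement.items()
                      if node == src and 0 < sizes[vm] < gap]
        if not candidates:
            return moves
        # Pick the VM that brings the two nodes closest together.
        vm = min(candidates, key=lambda name: abs(gap - 2 * sizes[name]))
        placement[vm] = dst
        moves.append((vm, src, dst))


sizes = {"testproxlb": 195.78, "testproxlb2": 195.78, "testproxlb3": 195.78}
placement = {"testproxlb": "node1", "testproxlb2": "node2", "testproxlb3": "node1"}
moves = plan_moves(placement, sizes, ["node1", "node2", "node3"])
print(moves)
```

For this scenario the sketch proposes a single move of testproxlb from node1 to node3 and leaves testproxlb2 and testproxlb3 where they are, since swapping two VMs of equal size cannot narrow the gap.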

gyptazy commented 2 months ago

> This would be correct, but it does not really make sense to swap testproxlb2 and testproxlb3. However, it seems to be going in the right direction! Thanks!

Yeah, that was what I meant with:

> This should work, but it still requires additional changes. The current disadvantage is that it will almost always rebalance the VMs.

I'll probably have a look at this on Monday.

gyptazy commented 2 months ago

Just had a look at it this morning and decided to integrate this properly. That requires more restructuring of the code than previously assumed, along with more validations, because it also already killed a node in my test cluster ;)

I'm already working on that and will push it once it is in a usable state.

gyptazy commented 2 months ago

Hey @daanbosch,

maybe you can give https://github.com/gyptazy/ProxLB/pull/23 a try when you have time. There is still a small issue: it might need to do an initial rebalance first and only works properly in the second run. This is something I'm still looking into...

Thanks, gyptazy

daanbosch commented 2 months ago

Hi @gyptazy,

I just tested #23 and it works fine for me. There are indeed some small things that keep it from taking the quickest path to the desired balance. However, it's already a great tool in its current state!

gyptazy commented 1 month ago

Hey @daanbosch,

> I just tested #23 and it works fine for me. There are indeed some small things that keep it from taking the quickest path to the desired balance. However, it's already a great tool in its current state!

Happy to hear! I'll add some more improvements asap so that this also works immediately in the first run. I encountered additional issues with the API, and I can only rely on the (updated) information in the API to recalculate the best placement for VMs. You might also see a race condition: when retriggering the command too quickly, you can get inconsistent or outdated data from the API. Because ProxLB works statelessly, this is an issue (maybe solvable by writing some state files to the filesystem, because I would really like to avoid using any database for this small service).
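
A state file along those lines could look roughly like the sketch below. This only illustrates the idea and is not part of ProxLB: the file name, the JSON layout, the `last_rebalance` key and the cooldown value are all assumptions. It records when the last rebalance ran so that a run triggered too soon can be skipped instead of acting on stale API data.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def load_state(path):
    """Return the saved state, or an empty dict if the file is missing or unreadable."""
    try:
        return json.loads(Path(path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_state(path, state):
    """Write the state atomically: temp file in the same directory, then rename."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_name, path)


def may_run(state, now, cooldown_seconds):
    """Allow a run only if the last rebalance is at least cooldown_seconds old."""
    last = state.get("last_rebalance")
    return last is None or now - last >= cooldown_seconds


with tempfile.TemporaryDirectory() as workdir:
    state_file = Path(workdir) / "proxlb-state.json"  # hypothetical location
    now = time.time()
    first_run_allowed = may_run(load_state(state_file), now, cooldown_seconds=300)
    save_state(state_file, {"last_rebalance": now})
    retrigger_allowed = may_run(load_state(state_file), now + 10, cooldown_seconds=300)
    print(first_run_allowed, retrigger_allowed)
```

Writing to a temporary file and renaming it keeps a crashed run from leaving a half-written state file behind.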

Cheers, gyptazy
