gyptazy / ProxLB

ProxLB - (Re)Balance VM Workloads Across Nodes in Proxmox Clusters. A Load Balancer for Proxmox - and more!
https://proxlb.de
GNU General Public License v3.0

Cannot get it to work - TimeoutError: The read operation timed out #91

Open xSaKeNx opened 1 week ago

xSaKeNx commented 1 week ago

Cannot get it to work. It fetches the data from the VMs and nodes, but then fails (same error in a dry run and a normal run):

<4> ProxLB: Warning: [node-update-statistics]: Node Node is overprovisioned for disk by 101%.
<4> ProxLB: Warning: [node-update-statistics]: Node Node is overprovisioned for disk by 151%.
<4> ProxLB: Warning: [node-update-statistics]: Node Node is overprovisioned for disk by 202%.
<6> ProxLB: Info: [node-update-statistics]: Updated node resource assignments by all VMs.
Traceback (most recent call last):
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/urllib3/connection.py", line 507, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/usr/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/socket.py", line 707, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/ssl.py", line 1252, in recv_into
    return self.read(nbytes, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/ssl.py", line 1104, in read
    return self._sslobj.read(len, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/urllib3/util/retry.py", line 474, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 538, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 369, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='pvehost(ipv6-dns)', port=8006): Read timed out. (read timeout=5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/proxlb", line 1505, in <module>
    main()
  File "/usr/bin/proxlb", line 1480, in main
    storage_statistics = get_storage_statistics(api_object)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/bin/proxlb", line 687, in get_storage_statistics
    for storage in api_object.nodes(node['node']).storage.get():
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/proxmoxer/core.py", line 167, in get
    return self(args)._request("GET", params=params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/proxmoxer/core.py", line 142, in _request
    resp = self._store["session"].request(method, url, data=data, params=params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/proxmoxer/backends/https.py", line 232, in request
    return super().request(
           ^^^^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/src/ProxLB/.venv/lib/python3.12/site-packages/requests/adapters.py", line 713, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='pvehost(ipv6-dns)', port=8006): Read timed out. (read timeout=5)

This is my proxlb.conf. Nothing special; I only added comments to make changes easier.


enable: 1
# Defines the balancing method (default: memory) where you can use memory, disk or cpu.
mode: used
# Rebalance by used resources (efficiency) or assigned (avoid overprovisioning) resources. (default: used)
type: vm
# Rebalance only vm (virtual machines), ct (containers) or all (virtual machines & containers). (default: vm)
mode_option: percent
# Rebalance by node's resources in bytes or percent. (default: bytes)
balanciness: 10
# Value of the percentage of lowest and highest resource consumption on nodes may differ before rebalancing. (default: 10)
parallel_migrations: 0
# Defines if migrations should be done in parallel or sequentially. (default: 1)
# ignore_nodes:
# Defines a comma separated list of nodes to exclude.
ignore_vms:  vm1*, vm2*
# Defines a comma separated list of VMs to exclude. (* as suffix wildcard or tags are also supported)

[storage_balancing]
enable: 0
# Enables storage balancing. (default: 0)
#balanciness: 10 # Value of the percentage of lowest and highest storage consumption may differ before rebalancing. (default: 10)
#parallel_migrations: 1 # Defines if migrations should be done parallely or sequentially. (default: 1)

[update_service]
enable: 0
# Enables the automated update service (rolling updates). (default: 0)

[api]
enable: 0
# Enables the ProxLB API.

[service]
master_only: 0
# Defines if this should only be performed (1) on the cluster master node or not (0). (default: 0)
daemon: 0
# Run as a daemon (1) or one-shot (0). (default: 1)
schedule: 24
# Hours to rebalance in hours. (default: 24)
log_verbosity: INFO
# Defines the log level (default: CRITICAL) where you can use INFO, WARN or CRITICAL
config_version: 3
# Defines the current config version schema for ProxLB

Running on the newest release (cloned yesterday).

gyptazy commented 1 week ago

Hey @xSaKeNx,

thanks for reporting. It looks like the VM objects couldn't be fetched in time: ProxLB expects to get all VM information within 5 seconds and otherwise times out.

How many VMs are you running, and how is your cluster performing? ProxLB needs to fetch the required information for all VMs before it can process the next steps and calculations. Sure, the timeout could be increased, but this would probably result in issues when migrating the VMs.

Can you please share more information regarding the cluster's utilization and the VM count?

Thanks, gyptazy

xSaKeNx commented 1 week ago

Hey, thank you for your quick response. The VM count is about 103 - 5 offline, 8 templates and 8 running LXCs. Obviously I've filtered it by ignoring VMs, but I guess that doesn't change the fetch time. 5 nodes, all similar:

CPU: 6% of 640 CPU(s)
Memory: 47% (2.30 TiB of 4.90 TiB)
Storage: 57% (77.18 TiB of 135.04 TiB)

Having no performance issues whatsoever no problems with response time or loading etc. Maybe changing the permissions to only view assigned resource pool vms could help?
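
To check response times for the exact call that fails in the traceback (`api_object.nodes(node['node']).storage.get()`), each node's storage listing could be timed separately with a generous timeout. The following is a minimal sketch, not part of ProxLB: `time_calls` and `slow_nodes` are hypothetical helper names, and the commented wiring assumes proxmoxer's `ProxmoxAPI` with placeholder credentials.

```python
import time
from typing import Callable, Dict, Tuple


def time_calls(calls: Dict[str, Callable[[], object]]) -> Dict[str, Tuple[float, str]]:
    """Run each callable once and return {name: (elapsed_seconds, outcome)}."""
    results = {}
    for name, call in calls.items():
        start = time.perf_counter()
        try:
            call()
            outcome = "ok"
        except Exception as exc:  # e.g. requests.exceptions.ReadTimeout
            outcome = type(exc).__name__
        results[name] = (time.perf_counter() - start, outcome)
    return results


def slow_nodes(results: Dict[str, Tuple[float, str]], limit_seconds: float = 5.0):
    """Return the names whose call failed or took longer than limit_seconds."""
    return sorted(
        name
        for name, (elapsed, outcome) in results.items()
        if outcome != "ok" or elapsed > limit_seconds
    )


# Hypothetical wiring against a live cluster (assumption, not executed here):
#   import proxmoxer
#   api = proxmoxer.ProxmoxAPI(host, user=user, password=password,
#                              verify_ssl=False, timeout=60)
#   calls = {n["node"]: (lambda n=n: api.nodes(n["node"]).storage.get())
#            for n in api.nodes.get()}
#   print(slow_nodes(time_calls(calls)))
```

If a single node shows up in the result, the delay is probably specific to that node's storage listing rather than to the VM count.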

gyptazy commented 1 week ago

Hey @xSaKeNx,

that's really strange, but in your attached logs you can see that the upstream libraries are raising the timeouts.

I just added a new feature with PR #92, which is already merged into main and makes the timeout configurable. The default is now 10 seconds and can be set in the config.

Would be great if you could give it a try.

Edit:

> Having no performance issues whatsoever no problems with response time or loading etc. Maybe changing the permissions to only view assigned resource pool vms could help?

Just saw this - what kind of permissions are currently granted? ProxLB requires the permissions listed at https://github.com/gyptazy/ProxLB/blob/main/docs/02_Configuration.md#required-roles. But if they didn't fit, it would throw a permission error and not run into a timeout.

Thanks, gyptazy

xSaKeNx commented 6 days ago

Hey, yes, I had already seen it and set my permissions accordingly, even full permissions everywhere. I also tried giving permission to \Nodes\ and to a defined pool \pool\testpool\ to get less data; that didn't help either. Unfortunately, even after the commit I tried it out and I'm still getting the timeout after 5 seconds, so it's somehow not working or not setting the timeout on the correct path. The error stays the same as above (urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='ipv6dns', port=8006): Read timed out. (read timeout=5)).

gyptazy commented 6 days ago

> Hey yes I have already seen it and had my permissions set accordingly even full permission everywhere. I also tried giving permission to \Nodes\ and to a defined Pool \pool\testpool\ to get less data didnt help either. Unfortunately even after commit i tried it out and still getting the timeout after 5 seconds so its somehow not working or not setting the timeout on the correct path. Error stays the same as above (urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='ipv6dns', port=8006): Read timed out. (read timeout=5)

Are you sure you're using the new version?

Can you please share the outputs of:

grep __version__ /bin/proxlb

nl /bin/proxlb | grep 281

where the last one should return:

   281          api_object = proxmoxer.ProxmoxAPI(proxmox_api_host, user=proxmox_api_user, password=proxmox_api_pass, verify_ssl=proxmox_api_ssl_v, timeout=int(proxmox_api_timeout))
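
For illustration, this is roughly how a configurable timeout can travel from the config file into the `timeout=` keyword shown on line 281. It is a sketch under assumptions, not ProxLB's actual code: the `[proxmox]` section and its key names (`api_host`, `api_user`, `api_pass`, `verify_ssl`, `timeout`) are assumed because that section is not part of the config posted above, and `api_kwargs_from_config` is a hypothetical helper.

```python
import configparser

DEFAULT_API_TIMEOUT = 10  # default mentioned in this thread for the new option

# Assumed layout of the API section; key names are illustrative only.
EXAMPLE_CONFIG = """
[proxmox]
api_host: pve.example.invalid
api_user: root@pam
api_pass: secret
verify_ssl: 0
timeout: 30
"""


def api_kwargs_from_config(text: str) -> dict:
    """Parse the config text and build keyword arguments for the API client."""
    parser = configparser.ConfigParser()  # accepts "key: value" lines by default
    parser.read_string(text)
    section = parser["proxmox"]
    return {
        "host": section["api_host"],
        "user": section["api_user"],
        "password": section["api_pass"],
        "verify_ssl": section.getboolean("verify_ssl", fallback=True),
        # Falls back to the default when the key is missing from the section.
        "timeout": section.getint("timeout", fallback=DEFAULT_API_TIMEOUT),
    }


# Hypothetical use (requires proxmoxer and a reachable host, not executed here):
#   kwargs = api_kwargs_from_config(open("/etc/proxlb/proxlb.conf").read())
#   api_object = proxmoxer.ProxmoxAPI(kwargs.pop("host"), **kwargs)
```

With a parser like this, a key that sits in a different section or is misspelled silently falls back to the default, which would look the same as the option being ignored.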

xSaKeNx commented 6 days ago

Hmm, well, I checked the code against the provided commit, but it seems I'm not on it. Even when trying to get to the release 1.0.4 branch, I only get this output:

__version__ = "1.0.3b"
281  sys.exit(2)

gyptazy commented 6 days ago

Oh, then you're even running on an older beta of 1.0.3. Where did you obtain this or how did you install it?

xSaKeNx commented 6 days ago

From git, just a few hours ago - git clone. I've also tried downloading the ZIP, but it seems to be the same version.

gyptazy commented 6 days ago

Hm, I have no idea what you're doing there. It should look like this when checking this out freshly:

% git clone https://github.com/gyptazy/ProxLB.git && grep __version__ ProxLB/proxlb
Cloning into 'ProxLB'...
remote: Enumerating objects: 423, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 423 (delta 4), reused 1 (delta 1), pack-reused 376 (from 1)
Receiving objects: 100% (423/423), 167.23 KiB | 2.29 MiB/s, done.
Resolving deltas: 100% (228/228), done.
__version__        = "1.0.4"

I'm wondering how you even ended up with a beta version. Your currently used 1.0.3b is old and buggy. I guess the current 1.0.4 solves your issues.

xSaKeNx commented 6 days ago

I've got no idea, but now at least I've got the right one. It shows exactly your output, but I'm still getting the read timeout after 5 seconds, even when setting it to 30.
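
One thing that stands out in the first traceback is that the failing script is `/usr/bin/proxlb`, while the virtualenv lives under `/home/user/src/ProxLB`, so the copy being executed may not be the freshly cloned one. A `read timeout=5` is also what one would expect if the `timeout=` argument never reaches `proxmoxer.ProxmoxAPI`. A small sketch to compare the candidate copies follows; `inspect_script` and `inspect_existing` are hypothetical helpers, and the candidate paths are only guesses.

```python
import re
from pathlib import Path

VERSION_RE = re.compile(r'^__version__\s*=\s*"([^"]+)"', re.MULTILINE)


def inspect_script(path) -> dict:
    """Report the version string and whether the API call passes a timeout."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    match = VERSION_RE.search(text)
    api_calls = [line for line in text.splitlines() if "proxmoxer.ProxmoxAPI(" in line]
    return {
        "path": str(path),
        "version": match.group(1) if match else None,
        "passes_timeout": any("timeout=" in line for line in api_calls),
    }


def inspect_existing(paths) -> list:
    """Inspect only those candidate paths that exist as regular files."""
    return [inspect_script(p) for p in paths if Path(p).is_file()]


# Hypothetical usage; adjust the paths to the copies present on the machine:
#   for report in inspect_existing(["/usr/bin/proxlb", "/bin/proxlb",
#                                   "/home/user/src/ProxLB/proxlb"]):
#       print(report)
```

If the installed copy and the cloned copy report different versions, running the cloned script by its full path would rule out the old one being picked up.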