Rolling pool update does not resume after reboot.

tuxpowered commented 5 months ago

Are you using XOA or XO from the sources?

XO from the sources

Which release channel?

None

Provide your commit number

0794a

Describe the bug

When performing a "Rolling Update" on an HA cluster, XO migrates all VMs off the primary node to the other nodes (good). The primary node then reboots; however, when it comes back online, the other nodes in the HA cluster do not proceed with downloading and applying patches.

Error message

Text From Settings > Logs:

server.enable
{
  "id": "0bce7468-93e5-4376-93c1-c75082f8f436"
}
{
  "name": "ConnectTimeoutError",
  "code": "UND_ERR_CONNECT_TIMEOUT",
  "call": {
    "method": "session.login_with_password",
    "params": "* obfuscated *"
  },
  "message": "Connect Timeout Error",
  "stack": "ConnectTimeoutError: Connect Timeout Error
    at onConnectTimeout (/opt/xen-orchestra/node_modules/undici/lib/core/connect.js:190:24)
    at /opt/xen-orchestra/node_modules/undici/lib/core/connect.js:133:46
    at Immediate._onImmediate (/opt/xen-orchestra/node_modules/undici/lib/core/connect.js:174:9)
    at processImmediate (node:internal/timers:476:21)
    at process.callbackTrampoline (node:internal/async_hooks:128:17)"
}

pool.rollingUpdate
{
  "pool": "62d8471c-e515-0d7a-d77f-5ac38a945507"
}
{
  "message": "Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart",
  "name": "Error",
  "stack": "Error: Host 1f4b8cd7-e9da-414e-8558-8059a3165b98 took too long to restart
    at Xapi.rollingPoolReboot (file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/pool.mjs:127:9)
    at Xapi.rollingPoolUpdate (file:///opt/xen-orchestra/packages/xo-server/src/xapi/mixins/patching.mjs:501:5)
    at XenServers.rollingPoolUpdate (file:///opt/xen-orchestra/packages/xo-server/src/xo-mixins/xen-servers.mjs:689:5)
    at Xo.rollingUpdate (file:///opt/xen-orchestra/packages/xo-server/src/api/pool.mjs:231:3)
    at Api.#callApiMethod (file:///opt/xen-orchestra/packages/xo-server/src/xo-mixins/api.mjs:366:20)"
}

To reproduce

Go to 'Home > Pools > Select HA Pool'
Click on 'Patches > Rolling pool Update'
See error (none displayed; review the logs)

Expected behavior

After the primary node reboots, the VMs should be migrated back, and the process should then move on to the next host and repeat.
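
To make the expected sequence concrete, here is a rough sketch of a rolling update loop. This is not the xo-server implementation: `rollingUpdate`, `evacuate`, `patch`, `reboot`, `waitUntilUp` and the `isMaster` field are all hypothetical names used only for illustration. The master-first ordering is an assumption based on the behaviour observed in this report.

```javascript
// Hypothetical sketch of a rolling update; none of these names come from xo-server.
async function rollingUpdate(hosts, { evacuate, patch, reboot, waitUntilUp }) {
  const done = []
  // Process the master first, then the remaining hosts (assumed ordering).
  const ordered = [...hosts].sort((a, b) => Number(b.isMaster) - Number(a.isMaster))
  for (const host of ordered) {
    await evacuate(host) // migrate VMs off this host
    await patch(host) // download and apply patches
    await reboot(host)
    // If this throws (for example on a timeout), the loop stops here
    // and the remaining hosts are never patched.
    await waitUntilUp(host)
    done.push(host.id)
  }
  return done
}
```

In this sketch, a failure while waiting for the first host to come back would leave every other host untouched, which matches the symptom described above.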

Screenshots

No response

Node

18.20.0

Hypervisor

8.2.1

Additional context

It appears that the HA master correctly has its VMs migrated and patches applied first. Systems all have 10GB dedicated storage and a 1GB interface for VM access and management.

Danp2 commented 5 months ago

commit number 0794a

You are about a month behind on updates. Also, have you seen the latest revisions to the documentation where it explains how to increase the timeout period? https://xen-orchestra.com/docs/manage_infrastructure.html#rolling-pool-updates-rpu
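
For readers wondering why a slow reboot produces "took too long to restart": a wait of this kind is typically a polling loop with a deadline. The sketch below is only an illustration of that pattern, not XO's code. `waitForHost`, `isHostUp`, `timeoutMs` and `intervalMs` are hypothetical names, and the default values are arbitrary rather than XO's actual settings.

```javascript
// Hypothetical helper: names and default values are illustrative, not XO's.
async function waitForHost(isHostUp, { timeoutMs = 20 * 60 * 1000, intervalMs = 5000 } = {}) {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    try {
      if (await isHostUp()) return true
    } catch (error) {
      // Connection errors are expected while the host is still rebooting.
    }
    // Wait before polling again.
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
  throw new Error('Host took too long to restart')
}
```

With a loop like this, a host that needs longer than the deadline to boot fails the whole operation even though it eventually comes back, so raising the timeout is the natural first thing to try.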

tuxpowered commented 5 months ago

Oh wow, that far behind already? It seems like I updated just a few weeks ago. I did not see the timeout update. I will update and review. It is odd, because I have two clusters: one updates fine with no problem, while the other has this issue (I just started testing the other cluster).

Danp2 commented 5 months ago

[Rolling Pool Update/Reboot] Use XO tasks for better reportability (PR #7578)

This was merged earlier today, which will make monitoring the RPU much easier.

b-Nollet commented 4 months ago

We've recently made some changes to the RPU, including a fix for a bug introduced by the release earlier this month. Can you update to the latest version and test if the problem is still present? (and provide us with the XO task logs)

vatesfr / xen-orchestra