ronivay / XenOrchestraInstallerUpdater

Xen Orchestra install/update script
GNU General Public License v3.0

XO server loses pool and hosts momentarily #234

Closed felibb closed 4 months ago

felibb commented 4 months ago

OS Version: Debian 11 Linux xo-ce 5.10.0-28-amd64 #1 SMP Debian 5.10.209-2 (2024-01-31) x86_64 GNU/Linux
Node.js version: v18.20.2
Yarn version: 1.22.19

Server specs: 2 vCPU, 4 GiB RAM

Issue: Pool and hosts seem to disappear from the web UI every 3-6 min, only to reappear automagically after exactly 1 min. I was about 47 commits behind, but updated today to d3ab7, and I observed this on both older and latest versions.

This XO server has been created from scratch to replace the original one I messed up (by a typo, locking myself out). So I created a new XO and added existing pool by adding the master host server to it, as instructed in some guide I had read. I can perform basic operations on the pool otherwise, but this behavior is pretty annoying. Don't think it happened with the old XO, but I don't have much experience, this is my first xcp-ng cluster.

Installation logfile

xo-server logs this when it loses the pool (but nothing when the pool reappears):

May 06 15:01:01 xo-ce xo-server[37148]: _watchEvents TimeoutError: operation timed out
May 06 15:01:01 xo-ce xo-server[37148]:     at Promise.timeout (/opt/xo/xo-builds/xen-orchestra-202405061450/node_modules/promise-toolbox/timeout.js:11:16)
May 06 15:01:01 xo-ce xo-server[37148]:     at Xapi.apply (file:///opt/xo/xo-builds/xen-orchestra-202405061450/packages/xen-api/index.mjs:773:37)
May 06 15:01:01 xo-ce xo-server[37148]:     at Xapi._call (/opt/xo/xo-builds/xen-orchestra-202405061450/node_modules/limit-concurrency-decorator/src/index.js:85:24)
May 06 15:01:01 xo-ce xo-server[37148]:     at Xapi._watchEvents (file:///opt/xo/xo-builds/xen-orchestra-202405061450/packages/xen-api/index.mjs:1198:31) {
May 06 15:01:01 xo-ce xo-server[37148]:   call: {
May 06 15:01:01 xo-ce xo-server[37148]:     method: 'event.from',
May 06 15:01:01 xo-ce xo-server[37148]:     params: [ [Array], '00000000000063313749,00000000000062524648', 60.1 ]
May 06 15:01:01 xo-ce xo-server[37148]:   }
May 06 15:01:01 xo-ce xo-server[37148]: }
May 06 15:04:01 xo-ce xo-server[37148]: _watchEvents TimeoutError: operation timed out
May 06 15:04:01 xo-ce xo-server[37148]:     at Promise.timeout (/opt/xo/xo-builds/xen-orchestra-202405061450/node_modules/promise-toolbox/timeout.js:11:16)
May 06 15:04:01 xo-ce xo-server[37148]:     at Xapi.apply (file:///opt/xo/xo-builds/xen-orchestra-202405061450/packages/xen-api/index.mjs:773:37)
May 06 15:04:01 xo-ce xo-server[37148]:     at Xapi._call (/opt/xo/xo-builds/xen-orchestra-202405061450/node_modules/limit-concurrency-decorator/src/index.js:85:24)
May 06 15:04:01 xo-ce xo-server[37148]:     at Xapi._watchEvents (file:///opt/xo/xo-builds/xen-orchestra-202405061450/packages/xen-api/index.mjs:1198:31) {
May 06 15:04:01 xo-ce xo-server[37148]:   call: {
May 06 15:04:01 xo-ce xo-server[37148]:     method: 'event.from',
May 06 15:04:01 xo-ce xo-server[37148]:     params: [ [Array], '00000000000063314205,00000000000062524648', 60.1 ]
May 06 15:04:01 xo-ce xo-server[37148]:   }
May 06 15:04:01 xo-ce xo-server[37148]: }
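
To check how regularly these timeouts recur, the timestamps can be pulled out of the log. This is a minimal sketch: `watch_event_timeouts` is a hypothetical helper name, and it assumes the syslog-style line layout shown above, where the third whitespace-separated field is the time.

```shell
# Hypothetical helper: read log text on stdin and print the HH:MM:SS
# of every "_watchEvents TimeoutError" line.
# Assumes the syslog-style layout shown above (time is field 3).
watch_event_timeouts() {
    grep '_watchEvents TimeoutError' | awk '{ print $3 }'
}
```

It could be fed from a saved log file, or from something like `journalctl -u xo-server | watch_event_timeouts`, assuming the service runs as a systemd unit named `xo-server` (the thread does not confirm the unit name).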
felibb commented 4 months ago

Update: I got curious and, with some effort, recovered the original XO VM. After patching it to 771b0 I see the same strange behaviour, and similar log messages.

ronivay commented 4 months ago

Hi,

Well, it reports a timeout, so there is some sort of connectivity issue between XO and the host(s). Unfortunately I don't have many tips on how to proceed. It seems like an environment-specific issue that is not directly related to XO.

felibb commented 4 months ago

Argh. I feared it would be something like that. There were no network changes that could explain this. All was working fine until last week. In fact I think it started after the server patch/update, the only significant change I made, but there are no errors in the server logs.

Some xcp-ng forum posts from 2023 talked about going to node.js v18 as a solution to a similar timeout issue, but I am already on v18. Might try asking there anyway.

vferrandobe commented 4 months ago

Hi. I can't provide much information, but I had a similar problem. After running the script yesterday and doing the upgrade, hosts started to misbehave. And after installing patches on the pool master, the other 2 hosts appeared as disabled. Only the master was able to run the VMs. After rolling back (to a one-month-old build), one of the hosts is enabled again. But the other one is showing the alert "Hardware-assisted virtualization is not enabled...", even though it hasn't been disabled. This host has been rebooted from an ssh session without any change. After entering the BIOS to double check, it booted and was immediately enabled without any alert. Regards.

felibb commented 4 months ago

In the forum I was recommended to "try few other commits and see if the behavior change or not". @ronivay is there a good way to tell xo-install.sh which commit to pull from vatesfr/xen-orchestra repo? I would like to stick to using your tools, if possible.

Thread for anyone interested: https://xcp-ng.org/forum/topic/8984/xo-server-loses-pool-and-hosts-momentarily-timeout-error

ronivay commented 4 months ago

Yes, there is. Set the BRANCH variable in the xo-install.cfg file to a commit hash and run the update with xo-install.sh, and it will take you directly to that specific commit. You can use the long or the short commit hash, e.g. 9b9c71a80fd845b9726e0e89c82aa17f6aca2422 or 9b9c71a (latest commit as of now).
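
As a rough illustration of that step, the edit could be scripted as below. This is only a sketch: `set_xo_branch` is a hypothetical helper, not part of the installer, and it assumes xo-install.cfg already contains a line that starts with `BRANCH=`.

```shell
# Hypothetical helper: point BRANCH in an xo-install.cfg at a commit hash.
# Assumes the file already has a line starting with BRANCH=.
set_xo_branch() {
    cfg_file="$1"
    commit="$2"
    # Accept only short or long hexadecimal hashes (7 to 40 characters).
    if ! printf '%s\n' "$commit" | grep -Eq '^[0-9a-f]{7,40}$'; then
        echo "not a commit hash: $commit" >&2
        return 1
    fi
    # Keep a .bak copy of the config, then rewrite the BRANCH line in place.
    sed -i.bak "s|^BRANCH=.*|BRANCH=\"$commit\"|" "$cfg_file"
}
```

For example, `set_xo_branch xo-install.cfg 9b9c71a`, followed by running the update with xo-install.sh as described above. Setting BRANCH back to its previous value (kept in xo-install.cfg.bak) returns to the normal update track.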

Videothek commented 4 months ago

@felibb did you manage to fix the issue?

I am having a related problem and also the same one as described here.

I opened a ticket directly in the XO GitHub repo, but it was closed because I was told it has something to do with the installer: https://github.com/vatesfr/xen-orchestra/issues/7510

tl;dr: my backups fail every time and I can't see why. The logs are confusing, and it seems they couldn't really understand what is happening either.

So does anyone have a fix for this? I have had to rerun my backups daily for the last 2 months, which is really annoying.

felibb commented 4 months ago

@Videothek if you read that forum thread, you've seen my note about the Debian upgrade and about networks. Just last night I decided to switch the network interface on XO (a pre-upgrade bullseye clone) to the LAN marked as Management (blue bubble in the host network tab). I also installed the latest XO commit. No timeouts for over 12h. It seems the Debian version doesn't matter and the current code is fine, but somewhere along the line something changed that possibly made XO more sensitive to latency, maybe that undici library? Speculating here, of course. Anyway, this issue can be closed; it probably doesn't even belong in this repo.