vatesfr / xen-orchestra

The global orchestration solution to manage and backup XCP-ng and XenServer.
https://xen-orchestra.com

[Backup NG] Connection issue causes stalled backup #3930

Closed jcharaoui closed 5 years ago

jcharaoui commented 5 years ago

Context

Expected behavior

Stable connection.

Current behavior

Since upgrading to XO 5.31.0, I'm experiencing intermittent connection issues between xo-server and XCP-ng.

I first noticed this when tagging new VMs. After typing the tag and pressing Enter, the input box would clear as expected, but the tag never appeared on the UI. I was able to confirm that the tag was indeed applied via XenAdmin. It was only after disconnecting and reconnecting to the server (via Settings, Servers), that the tag finally appeared. After reconnection, tagging worked as usual.

The more serious symptom is related to backups. The first backup run after upgrading to 5.31.0 didn't go well. After backing up 4 VMs successfully, xo-server stalled here (concurrency is set to 2):

xo:xapi - [DEBUG] Deleting VM [XO Backup BackupJob] VmName1
xo:xapi - [DEBUG] Deleting VM [XO Backup BackupJob] VmName2

In the UI, the transfer and merge operations are shown as successful for these 2 VMs. It just seems stuck trying to delete these snapshots. Logging in via XenAdmin, I can see the two VM snapshots still exist. Also, if I try tagging VMs again, the problem described above is back.

I can confirm there's no issue with the XCP-ng pool or servers, everything else is operating normally.

I've been running 5.30 without any issues for several weeks, and the problems described here appeared on day 1 after the upgrade to 5.31.

Danp2 commented 5 years ago

This sounds similar to #3875. Can you try building a new XO VM, restore your config, and then retest?

jcharaoui commented 5 years ago

@Danp2 Thanks for the link. Unfortunately I don't think the two are linked, since I'm not using an NFS remote, but rather a local filesystem one.

Danp2 commented 5 years ago

Understood. However, your description of "loss of connection" sounded similar to my experience. Thus I thought you should at least try with a new XO VM to see if the issue continues to occur.

ronivay commented 5 years ago

I've seen a similar backup issue a few times after a network outage on my Xen Orchestra side. The (delta) backup finishes a few new snapshots and then does nothing until I restart xo-server to cancel it. I didn't write down version details the first time this happened, but it was certainly older than 5.30. I haven't been able to reproduce the issue, even by breaking the connection intentionally, so the conditions under which it actually happens need more investigation. I haven't seen any errors in the logs so far, and some tasks (the snapshots) from the job finish before it hangs, so it's not as simple as the connection not having been restored.

In my case, I had a network outage during the day/evening and the backup job the following night failed to finish.

I don't really have anything useful to give here, other than that it seems you're not the only one who's had similar weirdness.

I also have local storage for backups.

ronivay commented 5 years ago

This happened once again. I had a short network outage (a few minutes) around an hour before the backup job was scheduled to run. I'm currently running 5.34.0. I can see in the xo-server logs that it at least tried to snapshot a couple of the VMs from the backup job, but the snapshots never actually completed.

2019-02-13T02:00:00.010Z - xo:xapi - [DEBUG] Snapshotting VM vmname1 [XO Backup Delta Backup] vmname1
2019-02-13T02:00:00.013Z - xo:xapi - [DEBUG] Snapshotting VM vmname2 as [XO Backup Delta Backup] vmname2

No further information in the logs. The backup job includes 21 VMs. The job and all its tasks are currently in the "started" state and not doing anything.

What I noticed is that at least some of the VM information is not updating. I tried to rename one VM through Xen Orchestra: the new name didn't show up in Xen Orchestra, but it did change (checked from the host with xe vm-list), so Xen Orchestra is in some weird state now and I can't see any information about it in the logs.

After hitting disconnect/connect for the host, the VM name I had just changed showed up correctly. The backup didn't recover though, so I had to restart xo-server once again to interrupt it.

One of the biggest downsides of this is that one does not get any notification of a failed backup when it's stuck. I have this covered with my own monitoring, which keeps an eye on successful backup reports, but for some the worst-case scenario is that a backup hasn't been running for days or even weeks if it's stuck. It would be nice to have a limit on how long a backup can stay in the running state before failing, so that a notification is sent if something like this happens.
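As a rough illustration of the time limit suggested here, the check could look something like the sketch below. It is not how xo-server works: the `JobRun` record, the `find_stalled_runs` helper, and the job names are all hypothetical, and it assumes an external monitoring script can obtain each run's status and start time from somewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class JobRun:
    """Hypothetical record of one backup job run (not an xo-server type)."""
    job_id: str
    status: str            # e.g. "started", "success", "failure"
    started_at: datetime   # timezone-aware start time


def find_stalled_runs(runs, max_running, now=None):
    """Return the runs still in the "started" state for longer than max_running."""
    if now is None:
        now = datetime.now(timezone.utc)
    return [
        run for run in runs
        if run.status == "started" and now - run.started_at > max_running
    ]


# Example with made-up data: a job started at 02:00 UTC, checked at 10:00 UTC
# against a 6-hour limit.
runs = [
    JobRun("delta-backup", "started", datetime(2019, 2, 13, 2, 0, tzinfo=timezone.utc)),
    JobRun("other-job", "success", datetime(2019, 2, 13, 1, 0, tzinfo=timezone.utc)),
]
now = datetime(2019, 2, 13, 10, 0, tzinfo=timezone.utc)
for run in find_stalled_runs(runs, timedelta(hours=6), now):
    print(f"Backup job {run.job_id} has been running since {run.started_at.isoformat()}")
```

A monitoring script could run a check like this periodically and send an alert for every run it returns, rather than relying only on the absence of a success report.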

olivierlambert commented 5 years ago

Thanks for the feedback, the XO team will try to reproduce this behavior!

jcharaoui commented 5 years ago

This appears to be fixed since xo-server 5.36.3, as I haven't hit the problem again since upgrading.

julien-f commented 5 years ago

Great, thank you for your feedback 🙂