vatesfr / xen-orchestra

The global orchestration solution to manage and backup XCP-ng and XenServer.
https://xen-orchestra.com

Try to automatically reconnect host #3218

Closed. Jetserver closed this 5 years ago

Jetserver commented 6 years ago

Context

Expected behavior

Automatically try to reconnect a physical host, and send a notification alert if the connection still fails after X attempts. Wait X hours between retries.
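The requested policy could be sketched roughly as follows. This is only an illustration of the retry-then-alert idea, not how XO implements it: `connect`, `notify`, `max_attempts` and `wait_seconds` are hypothetical names, and the caller supplies the real reconnect and alerting functions.

```python
import time
from typing import Callable


def reconnect_with_retries(
    connect: Callable[[], bool],
    notify: Callable[[str], None],
    max_attempts: int = 3,
    wait_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call connect() up to max_attempts times, waiting between attempts.

    Returns True as soon as connect() succeeds. If every attempt fails,
    sends a single alert through notify() and returns False.
    """
    for attempt in range(1, max_attempts + 1):
        if connect():
            return True
        # Wait only when another attempt is still to come.
        if attempt < max_attempts:
            sleep(wait_seconds)
    notify(f"host still disconnected after {max_attempts} attempts")
    return False
```

Passing `sleep` as a parameter keeps the waiting strategy replaceable, for example by a scheduler that re-queues the attempt instead of blocking.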

Current behavior

When dealing with 100+ hosts in several geographic locations through one XO, we sometimes encounter communication issues that cause the physical hosts to disconnect.

XO will not try to reconnect automatically, so the host stays offline until a human intervenes.

This could cause serious issues if there are jobs configured on that host, because the jobs will not be applied to it. Sometimes you only discover this when it's too late (no VM backups for a specific host).

hyzermon commented 6 years ago

This is a critical feature for a robust backup solution, which is what I want Xen Orchestra to be for our organization.

olivierlambert commented 6 years ago

@julien-f I raised the severity of this. IDK if it's very complicated or not, but it should be at least planned in some future release :+1:

julien-f commented 6 years ago

  1. [ ] split connected and enabled status
  2. [ ] automatically try to reconnect on backup job

julien-f commented 5 years ago

It's getting better but we are not there yet.

hkraal commented 5 years ago

@julien-f @olivierlambert how is this coming along?

mas90 commented 5 years ago

Any progress? I'm a little surprised that automatic reconnection doesn't exist yet -- it means that Xen Orchestra breaks for our users and requires manual intervention by an admin every few days...

opsben commented 5 years ago

This is looking like a critical requirement for us too. We too frequently have to manually reconnect pools at remote locations. That makes it hard to sell XO to other areas of our business as a stable platform on which to host production services.

olivierlambert commented 5 years ago

I think it exists, but it's not optimal. The main difficulty is that there isn't any real "connected" status for the XAPI: it's just a bunch of events (coming from XAPI), so we can't really tell when it's interrupted. One way to do so is to use a timeout that fires if we haven't received any events in the last minute or so. But this leads to non-trivial consequences; @julien-f is better placed to explain this than I am (but I wanted to show it's not "trivial").
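To make the timeout idea concrete, here is a minimal sketch of an event watchdog. It is an illustration only, not XO's code: `EventWatchdog`, `on_event` and `is_stale` are hypothetical names, and the 60-second default simply mirrors the "last minute or so" mentioned above.

```python
import time
from typing import Callable


class EventWatchdog:
    """Flags a connection as stale when no event arrived within the timeout."""

    def __init__(
        self,
        timeout_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        # Treat creation time as the first "event" so a new watchdog is fresh.
        self._last_event = clock()

    def on_event(self) -> None:
        """Record that an event was just received."""
        self._last_event = self._clock()

    def is_stale(self) -> bool:
        """True if strictly more than timeout_seconds passed since the last event."""
        return self._clock() - self._last_event > self._timeout
```

One limitation is visible even in this sketch: a healthy connection that happens to be quiet looks exactly like a dropped one, so a stale watchdog alone cannot prove the link is down.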

mas90 commented 5 years ago

The /settings/servers page seems to know whether a host is connected, at least -- when we lose a connection to a host, 'Connected' on that page changes to 'Disconnected'. xo-cli also seems to know whether a server is connected.

Maybe all I need is a cron job which runs xo-cli server.getAll and runs xo-cli server.connect ... on each disconnected server.

(Though occasionally XO fails to reconnect to a disconnected server until I restart XO, which is probably a different bug... That seems not to have happened recently though, so maybe that bug has been fixed already :-) )
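The cron-job workaround could look roughly like the sketch below. It rests on several assumptions that should be checked against your xo-cli version: that `xo-cli server.getAll` prints JSON, that each server record carries `id` and `status` fields with `"disconnected"` as a status value, and that `server.connect` accepts an `id=<id>` argument. The helper names are hypothetical.

```python
import json
import subprocess
from typing import Any, Dict, List


def disconnected_ids(servers: List[Dict[str, Any]]) -> List[str]:
    # Assumption: each record has "id" and "status" fields.
    return [s["id"] for s in servers if s.get("status") == "disconnected"]


def reconnect_commands(servers: List[Dict[str, Any]]) -> List[List[str]]:
    # Assumption: server.connect takes the server id as `id=<id>`.
    return [["xo-cli", "server.connect", f"id={sid}"] for sid in disconnected_ids(servers)]


def main() -> None:
    # Assumption: the command prints a JSON array of server records.
    listing = subprocess.run(
        ["xo-cli", "server.getAll"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    for command in reconnect_commands(json.loads(listing)):
        # check=False: one failed reconnect should not stop the others.
        subprocess.run(command, check=False)
```

To use it from cron, call `main()` at the bottom of the script and schedule the script at whatever interval suits you. xo-cli must already be registered against the XO instance for the user running the job.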

julien-f commented 5 years ago

This should be completely fixed by 485b8fe99