Closed: dupondje closed this 1 year ago
Most likely this happened while the connection issues were still ongoing. It also seems to retry the connection only 3 times or so? That could explain it.
Had this issue again today, and decided to troubleshoot it further.
Connection failed:
2023-05-16 01:08:00,233+02 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-18) [] Host 'xxxxx' is not responding. It will stay in Connecting state for a grace period of 78 seconds and after that an attempt to fence the host will be issued.
2023-05-16 01:08:00,237+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-31) [] Unable to GetStats: VDSNetworkException: VDSGenericException: VDSNetworkException: Connection timeout for host 'x.x.x.240', last response arrived 17047 ms ago.
It reconnects again:
2023-05-16 01:08:01,382+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /x.x.x.240
2023-05-16 01:08:01,382+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connected to /x.x.x.240:54321
But then RefreshCapabilities fails:
2023-05-16 01:08:08,953+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-71) [] Unable to RefreshCapabilities: ClientConnectionException: SSL session is invalid
This seems to cause the HostConnectionRefresher to stop refreshing the VDS, which leaves the host in Connecting state forever. Could this be because the EventPublisher in jsonrpc stopped?
The HostMonitoringWatchdog confirms it:
2023-05-16 01:41:20,177+02 WARN [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoringWatchdog] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Monitoring not executed for the host x.x.x.240 [614c7aea-faca-4b2c-a521-6fb8e564fa56] for 2007762ms
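To check that the watchdog warning lines up with the failure above, the stall duration can be subtracted from the log timestamp. The following is only a rough sketch: the regular expression and the helper name `last_monitoring_time` are my own assumptions based on the single log line quoted here, not anything provided by ovirt-engine, and the timezone suffix is ignored.

```python
import re
from datetime import datetime, timedelta

# Pattern written against the one watchdog line quoted in this issue
# (an assumption; other engine versions may format the message differently).
WATCHDOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\S*\s+WARN\s+"
    r"\[org\.ovirt\.engine\.core\.vdsbroker\.monitoring\.HostMonitoringWatchdog\].*"
    r"Monitoring not executed for the host (?P<host>\S+) "
    r"\[(?P<id>[0-9a-f-]+)\] for (?P<ms>\d+)ms"
)


def last_monitoring_time(line):
    """Return (host, approximate time monitoring last ran), or None if no match."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    # Local log time only; the "+02" offset is skipped by the pattern.
    logged_at = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
    stalled_for = timedelta(milliseconds=int(m["ms"]))
    return m["host"], logged_at - stalled_for


line = (
    "2023-05-16 01:41:20,177+02 WARN "
    "[org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoringWatchdog] "
    "(EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] "
    "Monitoring not executed for the host x.x.x.240 "
    "[614c7aea-faca-4b2c-a521-6fb8e564fa56] for 2007762ms"
)

host, last_run = last_monitoring_time(line)
print(host, last_run)  # x.x.x.240 2023-05-16 01:07:52.415000
```

2007762 ms is about 33.5 minutes, so monitoring last ran at roughly 01:07:52, a few seconds before the 01:08:00 connection timeout. That fits the theory that monitoring never resumed after the failed RefreshCapabilities.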
Maybe @pkliczewski has some idea?
We had some connection issue between the ovirt-engine and the hosts.
Afterwards we noticed that the hosts didn't get reconnected once the issue was resolved. Checking the logs shows the following:
It might be related to https://github.com/oVirt/vdsm-jsonrpc-java/pull/17?
After restarting the ovirt-engine it started working again. But it would be cleaner if it reconnected by itself :)
Thanks