victronenergy / gui-v2

Other
29 stars 10 forks source link

10.g MQTT: Handle connection failure in gui-v2 #214

Closed DanielMcInnes closed 10 months ago

chriadam commented 1 year ago

We need a few things here:

1) automatic reconnection on disconnect at runtime, using exponential backoff to avoid accidentally ddosing the broker. https://github.com/victronenergy/veutil/pull/2 should implement this - we need to test it more carefully to ensure it works. 2) failure to connect at startup needs to be handled more carefully. e.g. show both error code and error string, in a variety of error cases: a) unable to connect to broker, b) able to connect but unable to receive portal id (after some timeout) etc. 3) in splash screen, need some manual "try again" button. 4) in settings screen, need some "connection status" to be communicated, and manual reconnect button. Maybe a notification?

chriadam commented 1 year ago

The first two are done.

We actually should never need a manual reconnect button, as we never "give up" completely any more, we always try to reconnect.

When we are disconnected, we go back to the splash screen, so communicating status in settings or notifications screen probably makes no sense. I wonder if we need to revisit this in future (e.g. don't return to splash screen in case of disconnect)?

blammit commented 1 year ago

If the app disconnects and reconnects, there is this output:

VM100:5 WebSocket is already in CLOSING or CLOSED state.
VM100:5 WebSocket is already in CLOSING or CLOSED state.
VM100:5 WebSocket is already in CLOSING or CLOSED state.
VM100:5 WebSocket is already in CLOSING or CLOSED state.
qtloader.js:382 MQTT client error: QMqttClient::ServerUnavailable
qtloader.js:382 BackendConnection state: Victron::VenusOS::BackendConnection::Disconnected
qtloader.js:382 BackendConnection state: Victron::VenusOS::BackendConnection::Reconnecting
qtloader.js:382 MQTT client error: QMqttClient::NoError
qtloader.js:382 BackendConnection state: Victron::VenusOS::BackendConnection::Connecting
qtloader.js:382 BackendConnection state: Victron::VenusOS::BackendConnection::Connected
qtloader.js:382 BackendConnection state: Victron::VenusOS::BackendConnection::Initializing
qtloader.js:382 BackendConnection state: Victron::VenusOS::BackendConnection::Ready
qtloader.js:382 qml: Data manager source is already set to qrc:/data/mqtt/MqttDataManager.qml cannot be changed after initialization

Perhaps the Data manager source could be set to undefined if the disconnection is broken, and then set to a valid value again on reconnection.

blammit commented 1 year ago

Tested today on main, if I run on desktop and connect directly to cerbo with --mqtt option.

If I turn off the wifi, after maybe 1 minute 20 seconds, gui-v2 returns to the splash screen. Then it will try to reconnect, but reconnection does not work, and the UI stays like this:

image

chriadam commented 1 year ago

https://github.com/victronenergy/gui-v2/pull/495 fixes the above issue.

blammit commented 1 year ago

In Thiemo's testing, it did reconnect after waiting for some time. So perhaps we can just shorten the heartbeat detection so it detects the disconnection and reconnects a bit sooner.

chriadam commented 1 year ago

We cannot necessarily use heartbeat detection for this.

If we are connected to VRM (and that connection is active/valid) but the device has disconnected from VRM (two-way), then we will NOT receive heartbeats from the device via the VRM connection (because VRM cannot forward what it does not receive). So, in this case, lack of heartbeat doesn't necessarily mean lack of connection to the VRM broker, for example (athough it CAN mean that).

We could add logic to detect lack of heartbeat for the direct-to-device LAN connection case - but not sure how useful that case is (since it won't affect the CerboGX device UI).

chriadam commented 1 year ago

To be precise, my understanding is that the Heartbeat timer is mostly so that we can set the HeartbeatState (i.e. Active or Inactive) to give some indicator to the user whether the values being shown in the UI are "live" or "stale". We don't currently have designs from Serj for how this should be indicated, as far as I know.

chriadam commented 1 year ago

In Thiemo's testing, it did reconnect after waiting for some time. So perhaps we can just shorten the heartbeat detection so it detects the disconnection and reconnects a bit sooner.

Thiemo was probably using WebSocket connection? With WebSockets, Qt's stack eventually notices the problem and does the teardown and reconnection itself. But it takes a long time.

chriadam commented 1 year ago

Wiebe helped to investigate the "after reconnection, not receiving messages from the broker" issue, and he fixed a problem in FlashMQ (see https://github.com/halfgaar/FlashMQ/commit/f43245ef851055c19435d6745d37fc2bb6c08a1b). So, my workaround should not be needed any more, I think. But let's wait until the updated version of FlashMQ is included in a VenusOS firmware image we can test with our CerboGX device.

chriadam commented 1 year ago

Still need to follow up on this, since QTBUG-118820 needs to be worked around also.

chriadam commented 10 months ago

We merged https://github.com/victronenergy/gui-v2/pull/585 which fixes it. Obviously, the fix isn't perfect because QtMqtt can be slow telling clients that they've disconnected, so it can take some time for us to react to the disconnection. But the disconnection IS eventually noticed and handled properly, now.

Closing task. Can reopen if there is some specific issue.