openhab / openhab-addons

Add-ons for openHAB
https://www.openhab.org/
Eclipse Public License 2.0

[homematic] Automatic reconnection after CCU to OpenHAB connection interruption #11427

Open Joerg-Dr opened 3 years ago

Joerg-Dr commented 3 years ago

This feature request suggests improving the reconnection of disconnected devices after the Homematic CCU has been rebooted or has been offline for any other reason.

It is related to this bug report: #8808

In more detail: after the Homematic binding loses its connection to the CCU, some of the Homematic devices (mainly the Homematic IP devices) are not recognized in time and therefore stay in status "Error".

A solution for this problem is available from binding version 3.2.0: it extends the waiting time for devices to become available.

But once the total waiting time has expired, there are no more retries, so non-responding Homematic devices can remain in status "Error".

The suggestion is to implement an automatic connection retry for those "failed" devices every xx minutes, so that the system never gives up trying to establish a connection.

My environment: Original Homematic CCU2 with several Homematic and Homematic IP devices. OpenHAB 3.1.0 release build running on Raspberry Pi 4.

MHerbst commented 3 years ago

Thanks for the issue report. Can you give me some more information? I have tried to reproduce it, but was not successful.

The re-connect is made from the add-on to the CCU, or more precisely to one service on the CCU for HM devices and to another service for HmIP devices. The HmIP service in particular needs longer until it accepts event registrations. If the add-on receives an event from one of the devices in error state, or if a command can be sent successfully, it should change the thing state to OK.
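The event-driven recovery described here can be illustrated with a minimal sketch. It is not the binding's code: `ThingStateTracker`, `State`, `onEvent` and `onCommandResult` are hypothetical names chosen for illustration, and the real thing status handling in openHAB is more involved.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; these names are assumptions, not taken from the Homematic binding. */
class ThingStateTracker {
    enum State { ONLINE, ERROR }

    // Device address -> last known state; unknown devices count as ONLINE here.
    private final Map<String, State> states = new ConcurrentHashMap<>();

    void setError(String address) {
        states.put(address, State.ERROR);
    }

    State stateOf(String address) {
        return states.getOrDefault(address, State.ONLINE);
    }

    /** Called for every event received for a device. Returns true if the state flipped to ONLINE. */
    boolean onEvent(String address) {
        // Only flips devices that are currently in ERROR.
        return states.replace(address, State.ERROR, State.ONLINE);
    }

    /** Called after a command was sent. A successful send also clears the error state. */
    boolean onCommandResult(String address, boolean success) {
        return success && states.replace(address, State.ERROR, State.ONLINE);
    }
}
```

With this approach a device only recovers when there is traffic for it, so a device that never sends an event and is never commanded would stay in error state.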

Joerg-Dr commented 3 years ago

@MHerbst

This suggestion is just an idea for a possible improvement. Your current solution with the extended wait time for the devices to become accessible worked fine, but I only tested it once.

As far as I understand, the binding makes several reconnection attempts after the connection to the CCU has been lost. But once these attempts are used up, no more reconnection attempts are made, even if the device becomes available after some time. That was the situation I saw with the Homematic binding before the extended wait time was implemented.

So, the idea of my suggestion is to make further connection attempts (at least one) every xx minutes (maybe every 5 minutes). As a result, an available Homematic (IP) device would be reactivated even after a much longer time.

So this strategy does not rely on an event being received from the CCU. Instead, it would use a timer to initiate one more connection attempt every xx minutes for those devices that are still in error state.
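A minimal sketch of such a timer-based retry, under stated assumptions: `ErrorStateRetrier`, `State` and the `probe` callback are hypothetical names and do not come from the binding. The probe stands in for whatever check would tell whether a device answers again.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical sketch; none of these names come from the Homematic binding. */
class ErrorStateRetrier {
    enum State { ONLINE, ERROR }

    private final Map<String, State> devices = new ConcurrentHashMap<>();
    private final Predicate<String> probe; // assumed check: true if the device answers

    ErrorStateRetrier(Predicate<String> probe) {
        this.probe = probe;
    }

    void markError(String address) {
        devices.put(address, State.ERROR);
    }

    State stateOf(String address) {
        return devices.get(address);
    }

    /** One retry pass: probes only devices still in ERROR and returns how many recovered. */
    int retryOnce() {
        int recovered = 0;
        for (String address : devices.keySet()) {
            if (devices.get(address) == State.ERROR && probe.test(address)) {
                devices.put(address, State.ONLINE);
                recovered++;
            }
        }
        return recovered;
    }

    /** Repeats retryOnce() with a fixed delay; the caller shuts the scheduler down when done. */
    ScheduledExecutorService start(long interval, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                retryOnce();
            } catch (RuntimeException ex) {
                // Swallow the failure so that later runs are not cancelled.
            }
        }, interval, interval, unit);
        return scheduler;
    }
}
```

With `start(5, TimeUnit.MINUTES)` this would match the "every 5 minutes" interval mentioned above. Devices that are already online are skipped, so the pass costs little when nothing is in error state.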

Joerg-Dr commented 2 years ago

@MHerbst

Do you plan to implement the suggested enhancement, so that the Homematic binding automatically tries to re-establish the connection when communication with a device is lost?

It is not urgent for me. Please just let me know whether this idea is something you might want to follow up on.

If I can be of any help please let me know.

MHerbst commented 2 years ago

@Joerg-Dr I have created a PR for a general improvement of the re-connect handling (it is not merged yet because I need to change parts of the implementation). I hope that it will solve most of these problems, so I would recommend testing whether these changes resolve the problem.

The connection between OH and the CCU is not per device. There is one connection for each device type (classical HM device, HmIP devices, ...). If one device is not ready it would probably not really help to perform a full re-connect cycle.

If a device stays in error state after an automatic reconnect, it should change its state as soon as openHAB receives a new event from this device or you try to execute an action. If this does not happen, it would be better to fix the state handling for these situations. You could help by providing some detailed logs if you run into this situation. The following would help: if a device remains in error state after a restart, please enable the TRACE log mode for the HM binding and try to perform an action for this device or wait for an event from the CCU. If the device does not change its state, please attach the log here and I can try to figure out what's going wrong.

MHerbst commented 2 years ago

In the meantime, during some tests, I was able to reproduce the problem with one of my HmIP devices. Strangely, it only happens sometimes because it is some sort of timing issue. I will investigate further.

MHerbst commented 2 years ago

I have modified the reconnect handling in PR #11429. In my environment, I no longer had problems with HmIP devices. But there is a known problem in the CCU: sometimes it can take about 5 min until the CCU itself has successfully reconnected a device.

If your problem still occurs with the implementation of the PR, I would need some log information.

Joerg-Dr commented 2 years ago

@MHerbst

Unfortunately, I am not familiar with GitHub (not yet, at least). Could you send me a .jar file of your latest changes so that I can use it for testing?

MHerbst commented 2 years ago

@Joerg-Dr You can download it from here: https://github.com/MHerbst/openhab-addons-test

Joerg-Dr commented 2 years ago

@MHerbst

Thank you for the new version. I installed and tested it, here is the result:

Here is the debug log file: Homematic 10-12-2021.log

Joerg-Dr commented 2 years ago

After pressing "Save (CTRL-S)" on the Thing "Wohnzimmer.HmIP-STHD" page, the HmIP device immediately changed to "online" status.

Homematic 10-12-2021 (after Pressing Save).log

This is the function I suggested: an automatic re-initialization of any devices that are still in "Error" state after a certain time.

MHerbst commented 2 years ago

@Joerg-Dr First of all, I can see this message in the log file: Connection only partially restored. It is recommended to restart the binding. Therefore I would recommend increasing the "Callback Reg. Timeout" value. A CCU2 probably needs longer than 2 min.

If this does not help, please test what happens if the binding receives an event from the device or if you change an item value. Does the state change to "Online"? If not, please set the log mode to TRACE and try again. It would be interesting to see why the state does not change in these situations. I know a way to implement an automatic "recheck", but I would like to avoid it if possible.

Generally, I think the current behaviour is more of a problem on the Homematic side. It seems that the daemon serving requests for HmIP devices gets into trouble if it has to answer multiple requests in a rather short time. That's why a device sometimes ends up in error state. In my environment this was really hard to reproduce (I have only 4 HmIP devices).

Joerg-Dr commented 2 years ago

@MHerbst Thank you for your feedback.

Increasing the "Callback Reg. Timeout" from 120 to 300 seconds solves the issue with the "slow" CCU2. But my intention was actually to provoke an error, to see whether an automatic re-check and re-enabling of a device would work.

My proposal of an automatic re-enabling was only an idea to handle situations where a connect timeout has occurred, no matter why it happened. That way, errors could possibly be repaired without user interaction, which could be an advantage.

But for me it would also be fine to leave it as it is, whatever you prefer.
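Why a longer timeout helps a slow CCU2 can be illustrated with a small sketch of a wait-with-deadline loop. How the binding actually applies the "Callback Reg. Timeout" is not shown in this thread, so the class, the method and the `registered` check below are hypothetical and only illustrate the general idea.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; not the binding's implementation of "Callback Reg. Timeout". */
class CallbackRegistrationWaiter {

    /**
     * Polls the assumed check until it succeeds or the timeout elapses.
     * Returns true on success, false on timeout or interruption.
     */
    static boolean waitUntilRegistered(BooleanSupplier registered, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (registered.getAsBoolean()) {
                return true;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000;
            if (remainingMillis <= 0) {
                return false; // timed out: the connection would count as only partially restored
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.max(1, Math.min(pollInterval.toMillis(), remainingMillis)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In this picture, raising the timeout from `Duration.ofSeconds(120)` to `Duration.ofSeconds(300)` simply gives a slow gateway more polls before the wait gives up.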

MHerbst commented 2 years ago

@Joerg-Dr Thanks for the retest.

In my opinion, parts of the binding would have to be completely reworked. E.g. I would prefer to create a separate bridge thing for each connection type (HM-RF, HmIP, Groups, CuxD). But that would be a breaking change. With a rewritten connect handling it would be much easier to add additional device-specific checks. For the time being, I will not change anything, because I simply do not have the time.

But we can leave this issue open as a reminder.

Joerg-Dr commented 2 years ago

@MHerbst

"because I simply do not have the time"

I understand very well, I have the same problem :-) If you need any help for future testing, please let me know.

Elle4u commented 2 years ago

I have a (similar?) problem, too. I think it has the same root cause: when the connection to the CCU is broken for time x (for example, a reboot of the router), the binding does a reconnect and can communicate with most of the devices. But for HmIP devices there are no longer any status updates, so the status expires after x hours (if the expire add-on is used). In these cases I manually disable and enable the CCU bridge and everything works fine again.

Joerg-Dr commented 2 years ago

Yes, this is the cause of the problem and the reason I made this suggestion, so that reconnects are done automatically. But as MHerbst mentioned, this would not be so easy to implement.

@MHerbst I have seen that the latest openHAB update installed new versions of the add-ons. Does it also include a new version of the Homematic binding? Are the enhancements you made recently already included in the current version?

MHerbst commented 2 years ago

The reason for the problem is the CCU and how it handles HmIP devices. After a restart, all "old" HM devices are immediately online. But for some HmIP devices it can take about 5 min. until they are online in the CCU. After a lost connection, the binding tries to reconnect with the CCU as soon as possible. If this happens before all devices are online in the CCU, some of them will appear as offline. I had hoped that the thing state would change automatically once the binding receives new events for those things. It seems that this is not the case. It even seems to depend on the device type. Maybe implementing @Joerg-Dr's proposal is easier than I thought before. But I need some time to investigate further.

@Elle4u Have you updated to 3.2? The reconnect handling should be better in this version, but it is probably not 100% reliable for the reasons I mentioned above.

@Joerg-Dr Yes, the new version of the binding is included in OH 3.2 and the enhancements are included (they were merged just in time).