openhab / openhab-addons

Add-ons for openHAB
https://www.openhab.org/
Eclipse Public License 2.0
1.86k stars 3.56k forks source link

openHAB get into deadlock blocking in Discovery service #8419

Closed sascha777 closed 3 years ago

sascha777 commented 3 years ago

Hi all,

I struggle with a non-responsive Hue binding (and some others) that occurs every few hours since OH 2.5.6.

Description of the issue: Since 2.5.6 (tested also with 2.5.7. and 2.5.8) I encounter problems with my hue binding on my Pi 4. Every few hours the binding gets into a non-responsive state. This was discussed in the OH community in granular detail here.

When I disable the upnp discovery service in Karaf everything is running normally. So does having the harmony binding disabled. Having upnp discovery and harmony binding enabled a deadlock occurs every few hours. This also causes a few other binding to be non-responsive, such as the Homematic binding. After a full OH restart everything runs again, but only for a couple of hours.

The following services are affected (monited through my Grafana persitence charts):

These bindings continue to work properly:

Expected behaviour All services run normally without a deadlock situation.

Debug history: My finding is that the non-responsive state come from a deadlock that I monitored in Karaf:

openhab> threads --monitors --locks|grep BLOCKED
“OH-thingHandler-1” Id=222 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“OH-thingHandler-3” Id=229 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“OH-thingHandler-4” Id=232 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“OH-thingHandler-5” Id=233 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“upnp-main-243” Id=2133 in BLOCKED on lock=java.lang.Object@197628b
“OH-discovery-75” Id=2193 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“upnp-main-258” Id=2242 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“OH-discovery-77” Id=2244 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“HarmonyDiscoveryServer(tcp/39185)” Id=2245 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“upnp-main-262” Id=2261 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“OH-discovery-83” Id=2550 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“OH-discovery-84” Id=2590 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“OH-discovery-85” Id=2642 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“HarmonyDiscoveryServer(tcp/40065)” Id=2690 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9
“HarmonyDiscoveryServer(tcp/35275)” Id=3111 in BLOCKED on lock=org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1a7e1c9

I have played around disabling/enabling some services through Karaf.

With a downgrade of my OH system to 2.5.5 everything is running normally with all (the same) services enabled.

Assumption Having checked the commits on 2.5.6 I saw changes in the upnp binding with I assume to hold the reason for the deadlock. @Hilbrand pointed me to this:

In the parent class of the handler a number of methods are synchronized. It’s possible both the handler and discovery access different methods at the same time causing them to wait for each other.

Am I able to solve this issue on my own? Unfortunately not. As I don't have deep programming experience I cannot dig deeper into this. But I am willing to help with testing if I get some directions of what to check.

Temporary workaround I can disable upnp discovery until I need it to join new things.

Does anyone have an idea of what this could be? Or anything further that I can debug?

Hilbrand commented 3 years ago

It's unrelated to the upnp control binding. upnp is a general protocol used by multiple bindings. So it's not likely this is related to the upnp control binding, but the general upnp service in openHAB.

The threads commands also shows the complete stacktrace of each blocking thread. Can you add those stack traces? It will help identify the origin of the thread locking.

sascha777 commented 3 years ago

Here is a full stacktrace of the deadlock situation.

Hilbrand commented 3 years ago

At first sight it might be the samsung tv binding:

"upnp-main-770" Id=5495 in BLOCKED on lock=java.lang.Object@c36ba1
     owned by ESH-httpClient-samsungtv-1541 Id=1541
    at org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:90)
    at org.openhab.binding.samsungtv.internal.protocol.RemoteControllerWebSocket.closeConnection(RemoteControllerWebSocket.java:201)
    at org.openhab.binding.samsungtv.internal.protocol.RemoteControllerWebSocket.close(RemoteControllerWebSocket.java:210)
    at org.openhab.binding.samsungtv.internal.service.RemoteControllerService.stop(RemoteControllerService.java:240)
    at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.stopService(SamsungTvHandler.java:327)
      - locked org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler@173e09e
    at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.shutdown(SamsungTvHandler.java:186)
    at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.thingRemoved(SamsungTvHandler.java:370)
    at org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl$2.run(DiscoveryServiceRegistryImpl.java:291)
    at org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl$2.run(DiscoveryServiceRegistryImpl.java:1)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl.thingRemoved(DiscoveryServiceRegistryImpl.java:288)
      - locked org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1166469
    at org.eclipse.smarthome.config.discovery.AbstractDiscoveryService.thingRemoved(AbstractDiscoveryService.java:298)
    at org.eclipse.smarthome.config.discovery.upnp.internal.UpnpDiscoveryService.remoteDeviceRemoved(UpnpDiscoveryService.java:213)
    at org.jupnp.registry.RemoteItems$4.run(RemoteItems.java:234)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    Locked synchronizers: count = 1
      - java.util.concurrent.ThreadPoolExecutor$Worker@218c26

    Locked monitors: count = 2
      - org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler@173e09e locked at
          4 org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.stopService(SamsungTvHandler.java:327)
      - org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl@1166469 locked at
          10 org.eclipse.smarthome.config.discovery.internal.DiscoveryServiceRegistryImpl.thingRemoved(DiscoveryServiceRegistryImpl.java:288)

This binding calls from the discovery service back to the handler, while there are also calls from the handler to the discovery service on synchronized methods. If calls happen from multiple threads they could block each other.

There is actually a pull request to refactor part of this code (#7499) to fix the current implementation that is already questionable. So if it is indeed the samsung tv binding. this pr might even fix it.

To tests if it is the samsung tv binding can you disable/remove the samsung thing(s). Remove/Disable them because I'm not sure if disabling discovery only for the samsung tv binding might work here as it is interwoven with the handler, so disabling might still lead to calls to discovery. If that does proof the cause we can test if an updated version with the changes of the pull request might fix the problem.

sascha777 commented 3 years ago

👍 I removed the samsung things + binding and restarted upnp discovery service. Will keep watching and report back.

lolodomo commented 3 years ago

There was no change of the core framework since 2.5.0. The last change in the Harmonyhub binding seems to be PR #6636 merged the 20th of Dember and so probably included sice version 2.5.1. Why should it fail starting at 2.5.6 while there were no changes in the binding and in the core framework ?

Hilbrand commented 3 years ago

@lolodomo This has very likely nothing to do with the Harmonyhub (I'll update the title). See my other comment. With the information know I would say this is due to the samsung tv binding. As deadlocks can trigger due to timing related processes it's not always due to changes, but might also have been lingering, but simply was never triggered.

lolodomo commented 3 years ago

It was not mentioned that the SamsungTV binding was used too. It is well known that the SamsungTV binding in conjonction with the UPnP library is sometimes producing strange behavior for few users. We already have an issue opened for that. As it was mentioned that just uninstalling the HarmonyHub binding solved the problem, I was thinking about an issue in this binding.

lolodomo commented 3 years ago

If OH 2.5.5 is working well, it means the problem was introduced by one change in 2.5.6 in one binding you are using, Can you list all bindings you are using?

sascha777 commented 3 years ago

System is running fine without the samsung binding!

It was not mentioned that the SamsungTV binding was used too.

Sorry, in my ealier post I just listed the bindings for which I had persisted item states in my Grafana dashboards.

Here's a full list of my bindings:

$ cat /var/lib/openhab2/config/org/openhab/addons.config
:org.apache.felix.configadmin.revision:=L"47"
action=""
binding="astro,comfoair1,denonmarantz,exec,harmonyhub,homematic,heos,network,novelanheatpump1,systeminfo,zwave,miio,mqtt,telegram,hue,openweathermap,amazonechocontrol"
misc="openhabcloud"
package="standard"
persistence="influxdb,mapdb,rrd4j"
service.pid="org.openhab.addons"
transformation="exec,javascript,jinja,jsonpath,map,regex,xpath"
ui="dashboard,paper,basic"

As it was mentioned that just uninstalling the HarmonyHub binding solved the problem, I was thinking about an issue in this binding.

I am pretty sure that the system was also was running fine with having the samsung binding installed and the harmony service disbled. That's what i see in my history:

06.09.2020 | 09:21 | bundle:stop HarmonyHub Binding
06.09.2020 | 09:21 | bundle:start Configuration UPnP Discovery
--> no probs here
07.09.2020 | 08:28 | bundle:start HarmonyHub Binding
07.09.2020 | 08:29 | bundle:stop Configuration UPnP Discovery
--> no probs here
07.09.2020 | 12:28 | bundle:start Configuration UPnP Discovery
07.09.2020 | 06:49 | threads --monitors --locks
--> deadlock
07.09.2020 | 06:49 | bundle:stop Configuration UPnP Discovery
--> no probs here
08.09.2020 | 09:39 | bundle:start Configuration UPnP Discovery

If it helps I can retest this. But right now for me it looks like this:

UPnP | Harmony | Samsung | Deadlock
running | running | running | yes
stopped | running | running | no
running | stopped | running | no
running | running | uninstalled | no
lolodomo commented 3 years ago

I can try building a jar if you would like testing my PR for the SamsungTV binding. This PR is now relatively old as it interested noone.

Hilbrand commented 3 years ago

@sascha777 It would be great if you could test the updated SamsungTV binding.

sascha777 commented 3 years ago

Sure I can test this. Do you mean the older one from lolodomo or the current changes from the master (this looks only like a different base64 encoding handling)?

Terr4 commented 3 years ago

Hi guys,

Could this be related? https://community.openhab.org/t/hue-binding-commands-all-delayed-by-at-least-1-minute-since-2-5-10/109119/4

Hue Binding works for me again after uninstalling the network binding