openhab / openhab1-addons

Add-ons for openHAB 1.x
Eclipse Public License 2.0

Knx failed to reconnect and other bindings stop working #1068

Closed MMax2 closed 8 years ago

MMax2 commented 10 years ago

I tested it with a Jung IP-Schnittstelle IPS 200 REG (LAN cable) and openHAB 1.4.0 stable. The problem is very easy to reproduce: once the system is up and running, I unplug the LAN cable and try to send a command to KNX from the UI. The server says:

KNX link has been lost (reason: maximum send attempts on object link 192.168.1.27:3671 tunnelling mode (closed), TP1 hopcount 6) - reconnecting...

And after 16 ms:

Error connecting to KNX bus: null

Then:

KNX link has been lost!
KNX link will be retried in 30 seconds.

After that, all other bindings stop working. Then I reconnect the cable, but the other bindings still don't work. After 30 seconds the server says:

Established connection to KNX bus on 192.168.1.27:3671 in mode TUNNEL.

Now if, for example, I change the state of a switch, nothing is displayed on the server. I think the knx binding stops working too, like the other bindings. It almost never reconnects correctly; most of the time it fails.

Massimo

MMax2 commented 10 years ago

Same behavior with Jung KNX IP-Router REG IPR 100 REG. Massimo

teichsta commented 10 years ago

Hi @MMax2 could you please start in debug mode (start_debug.xxx) and post the openhab.log of a clean session where you've walked through the steps mentioned above? Thanks, Thomas E.-E.

MMax2 commented 10 years ago

Hi! Sorry for the delay. I investigated this issue further and prepared a much reduced test environment that reproduces it.

File openhab.cfg:

knx:ip=192.168.1.52
knx:type=TUNNEL
knx:port=3671
knx:pause=50
knx:autoReconnectPeriod=30

File test.items:

Switch Luce_1 {knx="1.001:0/0/1+<0/7/1"}

File test.sitemap:

sitemap test label="test" {
 Frame label="Home" {
  Switch item=Luce_1
 }
}

In the webapps directory I added a "test" directory with jQuery 1.11.0 and the following files.

File index.html:

<!DOCTYPE html>
<html>
 <head>
  <title></title>
  <script src="jquery-1.11.0.min.js" type="text/javascript"></script>
  <script src="index.js" type="text/javascript"></script>
 </head>
 <body>
  <button id="btnStart">Start</button>
 </body>
</html>

File index.js:

$(function () {
 $("#btnStart").on("click", function () {start();});
});

function start() {
 var request = $.ajax({
  type: "POST",
  url: "http://localhost:8080/rest/items/Luce_1",
  data: "ON",
  headers: { 'Content-Type': 'text/plain' }
 });
}

The function start() in index.js is exactly the same as in the REST Samples.
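For readers who want to reproduce this without jQuery, the same request can be sketched with `fetch`. This is only an illustration: `buildCommandRequest` and `sendCommand` are hypothetical helper names, and the base URL is simply the one used in index.js above.

```javascript
// Sketch only: builds the same POST that index.js sends via jQuery.
// buildCommandRequest and sendCommand are hypothetical helper names.
function buildCommandRequest(itemName, command, baseUrl = "http://localhost:8080") {
  return {
    url: `${baseUrl}/rest/items/${encodeURIComponent(itemName)}`,
    options: {
      method: "POST",
      headers: { "Content-Type": "text/plain" },
      body: command
    }
  };
}

// Sends the command and rejects on an HTTP error status,
// so a failed command shows up in the console instead of being ignored.
async function sendCommand(itemName, command, fetchImpl = globalThis.fetch) {
  const { url, options } = buildCommandRequest(itemName, command);
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`Command ${command} to ${itemName} failed: HTTP ${response.status}`);
  }
  return response.status;
}
```

Calling `sendCommand("Luce_1", "ON")` from the browser console should be equivalent to pressing the Start button in the test page.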

Now I follow these steps:

  1. Launch openHAB (start.bat).
  2. Open Google Chrome and type http://localhost:8080/test/index.html
  3. If you push the button, all is correct.
  4. Now switch off the bus (or unplug the network cable).
  5. Press the Start button on the browser and wait.
  6. The server says: "KNX link has been lost!" (and other plugins stop working, if there are any).
  7. Switch on the knx bus.
  8. The connection seems to be re-established, but it isn't: in fact, if you try to switch the light on and off from the KNX side, the openHAB server doesn't receive the events.
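The last step can be checked without watching the server console by polling the item state over REST while toggling the light from the bus. The sketch below is an assumption-laden illustration: it assumes the item state is readable as plain text at `/rest/items/Luce_1/state`, and `readState` and `collectChanges` are hypothetical helper names.

```javascript
// Sketch only: polls an item's state so a missing bus update becomes visible.
// The /state URL and all helper names are assumptions, not taken from the thread.
const STATE_URL = "http://localhost:8080/rest/items/Luce_1/state";

// Reads the current state of the item as plain text.
async function readState(fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(STATE_URL);
  return (await response.text()).trim();
}

// Calls read() `polls` times and returns the sequence of distinct
// consecutive states seen, starting with the initial state.
async function collectChanges(read, polls, delayMs = 1000) {
  const changes = [];
  let last;
  for (let i = 0; i < polls; i++) {
    const state = await read();
    if (state !== last) {
      changes.push(state);
      last = state;
    }
    if (delayMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return changes;
}
```

If the light is toggled from the bus during `collectChanges(readState, 30)` and the result still contains only the initial state, the events are not reaching openHAB, which matches the behaviour described in step 8.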

If you try the same steps with the Classic UI, it works correctly, but the server doesn't stop when you press CTRL-C: it says Stopped REST API and Stopped Classic UI, but the DOS window continues to receive knx events.

Thank you Massimo

MMax2 commented 10 years ago

Same result with OH 1.5.0. Massimo

teichsta commented 10 years ago

Hi Massimo, thanks for investigating further! Something seems to block the whole eventing mechanism here. Did you already have a look into the knx binding code? Any idea what could cause this blocking behaviour? Best, Thomas E.-E.

MMax2 commented 10 years ago

In my opinion, issue #851 is different from #1068. In issue #851 we were asking for Knx to reconnect even on startup. At present Knx will reconnect only if the first connection on startup is successful; otherwise it never reconnects. In issue #1068, instead, I noticed this strange blocking behaviour when Knx tries to reconnect. So I ask you to reopen issue #851. As far as I'm concerned, I'll try to have a look into the knx binding code to understand where the block is, but I don't have time to do so right now. So please be patient!

Snickermicker commented 10 years ago

Issue #851 differs IMHO from #1068. The binding tries to connect for the first time when it receives an update message from OSGi for its config data. When the connection fails at this point, no further attempt is started. I'm currently implementing a timer-based reconnect.

#1068 appears to be different. It seems that a timer thread tries to reconnect once the connection was lost and cannot be immediately reopened.
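The idea of a timer-based reconnect can be sketched as follows. This is not the binding's code (the binding is written in Java); it is a minimal illustration in which `startReconnect`, `connect` and `schedule` are hypothetical names, and the period plays the role of `knx:autoReconnectPeriod`.

```javascript
// Sketch only: retries connect() on a fixed period until it succeeds.
// startReconnect, connect and schedule are hypothetical names.
function startReconnect(connect, periodMs, schedule = setTimeout) {
  let attempts = 0;
  let stopped = false;

  function attempt() {
    if (stopped) return;
    attempts += 1;
    let connected = false;
    try {
      connected = connect() === true;
    } catch (err) {
      connected = false; // a failed attempt must not end the retry loop
    }
    if (!connected) {
      schedule(attempt, periodMs); // try again after the reconnect period
    }
  }

  attempt(); // the first attempt runs immediately, e.g. at startup
  return {
    stop() { stopped = true; },
    attempts() { return attempts; }
  };
}
```

The point of the design is that a failure of the very first attempt is handled the same way as a later loss of the link: both simply schedule another attempt.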

Snickermicker commented 10 years ago

I've provided a fix for #851 for branch 1.5.1, which tackles the problem of KNX being unavailable at startup for TUNNEL connections.

But I couldn't reproduce the originally described erroneous behavior (with a Siemens IP interface, though), even when I tried a fresh install (1.5.1) with your test configs. Could you (by any chance) provide a debug log based on 1.5.1?

The behavior I'm seeing is that after a connection is lost it takes a while until the IP interface gets its internal state sorted, and a few reconnects fail. After that I always get a working connection.

MMax2 commented 10 years ago

Thank you for your fix. I can't test it right away because I'm involved in another project at the moment, but when I get back to openHAB I will certainly provide the debug log. Sorry, Massimo

Snickermicker commented 10 years ago

Guess I found the real issue. Calimero seems not to be thread safe (at least not the way it's currently being used by the knx binding). When the initial connection is lost, a timer thread is started that tries to reconnect. It appears that after reconnecting, when the connection is lost (again), calimero waits for the (new) thread to terminate, which is not the intent of this thread. No idea yet how to solve this one.

MMax2 commented 10 years ago

Wow! A very hard-to-find issue, congratulations! In your opinion, is it a Calimero issue or an openHAB issue due to the way it's using Calimero?

Snickermicker commented 10 years ago

Not really sure. Could be both, since Calimero docs don't seem to touch the issue of multi-threading (at least I couldn't find anything).

LordasSmile commented 9 years ago

Hi all! Today I came across the issue described here, after I made some software updates on my fritzbox yesterday. While reviewing my openHAB server today, I found that the knx connection was lost yesterday and did not reconnect after my fritzbox was alive again.

I'm on 1.5.1. As far as I can see, the issue was removed from the 1.6.0 milestone, so I think the problem still exists in 1.6.1. Will it be analyzed further? Thank you very much!

MMax2 commented 9 years ago

I don't understand: on what version did you find the issue? 1.5.1 or 1.6.1?

LordasSmile commented 9 years ago

Hi @MMax2, I'm on 1.5.1 right now! I saw above that @teichsta removed the issue from the 1.6.0 milestone, so I thought it was still not fixed in 1.6.1. Am I right? Thx!

teichsta commented 9 years ago

Hi,

@Snickermicker sent a fix for #851 with PR https://github.com/openhab/openhab/pull/1483 for 1.5.1 back then. It seems we forgot to cherry-pick this fix into 1.6.0. I am no longer sure whether he sent a second PR for 1.6.0, too. I've asked for a short update on this.

Best, Thomas E.-E.

Snickermicker commented 9 years ago

Yes, I merged that fix into 1.6.0 with #1517. But as I wrote before, this is only partly fixing the problem. I saw a problem when connection is lost in gateway mode. At first glance it looked as if the binding is stuck sporadically in calimero lib. Contacting the calimero maintainers didn't reveal any insights. So, this is currently unfixed.

teichsta commented 9 years ago

ok … so we have the partial fix in 1.6 already, but it did not entirely fix the problem.

@Snickermicker could you please add the link to the Calimero Issue here as a Reference? Hope to get them moving a bit.

Thanks, Thomas E.-E.

Snickermicker commented 9 years ago

Sure thing: calimero #14

teichsta commented 9 years ago

Thanks! Have you had the time to follow their suggestions (check method fire())?

Snickermicker commented 9 years ago

Yes, but this didn't help me.

hmerk commented 8 years ago

@Snickermicker has this been solved in the meantime? I have not seen any activity for more than a year and would rather close this issue.