openhab / openhab1-addons

Add-ons for openHAB 1.x
Eclipse Public License 2.0
3.43k stars 1.69k forks source link

Z-Wave binding should retry connecting to the interface device if first attempt failed #3487

Open JensErat opened 9 years ago

JensErat commented 9 years ago

I'm running OpenHAB in a docker container using tdecker's image, which is working fine. But when restarting the container (instead of creating a new one), sometimes Z-Wave breaks with following log lines:

2015-11-28 23:03:20.420 [DEBUG] [.zwave.internal.ZWaveActivator:36  ]- Z-Wave binding started. Version 1.8.0.201508012226
2015-11-28 23:03:20.445 [DEBUG] [.z.internal.ZWaveActiveBinding:175 ]- ConverterHandler not initialised. Polling disabled.
2015-11-28 23:03:20.446 [INFO ] [.z.internal.ZWaveActiveBinding:325 ]- Update config, port = /dev/ttyACM0
2015-11-28 23:03:20.446 [INFO ] [.z.internal.ZWaveActiveBinding:330 ]- Update config, healtime = 4
2015-11-28 23:03:20.446 [INFO ] [.z.internal.ZWaveActiveBinding:375 ]- Update config, setSUC = false
2015-11-28 23:03:20.446 [INFO ] [.z.internal.ZWaveActiveBinding:384 ]- Update config, softReset = false
2015-11-28 23:03:20.447 [INFO ] [.z.internal.ZWaveActiveBinding:393 ]- Update config, masterController = true
2015-11-28 23:03:20.447 [DEBUG] [.z.internal.ZWaveActiveBinding:282 ]- Initialising zwave binding
2015-11-28 23:03:20.474 [INFO ] [b.z.i.protocol.ZWaveController:146 ]- Starting Z-Wave controller
2015-11-28 23:03:20.475 [INFO ] [b.z.i.protocol.ZWaveController:154 ]- Z-Wave timeout is set to 5000ms. Soft reset is false.
2015-11-28 23:03:20.475 [INFO ] [b.z.i.protocol.ZWaveController:407 ]- Connecting to serial port /dev/ttyACM0
2015-11-28 23:03:20.509 [ERROR] [b.z.i.protocol.ZWaveController:429 ]- Serial Error: Port /dev/ttyACM0 does not exist

The device is already there (ls -la /dev/ttyACM0 added in the startup script succeeds, even before OpenHAB is started), but there seem to be some race conditions, maybe because of something like connections to the stick sill open or similar problems.

This is really annoying, and I can also imagine other constellations where this would break. What about retrying connecting to the stick after a given time, maybe three times waiting 30 seconds in-between? The stick is a common Z-Wave ME ZME_UZB1.

4s1 commented 9 years ago

I ran into the same issue while writing my own openhab2 docker image.

Serial Error: Port /dev/ttyACM0 does not exist

The two things I did to fix the issue were

  1. make sure my openhab user is member of the group which has write permissions to the node /dev/ttyACM0
  2. remove any /var/lock/LCK..ttyACM0 before openhab startup (left by unclean serial shutdown?)
JensErat commented 9 years ago

As the image runs OpenHAB as root (something that really has to be changed), I'd say (1) should not be an issue. It indeed seems deleting /var/lock/LCK..ttyACM0 resolves the issue. If this is the case, I'd have to analyze why the lockfile is not removed during shutdown, and the bug report is more a kind of "bad error message that should be improved".

JensErat commented 9 years ago

Trying to reconnect still might be something to try, for example if you consider the Aeon Z-Wave stick with built-in battery to get the controller near the device to include. The Z-Wave plugin should be able to handle reconnected Z-Wave controllers.

cdjackson commented 9 years ago

While I don’t disagree, this isn’t so simple at the moment as there are issues with the java serial driver that makes this difficult.

I suggest we close this issue since it seems resolved.

JensErat commented 9 years ago

Which one is difficult -- providing reasonable error messages or the retrying-to-connect-stuff?

cdjackson commented 9 years ago

Which one is difficult -- providing reasonable error messages or the retrying-to-connect-stuff?

Well, for the error message, we just use the response from the driver. And the issue with reconnecting is that if you remove the zstick as you suggest, then the serial driver crashes, so reconnecting isn’t possible.

plajjan commented 8 years ago

I am seeing the same behaviour, i.e. OpenHAB reporting that /dev/ttyACM0 does not exist and for me it appears rather erratic with it working sometimes and sometimes it doesn't. It usually takes me a few tries to start OpenHAB. Just an hour ago I had to restart upwards of 15 times before it established a connection. I am not running OpenHAB in a container and I don't see a pattern in when it works and when it doesn't. My gut feeling is that it happens more often now then previously when I had fewer bindings loaded in OpenHAB. I did see #2760 discussing potential race conditions between bindings using serial ports and as I'm using at least one other binding that does use a serial port, this could be applicable to me but I don't have any further data.

As @cdjackson points out, it seems that OpenHAB dies completely when the USB stick is unplugged. I too have an Aeon stick so I am pulling it out every now and then to add new units to my Z-Wave network. Having OpenHAB not die would be nice. Having OpenHAB reconnect upon reinsertion of the USB stick would naturally be even nicer.

I haven't looked into the crash too much, can't we just catch an exception and trigger a reconnect (potentially timed at a later point in time)?

Anyway, I think we are dealing with two issues here.

I suggest we let this issue remain about performing connection retries until a connection can be established. If we can deal with the crash in another issue then we should automatically get connection retrying, not just at startup but continuously, when that issue is fixed.

cdjackson commented 8 years ago

On 3 Jan 2016, at 14:29, Kristian Larsson notifications@github.com wrote:

I haven't looked into the crash too much, can't we just catch an exception and trigger a reconnect (potentially timed at a later point in time)?

No - that was my previous point in the other thread (I think). From what I’ve seen, it completely bombs, so it must be a very low level issue. Currently we catch all exceptions and restart the threads if there’s been an exception within the send/receive thread. This doesn’t catch this issue which is why I believe it’s likely in the driver (i.e. outside of openhab).

I think that in general, if you’re having lots of problems with the COM port not opening, I would try and resolve this issue as this is probably not correct and may indicate another issue. I’m not saying here that I’m against retries, but this would be a bodge to solve another problem, and it’s probably also worth looking for the source of the issue on your system?

plajjan commented 8 years ago

@cdjackson, if it is some external factor preventing OpenHAB to open the serial device then yes, I agree that I should have that fixed but to me it looks like it's OpenHAB having issues.

No point in arguing based on guesses so I'm running some tests right now. I put together a dirty little shell script that continuously restarts openhab and tails the log files looking for failure/success of the zwave connection. My idea is to do this for when I only have the zwave binding loaded (as this was what I used initially with great success) and then with all other bindings/modules loaded to see if there is a difference.

If there is a difference I think it's fairly safe to conclude that OpenHAB is the issue, perhaps not the zwave module by itself but openhab as a whole. Even with the issue pinpointed to OpenHAB I agree that your argument about finding root cause still stands. Implementing reconnects in the zwave module would be a kludge if there is some actual race condition in serial device init in one or more OpenHAB modules.

plajjan commented 8 years ago

Here's the comparison of 50 starts between having just the zwave module and all my modules, which are the following:

org.openhab.action.xbmc-1.7.1.jar
org.openhab.binding.benqprojector-1.8.0-SNAPSHOT.jar
org.openhab.binding.exec-1.7.1.jar
org.openhab.binding.hue-1.8.0-SNAPSHOT.jar
org.openhab.binding.plex-1.7.1.jar
org.openhab.binding.sonos-1.7.1.jar
org.openhab.binding.xbmc-1.7.1.jar
org.openhab.binding.zwave-1.8.0-SNAPSHOT.jar
org.openhab.io.habmin-1.5.0-SNAPSHOT.jar
org.openhab.io.multimedia.tts.marytts-1.7.1.jar
org.openhab.persistence.influxdb-1.7.1.jar

And the result:

root@pompous:~# cat openhab-serial-zwave-only.log | awk '{ print $3 }' | sort | uniq -c
     50 success
root@pompous:~# cat openhab-serial-all-modules.log | awk '{ print $3 }' | sort | uniq -c
     33 failure
     17 success
root@pompous:~#

I think we can conclude that there is a problem between modules since just using zwave is 100% success. I have yet to single out the module (if there is one) that is causing the conflict.. will do as soon as I have some time :)

For reference, here's the test script, if someone else wants to do the same:

#!/bin/bash

for I in `seq 1 50`; do
        echo "Stopping OpenHAB..."
        /etc/init.d/openhab stop
        echo "Emptying zwave debug log"
        echo "" > /var/log/openhab/zwave.log
        echo "Starting OpenHAB..."
        /etc/init.d/openhab start
        echo "Checking for serial connection..."
        while true; do
                echo -n "."
                sleep 1
                egrep "Connecting to serial port" /var/log/openhab/zwave.log && echo "  Found!" && echo -n "connect $I: " >> openhab-serial.log && break
        done
        while true; do
                echo -n "."
                sleep 1
                egrep "Serial Error: Port /dev/ttyACM0 does not exist" /var/log/openhab/zwave.log && echo "failure" | tee -a openhab-serial.log && break
                egrep "Serial port is initialized" /var/log/openhab/zwave.log && echo "success" | tee -a openhab-serial.log && break
        done
done
cdjackson commented 8 years ago

This might still be the same issue with the s rial driver that caused the race condition previously as I don't think that was ever properly solved.

I'll see how hard it is to add the retry over the next week or so.

Serial ports under Java are a problem :(

Sent from my iPhone

On 4 Jan 2016, at 14:00, Kristian Larsson notifications@github.com wrote:

Here's the comparison of 50 starts between having just the zwave module and all my modules, which are the following:

org.openhab.action.xbmc-1.7.1.jar org.openhab.binding.benqprojector-1.8.0-SNAPSHOT.jar org.openhab.binding.exec-1.7.1.jar org.openhab.binding.hue-1.8.0-SNAPSHOT.jar org.openhab.binding.plex-1.7.1.jar org.openhab.binding.sonos-1.7.1.jar org.openhab.binding.xbmc-1.7.1.jar org.openhab.binding.zwave-1.8.0-SNAPSHOT.jar org.openhab.io.habmin-1.5.0-SNAPSHOT.jar org.openhab.io.multimedia.tts.marytts-1.7.1.jar org.openhab.persistence.influxdb-1.7.1.jar And the result:

root@pompous:~# cat openhab-serial-zwave-only.log | awk '{ print $3 }' | sort | uniq -c 50 success root@pompous:~# cat openhab-serial-all-modules.log | awk '{ print $3 }' | sort | uniq -c 33 failure 17 success root@pompous:~# I think we can conclude that there is a problem between modules since just using zwave is 100% success. I have yet to single out the module (if there is one) that is causing the conflict.. will do as soon as I have some time :)

For reference, here's the test script, if someone else wants to do the same:

!/bin/bash

for I in seq 1 50; do echo "Stopping OpenHAB..." /etc/init.d/openhab stop echo "Emptying zwave debug log" echo "" > /var/log/openhab/zwave.log echo "Starting OpenHAB..." /etc/init.d/openhab start echo "Checking for serial connection..." while true; do echo -n "." sleep 1 egrep "Connecting to serial port" /var/log/openhab/zwave.log && echo " Found!" && echo -n "connect $I: " >> openhab-serial.log && break done while true; do echo -n "." sleep 1 egrep "Serial Error: Port /dev/ttyACM0 does not exist" /var/log/openhab/zwave.log && echo "failure" | tee -a openhab-serial.log && break egrep "Serial port is initialized" /var/log/openhab/zwave.log && echo "success" | tee -a openhab-serial.log && break done done — Reply to this email directly or view it on GitHub.

plajjan commented 8 years ago

Just googling for a few permutations of "rxtx", "crash", "java serial" and similar I see quite a few recommendations to use jSSC and there are examples of other projects, like Arduino, switching from rxtx to jssc due to the unmaintained and bad shape rxtx is in. Is this something we would consider as well?

cdjackson commented 8 years ago

I'd need to check the constraints but I looked at a number of options about 12 months back and there isn't an ideal solution. Many libraries have compatibility issues (ie only supported on certain platforms) or aren't maintained, or have license issues...

I'll have another look at jSSC as I'm not against changing if there's a better solution.

Sent from my iPhone

On 4 Jan 2016, at 14:06, Kristian Larsson notifications@github.com wrote:

Just googling for a few permutations of "rxtx", "crash", "java serial" and similar I see quite a few recommendations to use jSSC and there are examples of other projects, like Arduino, switching from rxtx to jssc due to the unmaintained and bad shape rxtx is in. Is this something we would consider as well?

— Reply to this email directly or view it on GitHub.

plajjan commented 8 years ago

I ran another test series, this time with the zwave binding and benq;

root@pompous:~# cat openhab-serial-zwave-benq.log | awk '{ print $3 }' | sort | uniq -c
     26 failure
     24 success
root@pompous:~#

So we can see that using benq together with zwave certainly lowers the odds of success but not as much as with all the modules. This is roughly 50/50. These are the only two modules I'm using that I have configured to speak over serial devices but perhaps one of the other modules include some serial snippet that starts to compete as well because with all modules loaded I'm closer to 1/3 (17 success out of 50).

Also, I haven't looked at the success rate of the benq binding with the zwave binding loaded.

cdjackson commented 8 years ago

Sounds like the same race condition to me - not surprising as it wasn't fixed previously.

Creating a retry in the zwave binding will potentially solve the issue for Zwave but of course won't fix the problem for other bindings using the serial services.

Really this needs a solution at source - either fix the driver or we all change to a new library....

Sent from my iPhone

On 4 Jan 2016, at 15:03, Kristian Larsson notifications@github.com wrote:

I ran another test series, this time with the zwave binding and benq;

root@pompous:~# cat openhab-serial-zwave-benq.log | awk '{ print $3 }' | sort | uniq -c 26 failure 24 success root@pompous:~# So we can see that using benq together with zwave certainly lowers the odds of success but not as much as with all the modules. This is roughly 50/50. These are the only two modules I'm using that I have configured to speak over serial devices but perhaps one of the other modules include some serial snippet that starts to compete as well because with all modules loaded I'm closer to 1/3 (17 success out of 50).

Also, I haven't looked at the success rate of the benq binding with the zwave binding loaded.

— Reply to this email directly or view it on GitHub.

plajjan commented 8 years ago

I agree with this, but I would like to have reconnects as well since that would allow us to start working again after unplug/replug of USB stick (when doing Z-wave inclusion of new devices) :)

cdjackson commented 8 years ago

No. This won't work. When you unplug the stick oh crashes and needs to be restarted. This is will need to be solved first and I believe this is a bug in the serial driver.

Sent from my iPhone

On 5 Jan 2016, at 23:30, Kristian Larsson notifications@github.com wrote:

I agree with this, but I would like to have reconnects as well since that would allow us to start working again after unplug/replug of USB stick (when doing Z-wave inclusion of new devices) :)

— Reply to this email directly or view it on GitHub.

plajjan commented 8 years ago

I know it won't work now but once rxtx is fixed or replaced by jssc it will work :)

Just saying I want two activities:

Did you have a chance to look at jssc? I might give it a shot to implement this for a module to try it out. Would probably start with the Benq projector binding since I wrote the serial transport for that.

cdjackson commented 8 years ago

I think we should separate these out then as you're now asking for something quite different. Reconnecting after removing the stick to include more devices is not the same as retrying initial connection and will require more re initialisation of the binding.

I looked at jSSC briefly - I noted it was quite old (not maintained for 2 years I think). I need to look at it further though...

Sent from my iPhone

On 6 Jan 2016, at 10:37, Kristian Larsson notifications@github.com wrote:

I know it won't work now but once rxtx is fixed or replaced by jssc it will work :)

Just saying I want two activities:

fix rxtx crash or replace with jssc provided it looks better overall do reconnect attempts whenever connection is lost Did you have a chance to look at jssc? I might give it a shot to implement this for a module to try it out. Would probably start with the Benq projector binding since I wrote the serial transport for that.

— Reply to this email directly or view it on GitHub.

cdjackson commented 8 years ago

I've had a read through the open issues with jSSC, and there are quite a few very worrying ones. It seems to me that there's no support - the maintainer made the following statement 1 year ago (but there have been no updates to code for 2 years it seems) -:

Hi guys =) The project is not abandoned, and I try to support the critical things. At this moment I think that jSSC (2.8.0) is a stable lib, and I use it by my self in production day by day. I know about jSSC weak points, on my opinion these are critical issues:

Default states for RTS\DTR lines. jSSC now included in Arduino IDE and in Processing 2.x, and I know it's really important thing to support. For example for prevent Arduino board restart on port opening. Methods for reading data with a correctly formed timeouts. At this moment jSSC use requests for getting input buffer bytes count in loop, and it make a hi CPU load. A good SerialPortEventListener native implementation for NIX systems. At this moment listener for NIX systems use a loop of different requests and also make a hi CPU load. Windows version use a 'WaitCommEvent' function, but *NIX has no equivalent function. Support JDK6, yes, some commits contains additional functionality Java7/8, but I think that it's syntactic sugar. For example 'Closable' interface, yes it's great but I think that developers should not miss library support on many platforms in cause of this sugar. I try to make some experiments for find solutions for this issues. But as I understand the native part should be completely modified, and may be it will be not compatible with current version. About some commits with Java7/8 support I think that they should be in additional package, not in main jSSC library to make it compatible with JDK6. For additions it will be cool to have a SerialPortExt class for example in some jSSC extension package. It's very important for production projects because where a many internal software that works under Java 4/5/6 (jSSC-2.8.0 for example support Java 5).

On my opinion jSSC core need a deep refactoring and first of all for native part and surely with listed points support =)

It seems that the support on linux is poor (ie it seems it uses polling) and I suspect if this is still the case (and I assume it is since there has been no activity), it will be a showstopper for OH since many people use low end linux distros (eg Debian on RPi). Having a high CPU due to polling is (IMHO) a non-starter as the Pi is already low end. (There is also an open issue specifically saying 100% CPU load on ARM).

I also see that @maggu2810 has looked at this around the time we were looking at alternatives due to the rate condition. Maybe he can also comment?

I would not be keen to move to a different library that is not well used/supported. Yes, at the moment, there are a few issues, and while this is not good, and clearly a significant issue for those experiencing the issues, but it is stable and usable for most people. I would be concerned that moving to another library with known issues will lead to further problems - at least in the short term so we should be clear that whatever we do is a positive move before we make changes...

As I said earlier - Serial comms in Java is not well supported :(

dvanherbergen commented 8 years ago

AFIAK openHAB uses https://github.com/NeuronRobotics/nrjavaserial for the serial transport. This is an actively maintained fork from RXTX. If there is an issue, it can be raised there. At the very least, you could try using the latest version 3.11.0. openHAB is still at version 3.9.3.

cdjackson commented 8 years ago

Yes, that's correct. The point that we're making here is we're looking for an alternative (as we did 12 months back) as there are problems with the existing library...

maggu2810 commented 8 years ago

Hi @cdjackson, I have a look at the code of jSSC long time ago. There has been memory leaks etc. in the native code for linux systems. I tried to do some work on that, but as the projects seems to be unmaintained, I stopped any further work on this.

I also tried (that time) some work on nrjavaserial, but the code is very blown and I believe that this has historic reasons. Not easy to implement new stuff without finding someone who knows why the old (forked) code is designed such way. The thing I wanted to change in the library has been that there should be an easy / working mechanism to use a non-well known serial port without changing a system property (race condition candidate).

maggu2810 commented 8 years ago

IIRC: I have also realized that jSSC don't use the select system call (or something similar) to read stuff from the serial line. So they waste a lot of CPU time using polling. I started to change that but I stopped it caused by the "dead" (or stalled) upstream development (and I could work on the native code of linux only).

plajjan commented 8 years ago

Okay, but then it should be a matter of opening a bug with nrserial. I don't know why I got the idea we are using rxtx.. sorry about that.

maggu2810 commented 8 years ago

Have you read https://dev.eclipse.org/mhonarc/lists/iot-wg/msg00366.html and the following replies e.g. https://dev.eclipse.org/mhonarc/lists/iot-wg/msg00369.html I have not done any action here, but as this seems to be unmaintained to (but well licensed) we could fork the source code and create (perhaps) a new project to maintain this serial driver ourself. Don't know if someone would like to work on this, but the license of the rxtx / nrjavaserial is at least a blocker for ESH.

cdjackson commented 8 years ago

I would certainly be keen to see something ESR compatible. I discussed with Kai a while ago trying to get a direct USB driver working, but this didn't work out. As you say though, the current solutions are not workable for ESH.

We "just" need to find @someone to work on it :)

maggu2810 commented 8 years ago

I have added the dk.comm stuff to a Github repo of my own, so I can build it as a separate project. I did also some pom.xml and so Manifest cleanup etc.

Will see when I will find some time to do further testing, bugfix, etc.

IOOOTAlan commented 8 years ago

Greetings, I know that this suggestion cannot be used in the current openhab codebase because it currently supports Java 7, but has anyone ever looked at the Device IO API introduced in OpenJDK 8?