openhab / openhab-addons

Add-ons for openHAB
https://www.openhab.org/
Eclipse Public License 2.0
1.86k stars 3.58k forks source link

[rfxcom] JVM crashes when using socat/ser2net #3308

Closed curlyel closed 3 years ago

curlyel commented 6 years ago

I’m expirencing openHAB 2.2 crashing silently without exceptions or any useful entries in the logs. During the work with openHAB 2.2 I came across three ways to "trigger" the issue, to let openHAB die (restart) without any message in the openhab.log:

1. Write out a modified .things file for example for RFXCom (rfxcom.things)

There’s nothing special in, see:

Bridge rfxcom:bridge:rfx1 [ serialPort="/dev/rfxcom", disableDiscovery=true] {
    Thing rain                          RainOregon      [ deviceId="36352", subType="PCR800" ]
    Thing temperaturerain               RainAlecto      [ deviceId="61441", subType="WS1200" ]
    Thing temperaturehumidity           THKindZi        [ deviceId="37122", subType="THGN122/123" ]
    Thing temperaturehumidity           THSchlaZi       [ deviceId="1284", subType="THGN122/123" ]
    Thing temperaturehumidity           THDachSt        [ deviceId="56068", subType="THGN122/123" ]
    Thing temperaturehumidity           THKuehlschrank  [ deviceId="65028", subType="THGN132" ]
    Thing temperaturehumidity           THBuero         [ deviceId="65284", subType="THGN132" ]
    Thing temperaturehumidity           THBadOG         [ deviceId="60930", subType="THGN132" ]
}

2. Restart one of the RF-based bindings (zwave, zigbee, rfxcom) through Karaf-console

openhab> bundle:restart <id> In about apx. one out of three times to restart a bundle, openhab crashed silently

3. restarting openhab via systemctl

Sure, this will restart openhab :wink: , But, when using # systemctl restart openhab2.service to restart the openhab service, openhab will start as usual, loads the models, initialize things, starts updating items. But then, after 3 to 5 minutes, the silent crash is occuring leading to a (sometimes infinite) restart loop. To break this loop, I had to:

/# systemctl stop openhab2.service
(.... wait few minutes ....)
/# systemctl start openhab2.service

System

OpenHABian-script applied to stock Debian9 on a x86 based embedded system, 4GB ram, m-sata ssd (https://www.pcengines.ch/apu2c4.htm)

The system is dedicated to openHAB. Just openHAB, influx, grafana is running on it. No further services…

# uname -a
 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux
# java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
Openhab is 2.2-release.

USB dongles for the RF-based bindings are connected via ser2net/socat to a RPi putting them in a better rf position. Ethernet is wired to the remote Pi (no wifi).

Used bindings

openhab> bundle:list | grep Binding
111 │ Active   │  80 │ 0.10.0.b1              │ Eclipse SmartHome AutoUpdate Binding
112 │ Active   │  80 │ 0.10.0.b1              │ Eclipse SmartHome Core Binding XML
209 │ Active   │  80 │ 0.10.0.b1              │ Astro Binding
210 │ Active   │  80 │ 0.10.0.b1              │ Eclipse SmartHome hue Binding
211 │ Active   │  80 │ 0.10.0.b1              │ Sonos Binding
225 │ Active   │  80 │ 2.2.0                  │ Exec Binding
226 │ Active   │  80 │ 1.11.0                 │ openHAB Expire Binding
227 │ Active   │  80 │ 1.11.0                 │ openHAB FritzboxTr064 Binding
228 │ Active   │  80 │ 2.2.0                  │ HarmonyHub Binding
229 │ Active   │  80 │ 2.2.0                  │ Network Binding
230 │ Active   │  80 │ 1.11.0                 │ openHAB NetworkUpsTools Binding
231 │ Active   │  80 │ 2.2.0                  │ Rfxcom Binding
232 │ Active   │  80 │ 2.2.0                  │ Systeminfo Binding
234 │ Active   │  80 │ 2.2.0                  │ ZWave Binding
248 │ Active   │  80 │ 1.11.0                 │ openHAB KNX Binding
249 │ Active   │  80 │ 2.3.0.201801190819     │ ZigBee Binding

Persistence

influxdb and mapdb

wborn commented 6 years ago

openHAB could be crashing due to a segmentation fault in a library. If so it will not log to the default log files but there will be a hs_err_pid*.log file with additional information probably in /var/lib/openhab2. Can you can drag and drop the file into a comment so it gets attached?

The following system log files could also provide additional details:

curlyel commented 6 years ago

Hi @wborn, thanks for looking into it. Currently I have quite limited access to my installation. But I managed to get hold of the hs_err_pid*.log you mentioned. See:

hs_err_pid519.log hs_err_pid510.log

If some more information is required, just let me know. Appologies upfront for some delayed response since I'm on travel now. I promise to reply immediately after my return. ;-)

wborn commented 6 years ago

The logging shows that indeed openHAB crashes due to a problem in native code. It doesn't seem to be caused by the native code of some openHAB binding library. Maybe you can check if the system log files (syslog or kern.log) provide additional information? Happy travels!

curlyel commented 6 years ago

Thanks @wborn for checking the provided logs. As promised, I've tried now to gather more information about it. So I've triggered the crash through the write of the unmodified rfxcom.things file. Openhab crashed immediatly as expected without any helpful in the openhab.log or events.log. But the syslog shows indeed something related to it. /var/log/syslog:

Mar 31 16:28:02 Openhab karaf[22938]: /dev/rfxcom: No such file or directory
Mar 31 16:28:04 Openhab karaf[10143]: Can't connect to the container. The container is not running.
Mar 31 16:28:04 Openhab systemd[1]: openhab2.service: Control process exited, code=exited status=1
Mar 31 16:28:04 Openhab systemd[1]: openhab2.service: Unit entered failed state.
Mar 31 16:28:04 Openhab systemd[1]: openhab2.service: Failed with result 'exit-code'.
Mar 31 16:28:09 Openhab systemd[1]: openhab2.service: Service hold-off time over, scheduling restart.
Mar 31 16:28:09 Openhab systemd[1]: Stopped openHAB 2 - empowering the smart home.
Mar 31 16:28:09 Openhab systemd[1]: Started openHAB 2 - empowering the smart home.

Well, access to /dev/rfxcom is not possible anymore. Seems, that re-read the rfxcom.things leads to a close of the (remote) serial device /dev/rfxcom. This is created via socat using this script:

#!/bin/bash
LOCALPORT=/dev/rfxcom
REMOTEPORT=192.168.XXX.YYY:10001

while sleep 10 # infinite loop for checking socat connected to REMOTEPORT is still running
do
        if [[ $(ps ax | grep -v grep | grep tcp:$REMOTEPORT | wc -l) -eq 0 ]]
        then
                # restarting socat
                /usr/bin/socat pty,link=$LOCALPORT,echo=0,raw,waitslave,group=dialout,mode=660 tcp:$REMOTEPORT &
        fi
done

Finally, this is executing the following inside an infinite loop:

/usr/bin/socat pty,link=/dev/rfxcom,echo=0,raw,waitslave,group=dialout,mode=660 tcp:192.168.XXX.XXX:10001

... while the "real" serial devices are on a RPi with read-only filesystem running ser2net. /etc/ser2net.conf:

10001:raw:0:/dev/rfxcom:38400 8DATABITS NONE 1STOPBIT
10002:raw:0:/dev/zwave:115200 8DATABITS NONE 1STOPBIT
10003:raw:0:/dev/zigbee:115200 8DATABITS NONE 1STOPBIT

So if the re-read of the rfxcom.things files closes the serial device which leads to breaking the socat, the virtual (remote) serial device is not available for aprx. 10 seconds in my case. Something is not able to handle this "lost" serial device.

I know, especially the RFXCom binding supports direct connection to ser2net - but others don't AFAIK. You guessed already that I'm using the same method for the ZWave and the Zigbee binding as well. Until now, this was a quite mature way to support/connect remote serial devices. Would be fine to get this fixed or work around the crash at least ;-)

Happy travels!

Thanks ;-)

wborn commented 6 years ago

The JVM crashes may indeed be related to using socat/ser2net. I'm not using such a setup myself so it's difficult to troubleshoot for me. Looks like you already asked the community for help as well.

RFXCom binding supports direct connection to ser2net

If such compatibility is binding specific I would not expect complete JVM crashes but rather communication issues with incompatible bindings. Does it also crash when you just restart the RFXCom binding?

Until now, this was a quite mature way

Do you know what parameters changed in your setup when the crashes first started?

While looking for known socat issues, I found @kgoderis (https://github.com/NeuronRobotics/nrjavaserial/issues/96) is also using socat. Did your openHAB ever crash due to using socat? If so, did you find a solution for it?

kgoderis commented 6 years ago

@wborn @curlyel Nope - It didn't, but I struggled a lot in order to find a workable solution for the poor support for serial port based communication in Java. I ended up running socat in a separate process

/usr/bin/socat -v TCP-LISTEN:9000 /dev/ttyUSB0,raw,echo=0

and I rewrote the binding so that it supports TCP/IP based communication (e.g. it connects to port 9000 on the openHAB (local)host. This not the same setup as @curlyel but rewriting a binding to support TCP/IP it not that hard, and I considered this to be worthwhile in order to get rid of all these serial port problems in java (including the buggy native code this requires on almost any platform).

(Side note : @curlyel I run socat using systemd instead of a plain script as you do )

curlyel commented 6 years ago

Thanks @wborn, @kgoderis for sharing your thoughts on this. This in mind, I did some more testing: I've changed the RFXcom bridge to tcp:

//Bridge rfxcom:bridge:rfx1 [ serialPort="/dev/rfxcom", disableDiscovery=true] {
Bridge rfxcom:tcpbridge:rfx1 [ host="192.168.XXX.YYY", port=10001, disableDiscovery=true] {
    Thing rain                          RainOregon      [ deviceId="36352", subType="PCR800" ]
        ....
    Thing temperaturehumidity           THBadOG         [ deviceId="60930", subType="THGN132" ]
}

FTR: Ser2net config on the remote RPi is still the same, just removed the local socat session for Rfxcom and configured the binding to use TCP directly...

Doing so, the immediate crash when writing the changed/unchanged rfxcom.thing is gone. Even a restart of the Rfxcom binding through Karaf console as well as complete restart of openHAB can not cause the crash anymore.

More than this: With a restart of the Zigbee and the ZWave binding I'm not able to trigger it anmore (these two are still on the socat/ser2net remote serial ports since they currently have no built-in direct TCP support). Seems that I have wrongly suspected the Zigbee/ZWave binding causing it - sorry @cdjackson ;-)

It looks to me, that the serial port handling of just the Rfxcom binding was causing the behaviour. Don't see, why the both other serial connected bindings are affected by this too.

Anyway, using the TCP based communication for the RFXCom brought me at least a work around for the:

the poor support for serial port based communication in Java

Mmh, don't know, how we should proceed from here... Yes, I do have a solution for my case. But the shaky serial support is still there.

Should we ask for optional direct TCP support even for local serial port access (as Karel did)? This would give us the option to circumvent some (local) serial issues and give a convinient way for connecting to remote serial ports via ser2net. WDYT?

Jamstah commented 3 years ago

I've been playing with this recently, including starting and stopping bundles etc, and had no issues.

Is this still an issue for you?

hmerk commented 3 years ago

closed as no more relevant