openhab / openhab-core

Core framework of openHAB
https://www.openhab.org/
Eclipse Public License 2.0
931 stars 429 forks source link

Serial ports getting blocked after some re-connecting #1842

Closed fwolter closed 2 years ago

fwolter commented 4 years ago

When using the Modbus binding with OH3, the serial port with the Modbus RS485 transciever cannot be opened anymore after some time.

When the error appears, the following is printed on stdout: initialise_event_info_struct: initialise failed!

This is a known bug NeuronRobotics/nrjavaserial#111.

After compiling the native libs with DISABLE_LOCKFILES as suggested in above issue, the error is gone.

This has been discussed in https://community.openhab.org/t/modbus-binding-loses-its-serial-connection-every-few-days-without-a-reason-after-updating-to-2-5-0-release-build/89550/14

Maybe locking should be disabled in OH nrjavaserial again?

wborn commented 4 years ago

For me this issue is not that critical wanting to start forking nrjavaserial again. :wink: Forking nrjavaserial implies all other native libraries need to be recompiled as well (linux/armhf/arm64) and you'd have to merge any upstream changes to the fork.

fwolter commented 4 years ago

I understood you == me :) I'd do that if the pain gets too big. Until now there is only one user beside me facing the problem. Maybe @MrDOS manages to re-work the locking mechanism in the next time.

jumblejj commented 3 years ago

Hi, Also seeing this issues with OH3, see https://community.openhab.org/t/modbus-binding-not-working-on-oh3/111619

ssalonen commented 3 years ago

Reconnect issue discussed in more detail in this post https://community.openhab.org/t/modbus-binding-not-working-on-oh3/111619/26?u=ssalonen

It gets stuck at

at gnu.io.RXTXPort.waitForTheNativeCodeSilly(RXTXPort.java:958)

at gnu.io.RXTXPort.(RXTXPort.java:180)

at gnu.io.RXTXCommDriver.getCommPort(RXTXCommDriver.java:983)

Looking at latest code from github, it looks like MonitorThreadLock is never "released", leading to the deadlock.

https://github.com/NeuronRobotics/nrjavaserial/blob/40587564401997d31c1f2a73b20a14cde2f04cf7/src/main/java/gnu/io/RXTXPort.java#L955

Is there workaround we could implement in the binding side?

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/modbus-binding-problem-with-serial-connection/117942/7

ssalonen commented 3 years ago

@fwolter could you enlighten whether the following workaround would help in the binding?

Current behaviour

New behaviour

ssalonen commented 3 years ago

Comment from user in community https://community.openhab.org/t/modbus-binding-problem-with-serial-connection/117942/8?u=ssalonen shows how frustrating the nrjavaserial issue is in practice:

I read all those threads. I gave up with this binding. It’s not usable.

I have written my own rules to query the modbus unit using the modpoll command line utility. Works perfectly and absolutely reliable 100%.

fwolter commented 3 years ago

@fwolter could you enlighten whether the following workaround would help in the binding?

Current behaviour

* when I/O error is encountered, serial port is closed and re-opened before next retry

New behaviour

* when I/O error is encountered, serial port is kept open

Could be, but I often observed the error right after the start of the JVM.

I agree that this problem is very annoying. A "quick" solution would be to disable the locks in general.

ssalonen commented 3 years ago

Potentially related observation with other binding (DSMR) https://community.openhab.org/t/serial-communication-stops-working-oh3/117203

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/smartmeter-binding-throws-unable-to-open-port-after-a-while/94349/20

MikeTheTux commented 3 years ago

I'm also affected by this problem when using the SmartMeter binding.

I'm running a total of 4 USB2Serial devices, 3 FTDI, 1 cp210x. The 3 FTDI from OH3.0.1, the single cp210x via zigbee2mqtt.

mike@nas:~$ usb-devices
T:  Bus=01 Lev=02 Prnt=03 Port=00 Cnt=01 Dev#=  9 Spd=12  MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=0403 ProdID=6001 Rev=06.00
S:  Manufacturer=FTDI
S:  Product=AMB8465 AMBER USB Stick
S:  SerialNumber=Www
C:  #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=90mA
I:  If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=ftdi_sio

T:  Bus=01 Lev=02 Prnt=03 Port=01 Cnt=02 Dev#= 12 Spd=12  MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=0403 ProdID=6015 Rev=10.00
S:  Manufacturer=FTDI
S:  Product=FT230X Basic UART
S:  SerialNumber=Xxx
C:  #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=90mA
I:  If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=ftdi_sio

T:  Bus=01 Lev=02 Prnt=03 Port=02 Cnt=03 Dev#= 13 Spd=12  MxCh= 0
D:  Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=10c4 ProdID=ea60 Rev=01.00
S:  Manufacturer=Silicon Labs
S:  Product=slae.sh cc2652rb stick - slaesh's iot stuff
S:  SerialNumber=Yyy
C:  #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=100mA
I:  If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=00 Prot=00 Driver=cp210x

T:  Bus=01 Lev=01 Prnt=01 Port=02 Cnt=03 Dev#= 11 Spd=12  MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=0403 ProdID=6001 Rev=06.00
S:  Manufacturer=FTDI
S:  Product=AMB8465-M AMBER USB Stick
S:  SerialNumber=Zzz
C:  #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=90mA
I:  If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=ftdi_sio

Can I provide some debugging info in order to nail down the root cause?

If blocked, the only solution that I found to unblock the serial is via a restart of Openhab. Sometimes I even need to do some magic on the interface inbetween the restart. Is there a better way, without a restart of OH?

If the issue is not found, I need to go back to my previous, stable solution: read SmartMeter out via an external script and pump the data into OH3 via MQTT.

digitaldan commented 3 years ago

So I am seeing this as well now in the Z-wave binding on our nightlies, its suddenly stops working, hanging on this line:

2021-04-16 11:52:46.858 [DEBUG] [g.openhab.binding.zwave.handler.ZWaveSerialHandler] - Connecting to serial port '/dev/ttyACM0'

but i can't figure out why, other then the serial thread looks to be waiting here:

    at java.base@11.0.9/java.lang.Thread.sleep(Native Method)
    at gnu.io.RXTXPort.waitForTheNativeCodeSilly(RXTXPort.java:958)
    at gnu.io.RXTXPort.<init>(RXTXPort.java:180)
    at gnu.io.RXTXCommDriver.getCommPort(RXTXCommDriver.java:983)
    at gnu.io.CommPortIdentifier.open(CommPortIdentifier.java:467)
    at org.openhab.core.io.transport.serial.internal.SerialPortIdentifierImpl.open(SerialPortIdentifierImpl.java:53)
    at org.openhab.binding.zwave.handler.ZWaveSerialHandler.watchSerialPort(ZWaveSerialHandler.java:108)
    at org.openhab.binding.zwave.handler.ZWaveSerialHandler$$Lambda$1283/0x00000001011f7440.run(Unknown Source)
    at java.base@11.0.9/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base@11.0.9/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base@11.0.9/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base@11.0.9/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base@11.0.9/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base@11.0.9/java.lang.Thread.run(Thread.java:834)

Whats odd, is that if i install a fresh system, so no configuration, and add the zwave controller it seems ok, but at some point after the system runs for a while, the binding stops being able to connect after a restart. The problem i'm having is that it will no longer connect again after any amount of restarts.

I don't see other bindings competing for the port in the thread stack, so i don't have any explanation why it suddenly stop being able to reconnect, even among restarts. It's as if there is a lock file somewhere, but i don't really think thats the case.

fwolter commented 3 years ago

The probability this error occurs rises with the number of times the serial port is re-opened. I don't know whether the Z-Wave binding re-opens the port on communication errors, but a re-open can also be forced by EMI or bad connection between USB port and stick. You could take a look in the kernel log if there are any re-connect messages.

digitaldan commented 3 years ago

You could take a look in the kernel log if there are any re-connect messages.

No, there's no messages in my system logs (trust me, i have been scouring my system for clues as this has been very frustrating ), and at some point after adding enough bindings back into the mix on a clean install, this happens when OH starts so there no chance for it to reconnect, it never completes a connect in the first place.

I'm also seeing the same message initialise_event_info_struct: initialise failed! in the startup log at the same time the zwave log is printing Connecting to serial port '/dev/ttyACM0' .

I'm still messing around with adding and removing bindings to figure out the tipping point, but i'm guessing this is a race condition that gets triggered by load or binding start order as i don't have any other bindings that use the serial libraries, and it only happens as i add the 15-20 bindings i use back into my configuration.

I have also occasionally seen this message RXTX Warning: Removing stale lock file. /var/lock/LCK..ttyACM0 right before the message initialise_event_info_struct: initialise failed! which leades me to believe nrjavaserial may not be cleaning up properly or possibly crashing at some point ?

Whats very strange is there is nothing else new here on my system, other then using the latest nightly builds.

ssalonen commented 3 years ago

@digitaldan the thinking is actually that the issue is with nrjavaserial alone. All you need is one binding using serial.

The way nrjavaserial is compiled for OH3 is different compared to OH2, something that might trigger this.

https://github.com/openhab/openhab-core/issues/1842#issue-748339647

I suspect that all bindings that use serial suffer from this. Since the root cause is quite intractable, I personally introduced a workaround for modbus binding that seems to help, https://github.com/ssalonen/openhab-core/commit/1dc4094b770242c626d784c3482552cad729b59. Basically the workaround is that we avoid close/reconnect cycle with the serial port. Instead, we just keep the port open no matter what.

fwolter commented 3 years ago

it only happens as i add the 15-20 bindings i use back into my configuration.

My observation is, that it doesn't seem to depend on the number of bindings. I suffered from the issue with only two bindings installed (modbus and a custom) after every reboot on some days. So, the system was unsable. On other days, it worked after a reboot reproducibly for hours, days or sometimes even weeks, without changing anything.

My setup works flawlessly for months now, after disabling the locks in nrjavaserial.

digitaldan commented 3 years ago

@digitaldan the thinking is actually that the issue is with nrjavaserial alone. All you need is one binding using serial. The way nrjavaserial is compiled for OH3 is different compared to OH2, something that might trigger this.

1842 (comment)

Yeah, thats what i was trying to say, that i don't have any other serial bindings, and yes, of course I read the first comment in this issue, thanks.

My observation is, that it doesn't seem to depend on the number of bindings.

The only way this repeats for me is when my load profile changes, then its nearly 100% constant. This is at least my observation after trying this many, many times. I'm sure its probably different for others. I have no idea why that is, but I can reproduce it reliability.

My setup works flawlessly for months now, after disabling the locks in nrjavaserial.

I'm going to give that a shot as well. I'm a little concerned that this ends up becoming a bigger issue for us now that it looks to affect z-wave and other serial binding users. Was there a bump in the nrjavaserial version we think caused this? It seems to me to be a recent development.

ssalonen commented 3 years ago

Yeah, thats what i was trying to say, that i don't have any other serial bindings, and yes, of course I read the first comment in this issue, thanks.

@digitaldan yeah, did not mean to be snarky -- I was on mobile and missed the fact that it was indeed the first post, hah

Was there a bump in the nrjavaserial version we think caused this? It seems to me to be a recent development.

What do you mean by recent development? The "working hypothesis" is that the issue was introduced in OH3. In OH2, openHAB used a "forked" version of nrjavaserial which had the library locking feature disabled (it's a compilation flag of nrjavaserial). openHAB 3 uses upstream nrjavaserial which has the library locking feature enabled. I presume also the version has been bumped with potential changes, most likely including fixes but potentially also regressions.

I think this is the PR introducing the OH3 "locking behaviour" https://github.com/openhab/openhab-core/pull/1426

With the upgraded version the library will no longer crash the Java 11 VM on Windows.

The native libraries now use the built in RXTX lockfiles instead of the previously used (OS dependent) liblockdev whereas the OH build didn't use any lockfiles at all.

It seems that serial port disconnect/re-connect has been troublesome also before, see e.g. https://github.com/openhab/openhab-core/pull/1503 motivated by https://github.com/openhab/org.openhab.binding.zwave/pull/1338

With this version it will no longer completely exit exclamation the JVM in certain scenarios when ports are closed/reopened.

berlic commented 3 years ago

Hi, I'm in the same boat after upgrading to OH3. Modbus polls hangs after half a day or so. Had no problems with OH2 and the very same hardware.

Not sure if this of any help, but here are some observations.

I noticed a lot of deleted lock files for an openhab process (example):

java    9716 openhab  255u      REG       0,22         0   7911 /run/lock/tmpQtqAl9 (deleted)
java    9716 openhab  256u      REG       0,22         0   7913 /run/lock/tmp8nA4TH (deleted)
java    9716 openhab  257u      REG       0,22         0   7915 /run/lock/tmpBPaDnw (deleted)

When modbus binding hangs the total number of files shown by lsof is slightly more than 1024, so I initially thought that there might be some ulimit issue or smth; but it seems that systemd increases nofile limits for openhab process.

I tried the jar-file from https://github.com/openhab/openhab-core/pull/2272, and it seems that the number of lock-files is not growing. Thanks @ssalonen, I'll try it for a longer run.

After reverting org.openhab.core.io.transport.modbus back to original 3.0.1 version and watching for lsof I see additional set of lock files after every modbus transmission error like this:

2021-04-22 11:01:07.390 [WARN ] [rt.modbus.internal.ModbusManagerImpl] - Try 1 out of 3 failed when executing request (ModbusReadRequestBlueprint@f2a815[slaveId=109,functionCode=READ_MULTIPLE_REGISTERS,start=121,length=1,maxTries=3]). Will try again soon. Error was I/O error, so reseting the connection. Error details: net.wimpi.modbus.ModbusIOException I/O exception: IOException Error reading response (EOF) [operation ID e3a53c0e-b2fb-44b0-87a0-0d31a10bd068]

Doing inotifywait on /run/lock gives this after each error:

10:54:34 LCK..ttyUSB0 OPEN
10:54:34 LCK..ttyUSB0 ACCESS
10:54:34 LCK..ttyUSB0 CLOSE_NOWRITE,CLOSE
10:54:34 LCK..ttyUSB0 DELETE
10:54:34 tmpvviHME CREATE
10:54:34 tmpvviHME OPEN
10:54:34 tmpvviHME MODIFY
10:54:34 tmpvviHME OPEN
10:54:34 tmpvviHME CLOSE_WRITE,CLOSE
10:54:34 tmpvviHME DELETE
10:54:34 LCK..ttyUSB0 CREATE
10:54:34 LCK..ttyUSB0 OPEN
10:54:34 LCK..ttyUSB0 MODIFY
10:54:34 LCK..ttyUSB0 CLOSE_WRITE,CLOSE
10:54:34 LCK..ttyUSB0 OPEN
10:54:34 LCK..ttyUSB0 ACCESS
10:54:34 LCK..ttyUSB0 CLOSE_NOWRITE,CLOSE
10:54:34 LCK..ttyUSB0 DELETE
10:54:34 tmpmrxXDm CREATE
10:54:34 tmpmrxXDm OPEN
10:54:34 tmpmrxXDm MODIFY
10:54:34 tmpmrxXDm OPEN
10:54:34 tmpmrxXDm CLOSE_WRITE,CLOSE
10:54:34 tmpmrxXDm DELETE
10:54:34 LCK..ttyAMA0 CREATE
10:54:34 LCK..ttyAMA0 OPEN
10:54:34 LCK..ttyAMA0 MODIFY
10:54:34 LCK..ttyAMA0 CLOSE_WRITE,CLOSE
10:54:34 LCK..ttyAMA0 OPEN
10:54:34 LCK..ttyAMA0 ACCESS
10:54:34 LCK..ttyAMA0 CLOSE_NOWRITE,CLOSE
10:54:34 LCK..ttyAMA0 DELETE
10:54:34 tmpXU7dB3 CREATE
10:54:34 tmpXU7dB3 OPEN
10:54:34 tmpXU7dB3 MODIFY
10:54:34 tmpXU7dB3 OPEN
10:54:34 tmpXU7dB3 CLOSE_WRITE,CLOSE
10:54:34 tmpXU7dB3 DELETE
10:54:34 LCK..ttyUSB0 CREATE
10:54:34 LCK..ttyUSB0 OPEN
10:54:34 LCK..ttyUSB0 MODIFY
10:54:34 LCK..ttyUSB0 CLOSE_WRITE,CLOSE
10:54:34 LCK..ttyUSB0 OPEN
10:54:34 LCK..ttyUSB0 ACCESS
10:54:34 LCK..ttyUSB0 CLOSE_NOWRITE,CLOSE
10:54:34 LCK..ttyUSB0 DELETE
10:54:34 tmpaYqBMK CREATE
10:54:34 tmpaYqBMK OPEN
10:54:34 tmpaYqBMK MODIFY
10:54:34 tmpaYqBMK OPEN
10:54:34 tmpaYqBMK CLOSE_WRITE,CLOSE
10:54:34 tmpaYqBMK DELETE
10:54:34 LCK..ttyAMA0 CREATE
10:54:34 LCK..ttyAMA0 OPEN
10:54:34 LCK..ttyAMA0 MODIFY
10:54:34 LCK..ttyAMA0 CLOSE_WRITE,CLOSE
10:54:34 LCK..ttyAMA0 OPEN
10:54:34 LCK..ttyAMA0 ACCESS
10:54:34 LCK..ttyAMA0 CLOSE_NOWRITE,CLOSE
10:54:34 LCK..ttyAMA0 DELETE
10:54:34 tmpdg0Vcr CREATE
10:54:34 tmpdg0Vcr OPEN
10:54:34 tmpdg0Vcr MODIFY
10:54:34 tmpdg0Vcr OPEN
10:54:34 tmpdg0Vcr CLOSE_WRITE,CLOSE
10:54:34 tmpdg0Vcr DELETE
10:54:34 LCK..ttyUSB0 CREATE
10:54:34 LCK..ttyUSB0 OPEN
10:54:34 LCK..ttyUSB0 MODIFY
10:54:34 LCK..ttyUSB0 CLOSE_WRITE,CLOSE
10:54:34 LCK..ttyUSB0 OPEN
10:54:34 LCK..ttyUSB0 ACCESS
10:54:34 LCK..ttyUSB0 CLOSE_NOWRITE,CLOSE
10:54:34 LCK..ttyUSB0 DELETE
10:54:34 tmpKBd8W7 CREATE
10:54:34 tmpKBd8W7 OPEN
10:54:34 tmpKBd8W7 MODIFY
10:54:34 tmpKBd8W7 OPEN
10:54:34 tmpKBd8W7 CLOSE_WRITE,CLOSE
10:54:34 tmpKBd8W7 DELETE
10:54:34 LCK..ttyAMA0 CREATE
10:54:34 LCK..ttyAMA0 OPEN
10:54:34 LCK..ttyAMA0 MODIFY
10:54:34 LCK..ttyAMA0 CLOSE_WRITE,CLOSE
10:54:34 LCK..ttyAMA0 OPEN
10:54:34 LCK..ttyAMA0 ACCESS
10:54:34 LCK..ttyAMA0 CLOSE_NOWRITE,CLOSE
10:54:34 LCK..ttyAMA0 DELETE
10:54:34 tmpZ5hVCS CREATE
10:54:34 tmpZ5hVCS OPEN
10:54:34 tmpZ5hVCS MODIFY
10:54:34 tmpZ5hVCS OPEN
10:54:34 tmpZ5hVCS CLOSE_WRITE,CLOSE
10:54:34 tmpZ5hVCS DELETE
10:54:34 LCK..ttyUSB0 CREATE
10:54:34 LCK..ttyUSB0 OPEN
10:54:34 LCK..ttyUSB0 MODIFY
10:54:34 LCK..ttyUSB0 CLOSE_WRITE,CLOSE

Not sure why it does something with ttyAMA0, as I only use ttyUSB0 for RS485 communication.

MrDOS commented 3 years ago

Hey, as an NRJavaSerial contributor, I feel like I owe you folks my $0.02 on the underlying library issue at play here, and what I intend to do about it. My 👀 response to @fwolter's comment back in November was supposed to indicate that I was aware of the issue, but as time wears on, I think that response comes off as a bit too glib.

I noticed a lot of deleted lock files for an openhab process (example):

NRJavaSerial leaks the file descriptors for its lockfiles. I haven't hunted down where, exactly, but it seems to unlink(2) (or equivalent) the lockfile paths without close(2)'ing the open descriptors. (In your inotifywait output above, you can see that some of the files are opened twice, but only closed once.) As you're finding, this eventually gums up the works for itself and other file access done within the same process. It's been a while since I was last able to look at this, but IIRC, this manifests as an initialization error (initialise_event_info_struct: initialise failed!) and not a file open error due to limitations surrounding select(2) and FD_SETSIZE. The file can open: with an increased ulimit, you can open more than 1,024 (FD_SETSIZE) files. But the files are opened with large file descriptor numbers (>1,024), and so when the library tries to use the serial port, it calls FD_SET() with that larger-than-FD_SETSIZE descriptor, and encounters an error. The library solution is for us to use the newer poll(2) instead of select(2) – and to not leak file descriptors like a sieve in the first place.

This is why building NRJavaSerial with locking disabled circumvents the problem: no leaked lockfiles means no high file descriptor numbers, means select(2) is happy, and means no random initialization crashes.

New behaviour

  • when I/O error is encountered, serial port is kept open

Retaining/reusing old file descriptors (by reusing port objects) will avoid this problem, yes. Unfortunately, it can be really hard to distinguish when this is safe to do so: was the I/O error directly relating to the port hardware, e.g., a USB-to-serial dongle being disconnected, or was it a protocol error/unresponsive device on the far end? If you can rearchitect openHAB to continue to use old ports, you'll still eventually be unable to open new ports, but you can keep using the ones you already have open. It would be a pretty clumsy workaround for a problem you shouldn't have to deal with, but it should help.

Not sure why it does something with ttyAMA0, as I only use ttyUSB0 for RS485 communication.

As part of enumerating available ports ([CommPortIdentifier.getPortIdentifiers()](https://nrjs.org/docs/gnu/io/CommPortIdentifier.html#getPortIdentifiers())), NRJavaSerial tries to open each device in turn to confirm that the port is accessible to the current user. As far as I can tell, there's no valid reason for it to open the port to do the access check instead of just... performing an access check (i.e., via access(2)). But on top of that, it doesn't just open(2) the port – no, it goes through the entire usual rigmarole of locking/unlocking the port, too. So because port locking/unlocking leaks file descriptors, every port enumeration operation leaks file descriptors proportional to the number of serial ports attached to the system.

This is pretty high on the list of things I'd like to fix in NRJavaSerial. Unfortunately, the road to hell being paved with good intentions and all that, I haven't gotten to it yet. Hopefully I can soon.

What does the release cycle for look like, by the way? Do you issue updates on an as-available basis, or is there a release window you try to hit? Obviously you'd like this fixed in NRJavaSerial ASAP, but is there a time (other than “yesterday!”) when it would be optimal to receive a fix by?

fwolter commented 3 years ago

Thanks for your explanations!

port locking/unlocking leaks file descriptors, every port enumeration operation leaks file descriptors proportional to the number of serial ports attached to the system.

IIRC, openHAB also enumerates the serial ports each time you open a configuration UI containing a serial port.

The openHAB minor release cycle is twice a year, I think. In any case, the next release will already be in May.

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/serial-port-dev-ttyusb0-not-found-or-port-is-busy-but-only-periodically/121464/5

jamietownsend commented 3 years ago

@MrDOS is there a NRJavaSerial Bug we can point to from here?

MrDOS commented 3 years ago

Yes, https://github.com/NeuronRobotics/nrjavaserial/issues/111.

samsp99 commented 3 years ago

Are these lock files persistent across reboots? Is there anyway we can force clear them?

jamietownsend commented 3 years ago

No, the files are not persistent across reboots. They are not event persistent across openhab restarts. In fact, the files are unlinked by the java process (so not visible at the OS level), but not released (so still consuming a file handle and visible as "deleted" using "lsof -p ".

As far as I can tell, the only way to release the files is for the java process to do it itself (fix the nrjavaserial bug or perhaps reload a component) or to restart the process (the approach I've taken for now - I looked at the nrjavaserial code, but the bug is in c++ code and it's 20 years since I last programmed in c).

samsp99 commented 3 years ago

Hmm, I had a strange behavior then. The Serial binding had stopped working for me https://community.openhab.org/t/oh3-serial-binding-failing-port-is-not-known/121726/2. I have rebooted multiple times since, and it still doesn't work. I was debugging my own binding, and so stopped openhab via systemctl and restarted with

sudo openhab-cli start --debug

to enable remote java debugging. During startup, there was a message along the lines of: "RXTX stale lock file removed" and then the serial binding started working again. So this told me the lock files do persist beyond a reboot. When you run openhab as root (which is what happens if you start it like above), it can find and destroy those lock files. I am wondering if this is actually a permissions issue that the openhab account doesn't have sufficient access to find and remove the lock files at process startup.

MrDOS commented 3 years ago

So this told me the lock files do persist beyond a reboot.

If the Java process exits without the ports being released (without RXTXPort.close() being called), the lock files will persist on disk. However, when NRJavaSerial opens a port and finds an existing lock file, it checks whether the process which created that lock file still “owns” the lock, and if it doesn't, it deletes the stale lock file (resulting in the log message you're seeing). This message is almost certainly unrelated to the failure in serial bindings you're seeing.

Because your issue is intermittent and persists across restarts, I think it's unrelated to this particular issue. In particular, your logs don't include the incriminating initialise_event_info_struct: initialise failed! message. I'm not qualified to help diagnose openHAB-specific issues, but I'll leave a comment in your other thread.

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/oh3-serial-binding-failing-port-is-not-known/121726/3

jamietownsend commented 3 years ago

I think we're talking about two different kinds of lock files here (but I'm not 100% sure).

On openhabian, there's a lock file called /var/lock/LCK..ttyUSB0. This one is always created by the user that starts openhab. Sometimes, if I try to start the process with the openhabian user, this file can't be opened. Deleting this file (as root) and letting it recreate solves that problem.

The nrjavaserial problem has to do with hundreds of temporary files being created and deleted, but not correctly released by the java process. These files have names of the form /var/lock/tmp, where is a 6-character combination. Like I said, they are deleted, but not correctly released, so you won't see them with ls -al /var/lock, but you will see them with lsof -p <openhab java process id. Here's a short output from mine:

openhabian@openhabian:~ $ sudo lsof -p 11586 | grep "/var/lock/tmp"
java    11586 root   70u      REG              179,2         0   2719 /var/lock/tmpMjGOv0 (deleted)
java    11586 root   98u      REG              179,2         0   2761 /var/lock/tmprjAnKh (deleted)
java    11586 root  101u      REG              179,2         0   1561 /var/lock/tmpZ8yJTQ (deleted)
java    11586 root  102u      REG              179,2         0   2666 /var/lock/tmpUIf51z (deleted)
java    11586 root  104u      REG              179,2         0   2688 /var/lock/tmppPH18l (deleted)
...

On my currrently running instance (since 1.5 hours), there's a total of 128 such deleted files. THESE are the files that are not persisted when the java process stop (and therefore not persistent across restarts

MrDOS commented 3 years ago

These files have names of the form /var/lock/tmp, where is a 6-character combination.

My understanding was that these temporary files were created, written to, and then moved to /var/lock/LCK..name, and I thought that's where the leak was – the descriptor used to write to the lock files wasn't closed. But now that I look at the code (grep for mktemp and mkstemp, because that's the filename format), these files are actually created by check_group_uucp(). This function checks whether the current user has permissions to write to /var/lock. Again, I don't know why this check is done by opening and writing to a file instead of just checking access(2) for W_OK. But now that I know this is where it's being done, the source of the leak is obvious:

    if ( 0 == mkstemp(testLockAbsFileName) )
    {
        free(testLockAbsFileName);
        report_error("check_group_uucp(): mktemp malformed string - \
            should not happen");

        return 1;
    }

According to the man page for mkstemp(3), the function returns a read/write descriptor – but the return value is only error-checked (incorrectly; the check should be for == -1) and then discarded. The filename is then fopen(3)'d for writing, and that handle is correctly closed. I think whoever did this (pre-NRJavaSerial) meant to use mktemp(3), which only creates the file without opening it.

Thanks for triggering that investigation, @jamietownsend. I'll open a PR for this in a jiffy, seeing as it should be a one-character change. Then we'll need to re-test, etc. but it looks pretty straightforward. The other issues I outlined above are still valid, I think – access checking is inefficient, and poll(2) is preferable to select(2) – but probably don't need to be addressed in order to fix this.

MrDOS commented 3 years ago

I'm not sure how easy it is to try a different version of NRJavaSerial in an openHAB installation, but can those affected please try ~this build~ generated from https://github.com/NeuronRobotics/nrjavaserial/pull/216?

MrDOS commented 3 years ago

Belay that: I got the error checking for mktemp(3) wrong in that build (on failure, it doesn't return NULL, but rather modifies the template string to be 0-length). Try this build.

samsp99 commented 3 years ago

On openhabian, there's a lock file called /var/lock/LCK..ttyUSB0. This one is always created by the user that starts openhab. Sometimes, if I try to start the process with the openhabian user, this file can't be opened. Deleting this file (as root) and letting it recreate solves that problem.

That is exactly the problem I was having with the serial binding. Now I know what to look for I can potentially fix it next time. When I looked in /var/lock I couldn't see the file (even after sudo to root), but running openhab as root fixed it. I'm wondering if I can have a script as part of OS startup that will look for and delete the file if it still exists?

I'm glad that this conversation has triggered @MrDOS's investigation if that gets the other locking issue resolved.

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/zwave-and-ipcamera-issue/122209/21

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/z-wave-stick-goes-offline-after-sudo-reboot-from-cron/122321/19

thesebastianf commented 3 years ago

I also face that issue with a few bindings.

Since the mentioned issue above https://github.com/NeuronRobotics/nrjavaserial/pull/216

Is closed in the meanwhile will it Help If a next openhab Version uses a current Version of nrjavaserial.? Is so when will this component be Upgraded?

Best

Skinah commented 3 years ago

@thesebastianf The latest snapshot build is using 5.2.1 of NRJavaSerial which is the very latest version.

msexton187 commented 3 years ago

@Skinah Does 5.2.1 have the fix or will it be in the next release?

MrDOS commented 3 years ago

The fix will be in the next release, v5.3.0.

msexton187 commented 3 years ago

Thanks. I'll retest my openHAB setup once 5.3 makes it into the snapshot release.

On Wed, Jun 23, 2021 at 10:02 AM Sam Coleman @.***> wrote:

The fix will be in the next release, v5.3.0.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/openhab/openhab-core/issues/1842#issuecomment-866916322, or unsubscribe https://github.com/notifications/unsubscribe-auth/AM4INWIXHKPNKLEKEQLONMDTUHZR7ANCNFSM4T6WHNMA .

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/smartmeter-binding-throws-unable-to-open-port-after-a-while/94349/24

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/modbus-errors-when-upgrading-to-oh3-wierd/120225/55

dds90 commented 3 years ago

The fix will be in the next release, v5.3.0.

Hi @MrDOS , Have you got any idea about the fix release date? Thank you very much

Skinah commented 3 years ago

If you don’t get a reply you should post questions over on the nrjavaserial GitHub project.

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/solved-oh3-zwave-binding-z-wave-serial-controller-aeotec-z-stick-gen5-remains-offline/108182/107

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/z-wave-bridge-offline-after-dist-upgrade-to-debian-bullseye-x86-64/125588/4

farfade commented 3 years ago

If I understand correctly, getting nrjavaserial 5.3 (that is not yet released) embedded into openhab 3 is blocking for updating the Debian underlying OS to stable (Bullseye) if we want to have a chance to get serial stuff (zwave, modbus...) working :( ?

Relying onto nrjavaserial is still a good choice for the serial module of openhab ?

Edit : I've updated Debian to bullseye without a flaw for openhab serial. So forget this post, no link between openhab bugs and debian bullseye update !

msexton187 commented 3 years ago

Several of us are hoping 5.3 fixes our openHAB Debian issue with serial devices. For me it's z-wave not reconnecting after reboot.

On Thu, Aug 19, 2021 at 9:21 AM farfade @.***> wrote:

@kaikreuzer https://github.com/kaikreuzer , if I understand correctly, getting nrjavaserial 5.3 embedded into openhab 3 is blocking for updating the Debian underlying OS to Bullseye if we want to have a chance to get serial stuff working ?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/openhab/openhab-core/issues/1842#issuecomment-901957807, or unsubscribe https://github.com/notifications/unsubscribe-auth/AM4INWK3A4ENSJ6WC5I4URTT5UHO5ANCNFSM4T6WHNMA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/z-wave-controller-offline-after-power-failure/125964/2