Closed andrewfg closed 2 years ago
This issue has been mentioned on openHAB Community. There might be relevant details there:
https://community.openhab.org/t/velux-new-openhab2-binding-feedback-welcome/32926/1311
- Implement a layer in OH core to de-serialize its calls to ThingHandler.dispose()..
PS it might be as simple as deferring part of ThingManagerImpl.unregisterAndDisposeHandler()
to a separate scheduled task as shown below :) .. but then again, it is probably much more complex than that :(
^ @sjsf => ??
The Velux binding is the only binding I'm aware of that defers a dispose. We had a discussion on that when the PR was reviewed. The only reason we need this quirk is the crappy Velux firmware, and they seem unwilling to fix that. I don't know if it's reasonable to raise a change in the core due to this issue. Although, I'm able to relate...
I saw that you already added a workaround to the zwave binding. I think that's the way to go: fix the bindings which take an unnecessarily long time to dispose.
This issue has been mentioned on openHAB Community. There might be relevant details there:
https://community.openhab.org/t/velux-new-openhab2-binding-feedback-welcome/32926/1319
@lolodomo / @gs4711 -- any thoughts?
I remember an interesting discussion with @morph166955 when he said OH3 took a long time to stop. I explained that all bindings using a serial connection take some time to stop. Unfortunately, I searched for that discussion on GitHub for a very long time without any success!
@lolodomo That ended up being an issue with jupnp having threads locking up. We resolved that as part of jupnp 2.6. The issue there was that we flooded jupnp with unsubscribe requests on shutdown and had a weird thing happen that we ran out of threads for the reply listener and it caused it to wait for the 10 second timeout and completely miss the reply. The fix was to create the third "hidden" pool on jupnp that the listeners exist on so that there is no way to run out of threads.
Can you find that issue? I am interested in what I said there, which is linked to the current issue. I remember I measured the time to stop OH. It took around 25s for me, but more than 2 minutes for you.
https://github.com/openhab/openhab-core/issues/1886 Are you referring to this?
1886 Are you referring to this?
Yes, thank you. I did not find it because you finally renamed it.
What is interesting, I hope, is that I explained at the beginning of that issue that bindings using a serial connection take almost 4s to dispose on an RPI. This is the case, for example, for the RFXCOM and Powermax bindings. So are they also affected by the current "problem", like the Velux binding?
My RPI takes 26s to shut down with 3 serial connections. From the log message "Calling dispose handler for thing" to when the ThingStatus changes to UNINITIALIZED takes ~150ms. I can't see these 4s delays, and I never encountered them during development either. Tested with OH 3.0.0. Has this issue perhaps been fixed by the updated NRJavaserial?
I can't see these 4s delays and I never encountered these during development, too.
Maybe it depends on the binding implementation. Here are the logs I produced in December 2020 using an OH3 snapshot. Note that the two bindings use the same pattern to close the serial connection (and its reading thread).
21:10:24.580 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - Thing 'powermax:serial:maison' changed from ONLINE to UNINITIALIZED
21:10:24.588 [DEBUG] [nternal.handler.PowermaxBridgeHandler] - Handler disposed for thing powermax:serial:maison
21:10:24.593 [DEBUG] [nal.connector.PowermaxSerialConnector] - close(): Closing Serial Connection
21:10:27.600 [DEBUG] [.internal.connector.PowermaxConnector] - cleanup(): cleaning up Connection
21:10:27.612 [DEBUG] [ternal.connector.PowermaxReaderThread] - Data listener stopped
21:10:27.619 [DEBUG] [.internal.connector.PowermaxConnector] - cleanup(): Connection Cleanup
21:10:28.432 [DEBUG] [nal.connector.PowermaxSerialConnector] - close(): Serial Connection closed
21:10:28.437 [DEBUG] [nternal.handler.PowermaxBridgeHandler] - closeConnection(): disconnected
21:10:28.454 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - Thing 'powermax:serial:maison' changed from UNINITIALIZED to UNINITIALIZED (HANDLER_MISSING_ERROR)
21:10:28.543 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - Thing 'rfxcom:bridge:rfx' changed from ONLINE to UNINITIALIZED
21:10:28.550 [DEBUG] [.internal.handler.RFXComBridgeHandler] - Handler disposed.
21:10:28.558 [DEBUG] [ernal.connector.RFXComSerialConnector] - Disconnecting
21:10:31.565 [DEBUG] [ernal.connector.RFXComSerialConnector] - Serial port event listener stopped
21:10:31.570 [DEBUG] [ernal.connector.RFXComSerialConnector] - Interrupt serial listener
21:10:31.578 [DEBUG] [internal.connector.RFXComStreamReader] - Data listener stopped
21:10:31.583 [DEBUG] [ernal.connector.RFXComSerialConnector] - Close serial out stream
21:10:31.588 [DEBUG] [ernal.connector.RFXComSerialConnector] - Close serial in stream
21:10:31.592 [DEBUG] [ernal.connector.RFXComSerialConnector] - Close serial port
21:10:31.872 [DEBUG] [ernal.connector.RFXComSerialConnector] - Closed
21:10:31.887 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - Thing 'rfxcom:bridge:rfx' changed from UNINITIALIZED to UNINITIALIZED (HANDLER_MISSING_ERROR)
But maybe my messages are off topic considering @andrewfg concerns.
maybe my messages are off topic considering @andrewfg concerns.
I don't think so. The issue is that if one binding takes too long in its ThingHandler.dispose() method, then it may starve other bindings of sufficient time to do their own work.
The trick is that if you have a slow ThingHandler.dispose() method, then you should defer the slow parts of its code to a scheduler thread so that all bindings' ThingHandler.dispose() methods will be called quickly in turn, and then each binding can take its leisure in actually executing its slow dispose() code on its own thread.
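To make the timing difference concrete, here is a minimal simulation of the idea, not openHAB code: the Handler interface, slowClose() and the 300 ms delay are made up for illustration, and a plain ScheduledExecutorService stands in for the handler's scheduler. It calls dispose() on three handlers in series, once with blocking handlers and once with handlers that defer their slow part.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DeferredDisposeDemo {

    /** Hypothetical stand-in for a ThingHandler; not the openHAB interface. */
    interface Handler {
        void dispose();
    }

    /** Simulates a slow shutdown step, e.g. closing a serial port. */
    static void slowClose(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Calls dispose() on each handler in series; returns the elapsed milliseconds. */
    static long disposeAll(List<Handler> handlers) {
        long start = System.nanoTime();
        for (Handler handler : handlers) {
            handler.dispose();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    /** Returns { time with blocking handlers, time with deferring handlers } in ms. */
    static long[] runDemo() {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(3);
        // Blocks the caller for the whole slow close.
        Handler blocking = () -> slowClose(300);
        // Hands the slow close to the scheduler and returns immediately.
        Handler deferred = () -> scheduler.submit(() -> slowClose(300));
        long blockingMs = disposeAll(List.of(blocking, blocking, blocking));
        long deferredMs = disposeAll(List.of(deferred, deferred, deferred));
        scheduler.shutdown(); // already submitted close jobs still run to completion
        return new long[] { blockingMs, deferredMs };
    }

    public static void main(String[] args) {
        long[] result = runDemo();
        System.out.println("blocking handlers: " + result[0] + " ms");
        System.out.println("deferring handlers: " + result[1] + " ms");
    }
}
```

With blocking handlers the loop needs roughly the sum of all the delays; with deferring handlers it returns almost at once while the three close jobs run in parallel on the pool.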
I think we should find what causes the dispose methods of some bindings to take so long. I don't know if running the dispose code in a scheduler is a good solution.
@lolodomo can you reproduce this with the current OH version?
^ The framework does report a log.warn if a binding takes more than 5000 ms (but it doesn't tell you exactly how much more than 5000 ms). But the problem is that nobody really bothers to report such a binding, because a log.warn isn't really function-threatening.
I will try again, but judging from my old logs, what takes around 3s is the call to serialPort.removeEventListener() at this line: https://github.com/openhab/openhab-addons/blob/main/bundles/org.openhab.binding.rfxcom/src/main/java/org/openhab/binding/rfxcom/internal/connector/RFXComSerialConnector.java#L97
This ultimately calls the same method of the RxTx serial port: https://github.com/openhab/openhab-core/blob/main/bundles/org.openhab.core.io.transport.serial.rxtx/src/main/java/org/openhab/core/io/transport/serial/rxtx/RxTxSerialPort.java#L88
Same in the powermax binding.
This call is done by many bindings using a serial connection.
I think we should find what causes the dispose methods of some bindings to take so long.
Yes, that should indeed be avoided - dispose() is expected to return quickly, so we should take care of this during reviews.
I don't know if running the dispose code in a scheduler is a good solution.
I'd think this should be fine. We have the scheduler in the ThingHandlers, so submitting a (final) task to it is the obvious thing to do here.
The framework does report a log.warn if a binding takes more than 5000ms
Yeah, that's the safety guard that we've built in. Besides the warning, it also makes sure that dispose continues to execute while the framework already proceeds with the next handlers, so that a single binding cannot fully bog down the system.
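The general shape of such a guard can be sketched with a plain ExecutorService. This is not the framework's actual implementation; callWithTimeout(), disposeTaking() and the timings are invented for the example. The caller waits a bounded time for each dispose task, logs a warning on timeout, and moves on without cancelling the task.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

class DisposeWatchdogDemo {

    /** Counts simulated dispose tasks that ran to completion, including late ones. */
    static final AtomicInteger completed = new AtomicInteger();

    /** Hypothetical dispose() body that needs the given number of milliseconds. */
    static Runnable disposeTaking(long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
                completed.incrementAndGet();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    /** Waits at most timeoutMillis for the task; returns false if it had not finished. */
    static boolean callWithTimeout(ExecutorService pool, Runnable dispose, long timeoutMillis) {
        Future<?> future = pool.submit(dispose);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // No future.cancel() here: the slow task keeps running in the background.
            System.out.println("WARN: dispose did not return within " + timeoutMillis + " ms");
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            System.out.println("ERROR: dispose failed: " + e.getCause());
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newCachedThreadPool();
        System.out.println("slow handler in time: " + callWithTimeout(pool, disposeTaking(600), 200));
        System.out.println("fast handler in time: " + callWithTimeout(pool, disposeTaking(10), 200));
        pool.shutdown(); // lets the slow task finish
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("tasks completed: " + completed.get());
    }
}
```

The guard bounds how long one handler can hold up the next, but the slow handler's work still competes with the rest of the shutdown, which is why returning quickly from dispose() remains the better fix.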
I'd think this should be fine. We have the scheduler in the ThingHandlers, so submitting a (final) task to it is the obvious thing to do here.
@kaikreuzer shall I prepare a PR for this?
you should defer the slow parts of its code to a scheduler thread so that all bindings' ThingHandler.dispose() methods will be called quickly in turn
Sounds reasonable to me. Then all bindings have the same opportunity for a graceful termination.
This issue has been mentioned on openHAB Community. There might be relevant details there:
https://community.openhab.org/t/velux-new-openhab2-binding-feedback-welcome/32926/1321
shall I prepare a PR for this?
@andrewfg If you could briefly explain what you refer to as "this", I'll happily answer! 😎
submitting a (final) task to it is the obvious thing to do here.
Actually submitting a final taskS that call dispose() on all ThingHandlers
Then you've misunderstood me. dispose() must not be called by the thing handlers or any scheduled task; it is only called by the framework. What I meant was that within dispose(), a final task can be submitted, which keeps running and does whatever it has to do, while dispose() itself still returns quickly.
No, I think you misunderstood me. What I meant is that whereas currently the framework iterates over all ThingHandlers and directly calls dispose() on each, it should instead submit{ dispose() } to its own executor service on each. This means that all the dispose() methods would start executing in parallel rather than in series.
If you implement it like this, the thing life cycle state will change to UNINITIALIZED immediately, although the async dispose() may still be executing. That can lead to unwanted side effects. For example, if the thing status is changed to INITIALIZING immediately after it has reached UNINITIALIZED, the serial port will be opened again although it hasn't been closed yet. Of course, this should be handled by the binding's initialization code by trying again, but there might be other scenarios as well.
I think it makes things easier for binding developers if it is guaranteed that dispose() and initialize() cannot run concurrently.
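The serial port scenario can be reproduced in a small simulation. Everything here is hypothetical: FakePort, the 300 ms close and the executors are made up for illustration and are not part of openHAB. The sketch also shows one possible way a binding could keep its own deferred close and a later re-open in order, assuming it routes both through a dedicated single-threaded executor, which is not something proposed in this thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class OrderedLifecycleDemo {

    /** Hypothetical serial port: opening fails while the port is still held. */
    static class FakePort {
        private final AtomicBoolean held = new AtomicBoolean(true); // starts open

        void close() throws InterruptedException {
            Thread.sleep(300); // simulated slow close
            held.set(false);
        }

        boolean open() {
            return held.compareAndSet(false, true);
        }
    }

    /** Defers a slow close, then re-opens at once; returns whether the re-open succeeded. */
    static boolean closeThenReopen(ExecutorService closeExecutor, ExecutorService openExecutor) {
        FakePort port = new FakePort();
        Callable<Void> slowClose = () -> {
            port.close();
            return null;
        };
        Callable<Boolean> reopen = port::open;
        closeExecutor.submit(slowClose);
        Future<Boolean> reopened = openExecutor.submit(reopen);
        try {
            return reopened.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Separate threads: the re-open overtakes the slow close and fails.
        ExecutorService closer = Executors.newSingleThreadExecutor();
        ExecutorService opener = Executors.newSingleThreadExecutor();
        System.out.println("unordered re-open ok: " + closeThenReopen(closer, opener));

        // One single-threaded executor: jobs run strictly in submission order.
        ExecutorService lifecycle = Executors.newSingleThreadExecutor();
        System.out.println("ordered re-open ok: " + closeThenReopen(lifecycle, lifecycle));

        closer.shutdown();
        opener.shutdown();
        lifecycle.shutdown();
    }
}
```

This only orders work inside one binding; it does not change when the framework reports UNINITIALIZED, so a framework-level guarantee is still the simpler contract for binding developers.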
I think it makes things easier for binding developers if it is guaranteed that dispose() and initialize() cannot run concurrently.
Oh yes, I agree
@andrewfg My impression is that we all agree that this should be fixed in the respective thing handler. Do you think we can close this issue?
Do you think we can close this issue?
Ok.
this should be fixed in the respective thing handler
For the record, below is an example of how binding authors should do it in their thing handlers:
    @Override
    public void dispose() {
        // Return quickly: hand the slow shutdown work to the handler's scheduler.
        scheduler.submit(() -> {
            disposeSchedulerJob();
        });
    }

    private void disposeSchedulerJob() {
        // shut down process that takes a long time..
    }
Thanks. I just noticed that this is also part of the developer documentation: https://www.openhab.org/docs/developer/bindings/#shutdown
Note: I am posting this issue under openhab-core rather than openhab-addons because it relates either a) to an underlying shutdown issue in the core, or b) to non-compliance by some binding developers with the OH coding rules.
Background: The Velux binding suffers from a particularly sensitive communication issue on shutdown: if the TCP socket between OH and the Velux KLF hub is 'hard killed', the hub goes into a zombie state, and a specific and orderly shutdown sequence is required to prevent this.
Symptoms: I have been doing some tests, and found out the following…
1. From the time that the user issues a console command to stop the openHAB service, the OH framework takes a variable time of between 0 and 8 seconds before the Velux binding dispose() method is called.
2. From the time that the Velux dispose() method is called, the actual Velux binding code takes a fairly stable time of 3.5 seconds to physically shut down the KLF comms.
I guess that our problem is the variable 0 to 8 seconds delay in step 1 above: if the delay gets too long, then the OH framework will already be trying to tear down its underlying TCP socket system (hard close) before the poor Velux dispose() method has had a chance to finish its soft close.
Obviously the OH framework calls dispose() on all its Things, UI, and other services in no particular sequence. Sometimes the Velux dispose() gets called early in the sequence and everything is good; but sometimes other Thing/service dispose() methods are called before the Velux dispose(), and then there can be problems for us.
Particularly problematic are those Things which (against the openHAB coding rules) block for a long time in their dispose() method, e.g. the ZWave binding dispose() blocks for more than 5 seconds; the framework gives a warning after 5 seconds, but it probably blocks longer than that, so it is likely to be guilty of a large proportion, if not all, of the ‘problem period’ of 8 seconds…
Solutions: I see two potential solutions:
- Force the developers of all bindings which are guilty of breaking the openHAB coding rules, by blocking in their ThingHandler.dispose() methods, to fix their code.
- Implement a layer in OH core to de-serialize its calls to ThingHandler.dispose()..
https://community.openhab.org/t/velux-new-openhab2-binding-feedback-welcome/32926/1311?u=andrewfg