Bad implementations of ThingHandler.dispose() that are blocking

andrewfg commented 3 years ago

Note: I am posting this issue under openhab-core rather openhab-addons because it relates either a) to an underlying shutdown issue in the core, or b) to a non compliance of some binding developers to the OH coding rules.

Background: The Velux binding suffers from a particularly sensitive communication issue on shutdown: If the TCP socket between OH and the Velux KLF hub is 'hard killed' the hub goes into a zombie state; and it requires a specific and orderly shutdown sequence to prevent this.

Symptoms: I have been doing some tests, and found out the following…

From the time that the user issues a console command to stop the openHAB service, the OH framework takes a variable time of between 0 and 8 seconds before the Velux binding dispose() method is called.
From the time that Velux dispose() method is called, the actual Velux binding code takes a fairly stable time of 3.5 seconds for it to physically shut down the KLF comms.

I guess that our problem is the variable 0 to 8 seconds delay in step 1. above: – If the delay gets too long then the OH framework will already be trying to tear down its underlying TCP socket system (hard close) before the poor Velux dispose() method has had a chance to finish its soft close.

Obviously the OH framework calls dispose() on all its Things, UI, and other services, in no particular time sequence. Sometimes Velux dispose() gets called early in the sequence and everything is good; but sometimes Thing/service dispose() methods are called before Velux dispose(), and then there can be problems for us.

Particularly problematic are those Things which (against the openHAB coding rules) are blocking for a long time in their dispose() method. e.g. the ZWave binding dispose() blocks for more 5 seconds; the framework gives a warning after 5 seconds, but it probably blocks longer than that, so it is likely to be guilty of a large proportion, if not all, of the ‘problem period’ of 8 seconds…

Solutions: I see two potential solutions..

Force the developers of all bindings which are guilty of breaking the openHAB coding rules, by blocking in their ThingHandler.dispose() methods, to fix their code.
Implement a layer in OH core to de-serialize its calls to ThingHandler.dispose()..

https://community.openhab.org/t/velux-new-openhab2-binding-feedback-welcome/32926/1311?u=andrewfg

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/velux-new-openhab2-binding-feedback-welcome/32926/1311

andrewfg commented 3 years ago

Implement a layer in OH core to de-serialize its calls to ThingHandler.dispose()..

PS it might be as simple as deferring part of ThingManagerImpl.unregisterAndDisposeHandler() to a separate scheduled task as shown below :) .. but then again, it is probably much more complex than that :(

andrewfg commented 3 years ago

^ @sjsf => ??

fwolter commented 3 years ago

The Velux binding is the only binding I'm aware of that defers a dispose. We had a discusson on that, when the PR was reviewed. The only reason we need this quirk is the crappy Velux firmware. And they seem not willing to fix that. I don't know if it's reasonable to raise a change in the core due to this issue. Although, I'm able to relate...

I saw that you already added a workaround to the zwave binding. I think that's the way to go, fix the bindings which take an unneccessary long time to dispose.

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/velux-new-openhab2-binding-feedback-welcome/32926/1319

andrewfg commented 3 years ago

@lolodomo / @gs4711 -- any thoughts?

lolodomo commented 3 years ago

I remember an intesresting discussion with @morph166955 when he said stopping OH3 took a long time to stop. I explained that all bindings using a serial connection takes some time to stop. Unfortunately, I searched the discussion in GitHub during a very long time without any success !

morph166955 commented 3 years ago

@lolodomo That ended up being an issue with jupnp having threads locking up. We resolved that as part of jupnp 2.6. The issue there was that we flooded jupnp with unsubscribe requests on shutdown and had a weird thing happen that we ran out of threads for the reply listener and it caused it to wait for the 10 second timeout and completely miss the reply. The fix was to create the third "hidden" pool on jupnp that the listeners exist on so that there is no way to run out of threads.

lolodomo commented 3 years ago

Can you find that issue? I am interested by what I said which in link with the current issue. I remember I computed the time to stop OH. It took around 25s while more than 2 minutes for you.

morph166955 commented 3 years ago

https://github.com/openhab/openhab-core/issues/1886 Are you referring to this?

lolodomo commented 3 years ago

1886 Are you referring to this?

Yes, thank you. I did not find it because you finally renamed it.

What is interesting, I hope, is that I explained at the beginning of that issue that bindings using a serial connection are taking almost 4s to dispose on a RPI. This is the case for example for the RFXCOM or the Powermax bindings. So are there also concerned by the current "problem" like the Velux binding ?

fwolter commented 3 years ago

My RPI takes 26s to shutdown with 3 serial connections. From the log message "Calling dispose handler for thing" to when the ThingStatus changes to UNINITILIZED takes ~150ms. I can't see these 4s delays and I never encountered these during development, too. Tested with OH 3.0.0. Is this issue due to the updated NRJavaserial fixed?

lolodomo commented 3 years ago

I can't see these 4s delays and I never encountered these during development, too.

Maybe it depends on binding implementation. Here are the logs I produced in December 2020 using an OH3 snapshot. Note that the two binding use the same pattern to close the serial connection (and its reading thread).

21:10:24.580 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - Thing 'powermax:serial:maison' changed from ONLINE to UNINITIALIZED
21:10:24.588 [DEBUG] [nternal.handler.PowermaxBridgeHandler] - Handler disposed for thing powermax:serial:maison
21:10:24.593 [DEBUG] [nal.connector.PowermaxSerialConnector] - close(): Closing Serial Connection
21:10:27.600 [DEBUG] [.internal.connector.PowermaxConnector] - cleanup(): cleaning up Connection
21:10:27.612 [DEBUG] [ternal.connector.PowermaxReaderThread] - Data listener stopped
21:10:27.619 [DEBUG] [.internal.connector.PowermaxConnector] - cleanup(): Connection Cleanup
21:10:28.432 [DEBUG] [nal.connector.PowermaxSerialConnector] - close(): Serial Connection closed
21:10:28.437 [DEBUG] [nternal.handler.PowermaxBridgeHandler] - closeConnection(): disconnected
21:10:28.454 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - Thing 'powermax:serial:maison' changed from UNINITIALIZED to UNINITIALIZED (HANDLER_MISSING_ERROR)

21:10:28.543 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - Thing 'rfxcom:bridge:rfx' changed from ONLINE to UNINITIALIZED
21:10:28.550 [DEBUG] [.internal.handler.RFXComBridgeHandler] - Handler disposed.
21:10:28.558 [DEBUG] [ernal.connector.RFXComSerialConnector] - Disconnecting
21:10:31.565 [DEBUG] [ernal.connector.RFXComSerialConnector] - Serial port event listener stopped
21:10:31.570 [DEBUG] [ernal.connector.RFXComSerialConnector] - Interrupt serial listener
21:10:31.578 [DEBUG] [internal.connector.RFXComStreamReader] - Data listener stopped
21:10:31.583 [DEBUG] [ernal.connector.RFXComSerialConnector] - Close serial out stream
21:10:31.588 [DEBUG] [ernal.connector.RFXComSerialConnector] - Close serial in stream
21:10:31.592 [DEBUG] [ernal.connector.RFXComSerialConnector] - Close serial port
21:10:31.872 [DEBUG] [ernal.connector.RFXComSerialConnector] - Closed
21:10:31.887 [INFO ] [ome.event.ThingStatusInfoChangedEvent] - Thing 'rfxcom:bridge:rfx' changed from UNINITIALIZED to UNINITIALIZED (HANDLER_MISSING_ERROR)

But maybe my messages are off topic considering @andrewfg concerns.

andrewfg commented 3 years ago

maybe my messages are off topic considering @andrewfg concerns.

I don't think so. The issue is that if one binding takes too long in its ThingHandler.dispose() method, then it may starve other bindings of sufficient time to do their own work.

The trick is that if you have a slow ThingHandler.dispose() method, then you should defer the slow parts of its code to a scheduler thread so that all bindings' ThingHandler.dispose() methods will be called quickly in turn, and then each binding can take its leisure in actually executing its slow dispose() code on its own thread.

fwolter commented 3 years ago

I think we should find what causes the dispose methods of some bindings to take so long. I don't know if running the dispose code in a scheduler is a good solution.

@lolodomo can you reproduce this with the current OH version?

andrewfg commented 3 years ago

^ The framework does report a log.warn if a binding takes more than 5000mS (but it doesn't tell you exactly how much more than 5000mS). But the problem is that nobody really bothers to report such a binding, because log.warn isn't really function threatening..

lolodomo commented 3 years ago

I will try again but considering my old logs what takes around 3s is the call to serialPort.removeEventListener() at this line: https://github.com/openhab/openhab-addons/blob/main/bundles/org.openhab.binding.rfxcom/src/main/java/org/openhab/binding/rfxcom/internal/connector/RFXComSerialConnector.java#L97

Which finally call the same method of the RxTx serial port: https://github.com/openhab/openhab-core/blob/main/bundles/org.openhab.core.io.transport.serial.rxtx/src/main/java/org/openhab/core/io/transport/serial/rxtx/RxTxSerialPort.java#L88

Same in the powermax binding.

This call is done by many bindings using a serial connection.

kaikreuzer commented 3 years ago

I think we should find what causes the dispose methods of some bindings to take so long.

Yes, that should indeed be avoided - dispose() is expected to return quickly, so we should take care of this during reviews.

I don't know if running the dispose code in a scheduler is a good solution.

I'd think this should be fine. We have the scheduler in the ThingHandlers, so submitting a (final) task to it is the obvious thing to do here.

The framework does report a log.warn if a binding takes more than 5000ms

Yeah, that's our safety guard that we'd build in. Besides the warning, it will actually also make sure that dispose is further executed, while already proceeding with the next handlers, so that a single binding cannot fully bog down the system.

andrewfg commented 3 years ago

I'd think this should be fine. We have the scheduler in the ThingHandlers, so submitting a (final) task to it is the obvious thing to do here.

@kaikreuzer shall I prepare a PR for this?

gs4711 commented 3 years ago

you should defer the slow parts of its code to a scheduler thread so that all bindings' ThingHandler.dispose() methods will be called quickly in turn

Sounds reasonable for me. Then, all bindings have the same opportunity for a graceful termination.

openhab-bot commented 3 years ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/velux-new-openhab2-binding-feedback-welcome/32926/1321

kaikreuzer commented 3 years ago

shall I prepare a PR for this?

@andrewfg If you could briefly explain what you refer to as "this", I'll happily answer! 😎

andrewfg commented 3 years ago

submitting a (final) task to it is the obvious thing to do here.

Actually submitting a final taskS that call dispose() on all ThingHandlers

kaikreuzer commented 3 years ago

Then you've misunderstood me. dispose() must not be called by the thing handlers or any scheduled task, it is only called by the framework. What I meant was that within dispose, a final task can be submitted, which keeps running doing its things it has to do and still returning dispose quickly.

andrewfg commented 3 years ago

No, I think you misunderstood me. What I meant is that whereas currently the framework iterates over all ThingHandlers and directly calls dispose() on each, it should instead submit{ dispose() } to its own executor service on each. This means that all the dispose() methods would start executing in parallel rather than in series.

fwolter commented 3 years ago

If you implement it like this, the thing life cycle state will change to UNINITIALIZED immediately, although the async dispose() may still be executing. That can lead to unwanted side effects. For example if the thing status is changed to INITIALIZING immediately after it has reached UNINITIALIZED, the serial port will be opened again although it hasn't been closed, yet. Of course, this should be handled by the binding's initilization code by trying again, but there might be other scenarios as well.

I think it makes things easier for binding developers if it is guaranteed that dispose() and initialize() cannot run concurrently.

lolodomo commented 3 years ago

I think it makes things easier for binding developers if it is guaranteed that dispose() and initializing() cannot run concurrently.

Oh yes, I agree

J-N-K commented 2 years ago

@andrewfg My impression is that we all agree that this should be fixed in the respective thing handler. Do you think we can close this issue?

andrewfg commented 2 years ago

Do you think we can close this issue?

Ok.

this should be fixed in the respective thing handler

For the record, below is an example of how binding authors should do it in their thing handlers..

@Override
public void dispose() {
    scheduler.submit(() -> {
        disposeSchedulerJob();
    });
}

private void disposeSchedulerJob() {
    // shut down process that takes a long time..
}

J-N-K commented 2 years ago

Thanks. I just noticed that this is also part of the developer documentation: https://www.openhab.org/docs/developer/bindings/#shutdown

openhab / openhab-core

Bad implementations of ThingHandler.dispose() that are blocking #2454

1886 Are you referring to this?