openhab / openhab-addons

Add-ons for openHAB
https://www.openhab.org/
Eclipse Public License 2.0

[openthermgateway] Gateway locks up if commands are sent too fast #12457

Closed andrewfg closed 2 years ago

andrewfg commented 2 years ago

As described in the Domotica Forum in this thread, there may occasionally be a problem if the binding sends several commands to the gateway in quick succession.

This problem does not always occur. It only occurs very infrequently, if/when a certain combination of events coincides ("the perfect storm"; see the above-mentioned thread for more details). However, when it does occur, it causes a stack overflow in the gateway, which locks it up and requires either a power cycle or pressing the reset button.

Note: I am posting this issue as a placeholder, but I am already working on (and testing) a fix in the binding code. I will open the respective PR when it is ready.

andrewfg commented 2 years ago

Ping @ArjenKorevaar for info..

ArjenKorevaar commented 2 years ago

Did the binding not have one single thread to read and write to the gateway? And if I understand correctly, this issue would not occur if a single thread were used to both read and write to the socket?

andrewfg commented 2 years ago

Did the binding not have one single thread to read and write to the gateway?

Almost! But not entirely. :)

The original sendCommand() method could have been called either a) from within the polling task's thread, or b) from the Thing handler (which runs on another OH framework thread). My recent PR added a synchronized wrapper on that method to prevent thread conflicts within the same method, but it could still have had reads and writes happening in parallel on separate threads.

I am already working on a solution whereby the pendingCommands object is converted from a Map to a FIFO List. The sendCommand() method just appends the command to the end of the FIFO (so it can be called from any thread). The polling task's thread checks the FIFO, writes the head command to the socket, and removes it from the FIFO when the respective 'ack' is received. If there is more than one command in the FIFO, the second and subsequent commands are not sent until the prior command has been ack'ed and removed from the FIFO. Furthermore, as you had it before, if the 'ack' does not arrive within a certain time, the head command is re-sent until a maximum timeout has expired.
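
For illustration, here is a minimal sketch of that approach. The class, method names, and timing values are illustrative assumptions of mine, not the binding's actual code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class CommandFifoSketch {

    /** A queued command plus the time it was first enqueued and last sent. */
    private static class PendingCommand {
        final String command;                         // e.g. "CS=55.0"
        final long enqueuedAt = System.currentTimeMillis();
        long lastSentAt = 0;                          // 0 = not yet written to the socket

        PendingCommand(String command) {
            this.command = command;
        }
    }

    private static final long RESEND_INTERVAL_MS = 1000; // re-send the head if no 'ack' yet (assumed value)
    private static final long EXPIRY_MS = 5000;          // give up on a command that never gets an 'ack' (assumed value)

    private final Deque<PendingCommand> pendingCommands = new ArrayDeque<>();

    /** May be called from any thread (polling task or Thing handler); it only appends to the FIFO. */
    public synchronized void sendCommand(String command) {
        pendingCommands.addLast(new PendingCommand(command));
    }

    /** Called periodically from the single polling thread; only the FIFO head is ever written. */
    public synchronized void processQueue(long now) {
        PendingCommand head = pendingCommands.peekFirst();
        if (head == null) {
            return;                                   // nothing pending
        }
        if (now - head.enqueuedAt > EXPIRY_MS) {
            pendingCommands.removeFirst();            // stale command: drop it
        } else if (now - head.lastSentAt > RESEND_INTERVAL_MS) {
            writeToSocket(head.command);              // (re)send the head command only
            head.lastSentAt = now;
        }
    }

    /** Called from the polling thread when the gateway's 'ack' for the head command is read. */
    public synchronized void onAckReceived() {
        pendingCommands.pollFirst();                  // head is complete; the next command may now be sent
    }

    private void writeToSocket(String command) {
        // placeholder for the actual socket write
    }
}
```

The key point of the sketch is that only the FIFO head is ever written to the socket, and it is only removed (allowing the next command to be sent) once its 'ack' has been read or it has expired.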

I have not opened a new PR yet, but you can review my revised version of OpenThermGatewaySocketConnector.java on my REPO HERE

I will test this code for quite some time to make sure that it has not caused any regressions.

andrewfg commented 2 years ago

^ I forgot to answer your question about whether running read and write on the same thread would solve the problem: The answer is no. The issue is actually not about serialising reads and writes, but rather about ensuring that the OTGW has completed execution of one command before sending it a subsequent command. If the OTGW is asked to start a command before it has finished a prior command, that can lead to a stack overflow. So the solution is to delay sending a second command until the ack for the prior command has been received.

ArjenKorevaar commented 2 years ago

What situations would lead to messages being sent that frequently? Misconfigured rules? There is a risk that the queue just keeps filling up as new messages are added faster than sent messages are removed after an ack, so we should handle that in some way. Especially since we noticed in the past that messages do not always get an ack, which is why we added logic to resend a message if no ack is received after some time.

andrewfg commented 2 years ago

What situations would lead to messages being sent that frequently?

OTGW firmware v5.2 introduced the function "Limit the validity of a CS command. A remotely set control setpoint expires after just over a minute" (see https://www.otgw.tclcode.com/changelog.html) which means that you have to configure the rules to resend the CS command every 30 seconds (say).

My actual rule sends CS=nn.nn followed immediately by CH=1 every 30 seconds. With the currently released version of the binding, the trace log is as follows. In this case, the CS command takes 32 milliseconds to execute before its 'ack' comes back, which means that the OH rule had already sent the CH write command before the CS command had completed. The hypothesis is that sending a second command before the prior command has completed causes (in infrequent cases) the OTGW stack to overrun and thus the gateway to lock up.

2022-03-09 15:19:10.418 [TRACE] [rnal.OpenThermGatewaySocketConnector] - Sent: CS=50.0
2022-03-09 15:19:10.429 [TRACE] [rnal.OpenThermGatewaySocketConnector] - Sent: CH=1
2022-03-09 15:19:10.450 [TRACE] [rnal.OpenThermGatewaySocketConnector] - Read: CS: 50.00
2022-03-09 15:19:10.476 [TRACE] [rnal.OpenThermGatewaySocketConnector] - Read: CH: 1

In the binding version that I am currently testing, there is a FIFO queue, which ensures that a new command will not be sent until the 'ack' has been read for the prior command. The log now looks as follows. As you can see, there is no longer any overlap of commands.

2022-03-11 17:11:50.573 [TRACE] [rnal.OpenThermGatewaySocketConnector] - Sent: CS=55.0
2022-03-11 17:11:50.600 [TRACE] [rnal.OpenThermGatewaySocketConnector] - Read: CS: 55.00
2022-03-11 17:11:50.601 [TRACE] [rnal.OpenThermGatewaySocketConnector] - Sent: CH=1
2022-03-11 17:11:50.622 [TRACE] [rnal.OpenThermGatewaySocketConnector] - Read: CH: 1

With the currently released version of the binding, with such overlap, I encountered three lock-ups in about two weeks. So I intend to let the new test version run for the same length of time to confirm that the problem is solved.

there is a risk that the queue just keeps filling up as new messages are added faster than sent messages are removed after an ack

messages do not always get an ack, which is why we added logic to resend a message if no ack is received after some time

Yes, I fully agree with both your above statements. The FIFO queue is constructed so that the following happens..

andrewfg commented 2 years ago

^ PS there is also a hard check so that if the FIFO buffer reaches a certain size, additional commands are refused and a logger.warn() message is recorded. Normally the > 5 seconds expiry check removes stale commands, so the hard check will only kick in and log the warning if the buffer reaches the size limit in under 5 seconds, i.e. in cases where an OH rule might be misbehaving.
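
For illustration only, the size guard on the FIFO could look roughly like this; the limit and log message below are my assumptions, not the binding's actual values:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Minimal illustration of the hard size check described above.
class BoundedCommandFifoSketch {

    private static final int MAX_PENDING_COMMANDS = 20;  // assumed limit, not the binding's actual value

    private final Logger logger = LoggerFactory.getLogger(BoundedCommandFifoSketch.class);
    private final Deque<String> pendingCommands = new ArrayDeque<>();

    /** Appends the command to the FIFO unless the hard size limit has been reached. */
    public synchronized void sendCommand(String command) {
        if (pendingCommands.size() >= MAX_PENDING_COMMANDS) {
            // refuse the command rather than letting a misbehaving rule grow the queue without bound
            logger.warn("Command queue is full ({} entries); refusing command '{}'",
                    pendingCommands.size(), command);
            return;
        }
        pendingCommands.addLast(command);
    }
}
```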

ArjenKorevaar commented 2 years ago

so the hard check will only kick in and log the warning if the buffer reaches the size limit in under 5 seconds

Not necessarily, I think: the queue may in theory fill up slowly if new commands are added just a tiny bit faster than acks are received, so it may take a while before the queue reaches the limit. But OK, as long as there is a mechanism in place that checks the size of the queue, it should be fine. All in all, seems a good solution to me.

andrewfg commented 2 years ago

seems a good solution to me.

:)

I will let it run on my operational OH system for a week or two; and if all is OK I will open a PR.

PS: when the socket connects, it sends one PR=A command, so I added a callback to save its 'ack' response as a property of the gateway Thing.
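
A rough sketch of that idea follows; the handler class, callback method name, and property key are illustrative assumptions of mine, not the binding's actual identifiers:

```java
import org.openhab.core.thing.Thing;
import org.openhab.core.thing.binding.BaseThingHandler;

// Sketch: the socket connector calls back into the Thing handler once the PR=A
// response has been read, and the handler stores that response as a Thing property.
abstract class GatewayHandlerSketch extends BaseThingHandler {

    protected GatewayHandlerSketch(Thing thing) {
        super(thing);
    }

    /** Invoked by the socket connector after the PR=A response has been received. */
    public void receivePrAResponse(String response) {
        // updateProperty() is provided by BaseThingHandler and persists the value on the Thing
        updateProperty("prAResponse", response);
    }
}
```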