tagyoureit / nodejs-poolController

An application to control pool equipment from various manufacturers.
GNU Affero General Public License v3.0
327 stars · 97 forks

Intellicenter setOptionsAsync() is not reliable #222

Closed george-basco closed 4 years ago

george-basco commented 4 years ago

I assume this is by design, but given that options are set without the use of `response: IntelliCenterBoard.getAckResponse(168)`, the result is that setting options can fail, leaving njspc out of sync with the IC.

Ideally, any state-change request to the IC would be reliable. If that's not possible, I'd like to understand why.

I can provide examples, but I'm assuming there's no need for that. Thanks.
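The pattern being asked for might be sketched like this (a minimal reduction: `Outbound`, `sendWithAck`, and the ack-matcher callback are modeled loosely on njspc's message classes, not its actual API). The point is that attaching an expected ACK lets a set request fail loudly instead of silently drifting out of sync:

```typescript
// Hypothetical sketch: an outbound message carries an optional predicate
// describing the ACK it expects back from the bus.
type Ack = { action: number };

class Outbound {
  constructor(
    public action: number,
    public payload: number[],
    public response?: (ack: Ack) => boolean, // expected-ACK matcher
  ) {}
}

// Returns true only if the bus delivered an ACK matching the expectation.
// A message with no expectation is fire-and-forget: assumed success.
function sendWithAck(msg: Outbound, busAcks: Ack[]): boolean {
  if (!msg.response) return true;
  return busAcks.some(ack => msg.response!(ack));
}

const setOptions = new Outbound(168, [0, 0, 27], ack => ack.action === 168);
console.log(sendWithAck(setOptions, [{ action: 168 }])); // true: ACK seen
console.log(sendWithAck(setOptions, []));                // false: report failure
```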

rstrouse commented 4 years ago

This is the same issue that Pentair faces with options. The ACK responses for options aren't one-for-one. For instance, each 168 isn't followed by an ACK[168]. If you send 2 in a row it often only sends back one ACK, and on some occasions none at all even when it succeeded. It gave me great heartburn and explained the weirdness on the web app as well when you, for instance, change more than one option at a time. When we coded this originally, the result was a returned error even though it succeeded, so we decided to fall forward with it.

However, with the new firmware perhaps this has been fixed. Even if it hasn't, we also did a great deal of work on the resends, so it can be persistent in forcing an ACK without becoming a nuisance. The solution might simply be to beat it like a dog until it gives us the response we want. We have had to do that in other instances as well, most notably when it comes to group settings.
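The "be persistent" resend idea reduces to a retry loop. A sketch (attempt count and delay are placeholder values, not njspc's actual tuning; `send` stands in for whatever pushes the 168 onto the bus and resolves true when an ACK arrives):

```typescript
// Resend until the panel ACKs or we give up; resolves with the attempt
// number that succeeded so callers can see how persistent it had to be.
async function sendPersistent(
  send: () => Promise<boolean>, // resolves true if an ACK came back
  maxAttempts = 3,
  retryDelayMs = 100,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await send()) return attempt; // ACK received
    await new Promise(r => setTimeout(r, retryDelayMs)); // back off, retry
  }
  throw new Error(`Message aborted after ${maxAttempts} attempt(s)`);
}
```

The thrown message mirrors the "Message aborted after N attempt(s)" warning seen later in this thread.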

rstrouse commented 4 years ago

I added the ACK[168] processing into the options messages. It did however start dumping configuration on the first message but the code is persistent in processing the data until all options are set. We'll see how this goes.

Uggh, I just got it to fail by feeding too many messages to it.

george-basco commented 4 years ago

Your latest code with the ACK processing works well when you only change one option. However, if you change multiple options, it seems to always hang after the first 168 goes out and ACK comes back (dashPanel gets stuck on "Saving Delays"). Something about queuing multiple output messages with ACK processing doesn't seem to work correctly. I can give more detail on this behavior if it's not easy to produce on your side. Was wondering if there are other places where this happens successfully?

If you stick with only a single change per request, things work well. I scripted multiple successive PUTs (with no delays between them) through the web interface to mimic the multiple option changes, and that all works great, meaning ACKs mostly come back correctly and failures get retried, all with no hanging.

Some other observations. Making requests from the ICP, I see that every 168 action seems to be ACK'ed. I haven't found a way on the ICP to cause multiple 168 requests like you see in a "save delays" in dashPanel. This is mostly because the 168s go out immediately when you make a change in the UI, i.e. changes aren't bundled with a "save changes" button/action. I also noticed that I wasn't able to cause any invalid packets to show up when making changes via the ICP. Of course I can't create the packets as quickly with the ICP, but creating 168 requests at similar rates with the ICP vs. njspc, there seems to be a difference in resulting invalid packets. Is the ICP able to avoid collisions on the bus? No clue if this is a real difference and, if so, why.

rstrouse commented 4 years ago

Yeah I need to take a deep dive into this. The problem is that it starts sending Action 30s between the ACKs. I think it has to do with the source byte.

rstrouse commented 4 years ago

Alright so I figured out what is going on here. Turns out that each message has to be atomic. It needs to send the message then process all the returns. These are from the other controllers on the bus. The code currently puts together a synchronous message queue then processes them all at once. Sadly the panels sometimes echo the command that was just sent.

I am restructuring the queue to be asynchronous and read the messages in between the sends/re-sends. I'll post up some changes tomorrow that change all the places where this technique is employed.
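A minimal sketch of that restructuring, assuming a simple job queue (the class and its shape are illustrative, not njspc's actual `SendRecieveBuffer`): each outbound message is atomic, so the next send doesn't start until the current message and its replies have fully settled.

```typescript
type Job = () => Promise<void>;

// Drains jobs strictly one at a time; a job's send/receive round trip
// completes before the next message ever touches the bus.
class AsyncQueue {
  private jobs: Job[] = [];
  private running = false;

  enqueue(job: Job) {
    this.jobs.push(job);
    if (!this.running) void this.run();
  }

  private async run() {
    this.running = true;
    while (this.jobs.length) {
      const job = this.jobs.shift()!;
      await job(); // next send waits until this message fully settles
    }
    this.running = false;
  }
}
```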

rstrouse commented 4 years ago

Reworked all the Promise.all constructs for IntelliCenter. Testing shows that all action 168s are processed and the ACK(168) comes in one-for-one. Previously, the configuration requests would bowl over the responses. This prompted the other control panels on the bus to begin requesting information before the process had completed. It would even cause the panels to emit their own 168s under some circumstances.
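The difference is easy to show in miniature (both functions are hypothetical reductions, not njspc's code): `Promise.all` fires every request onto the bus at once, while a sequential loop gives each 168 its one-for-one ACK window.

```typescript
type Task = () => Promise<void>;

// Old shape: all requests hit the bus concurrently, so configuration
// requests could bowl over the responses.
async function runConcurrent(tasks: Task[]): Promise<void> {
  await Promise.all(tasks.map(t => t()));
}

// New shape: one message, one ACK, in order.
async function runSequential(tasks: Task[]): Promise<void> {
  for (const t of tasks) await t();
}
```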

rstrouse commented 4 years ago

@george-basco let me know if the changes work for you as well. I haven't had a failure and the messages seem to all stream like ducks in a row.

george-basco commented 4 years ago

Your changes have made things much better and generally everything works great. However, there are still edge cases that aren't handled correctly. In my testing I would turn off or turn on all three "delay" flags. As I said the new code handles errors much better, and most of the time retries are successful and everything ends up in a consistent state.

Here are two cases where I first turn on all three flags, followed by turning them off.

[screenshot]

[screenshot]

Looking at the second screenshot: The turning-on 168's all go through with no issue. The turning-off 168's don't go so well. The first one doesn't get ack'ed and the two subsequent retries also are not ack'ed and we see invalid packets show up.

The PUT /config/general returns 200 OK with the following data:

{"options":{"units":0,"adjustDST":true,"clockSource":"internet","clockMode":24,"pumpDelay":true,"cooldownDelay":true,"manualPriority":true,"manualHeat":false}}

So this all looks correct in that the flags are still true, indicating that it did not succeed in turning them off. However, dashPanel doesn't indicate any error. It shows all three in the off state now: [screenshot]

This shows up on the console:

[10/3/2020, 19:09:18] info: [19:09:18] 192.168.1.28 PUT /config/general {"options":{"manualPriority":false,"pumpDelay":false,"cooldownDelay":false}}
Failed:{"bytesReceived":153828,"success":6717,"failed":3,"bytesSent":2898}
Failed:{"bytesReceived":153930,"success":6720,"failed":4,"bytesSent":2950}
[10/3/2020, 19:09:21] warn: Message aborted after 3 attempt(s): 165,1,15,33,168,41,0,0,27,1,2,0,0,0,133,0,0,0,0,0,240,0,0,1,3,0,0,70,100,100,100,2,2,0,0,15,0,1,0,0,100,0,0,0,0,1,0,5,41
(node:32353) UnhandledPromiseRejectionWarning: ApiError: Message aborted after 3 attempt(s): 165,1,15,33,168,41,0,0,27,1,2,0,0,0,133,0,0,0,0,0,240,0,0,1,3,0,0,70,100,100,100,2,2,0,0,15,0,1,0,0,100,0,0,0,0,1,0,5,41
    at SendRecieveBuffer.writeMessage (/home/pi/src/nodejs-poolController/nodejs-poolController-next/controller/comms/Comms.ts:279:27)
    at SendRecieveBuffer.processWaitPacket (/home/pi/src/nodejs-poolController/nodejs-poolController-next/controller/comms/Comms.ts:229:29)
    at SendRecieveBuffer.processOutbound (/home/pi/src/nodejs-poolController/nodejs-poolController-next/controller/comms/Comms.ts:237:30)
    at SendRecieveBuffer.processPackets (/home/pi/src/nodejs-poolController/nodejs-poolController-next/controller/comms/Comms.ts:220:21)
    at Timeout.<anonymous> (/home/pi/src/nodejs-poolController/nodejs-poolController-next/controller/comms/Comms.ts:251:59)
    at listOnTimeout (internal/timers.js:549:17)
    at processTimers (internal/timers.js:492:7)
(node:32353) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 2)

So it seems like dashPanel is ignoring the response it gets back. I'd like to see the actual state reflected in the UI with some type of error message.

It might be nice to have the number of retries configurable, and maybe even increase the default. Given the number of errors I see in my environment, I would definitely increase the number of retries, considering write actions are relatively rare and I would prefer things be more accurate at the expense of some delays. Assuming the extra retries don't cause more problems, of course.

I should probably test with much higher retry limits and see what happens in the error cases.

Anyway, thanks for the new code, it's a big improvement.

rstrouse commented 4 years ago

The error should have made it to dashPanel. Perhaps there is a missing catch. I'll check it out, but I don't get the failure. The number of message failures looks really good in your screenshot, so your comms are solid. Turn on the action 30 messages in messageManager. Let's see if you are getting floods from the other panels on your bus.

george-basco commented 4 years ago

The captures above have everything except pump packets and 2/204 packets, so any 30's would show up.

I did 2000 "PUT /config/general" requests for the delay options and set the retries to 10. Of those 2000 requests:

- 1501 needed a single 168 request packet
- 410 needed 2 requests
- 58 needed 3
- 23 needed 4
- 5 needed 5
- 3 needed 6

So with no retries, I should see a 25% failure rate. With 2 retries I should see a 1.5% failure rate.
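Spelling out that arithmetic (the histogram is straight from the test run above; the function is just a worked example): a request "fails" at N allowed attempts if it needed more than N to get an ACK.

```typescript
// Attempts needed -> how many of the 2000 requests needed that many.
const attemptsNeeded: Record<number, number> = {
  1: 1501, 2: 410, 3: 58, 4: 23, 5: 5, 6: 3,
};
const total = Object.values(attemptsNeeded).reduce((a, b) => a + b, 0); // 2000

// Fraction of requests that would still be unacknowledged after
// maxAttempts tries (i.e. needed strictly more attempts than allowed).
function failureRate(maxAttempts: number): number {
  const failed = Object.entries(attemptsNeeded)
    .filter(([n]) => Number(n) > maxAttempts)
    .reduce((sum, [, count]) => sum + count, 0);
  return failed / total;
}

console.log(failureRate(1)); // 0.2495 -- no retries, ~25%
console.log(failureRate(3)); // 0.0155 -- 2 retries, ~1.5%
```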

Not a lot on the bus at this time so numbers likely only get worse than this. Any reason not to configure a retry count of at least 10?

rstrouse commented 4 years ago

The retries are a balance. A greater number of retries simply delays a failure if one is to occur. We have found that 3 tries seems sufficient for most things. While setting the retry count higher is probably a good idea, 10 is probably a little high given that the retries inject a delay into the mix. With 10, that puts the minimum response time on failure at right around 18 seconds, and that is probably too long. I don't know the code you used to send 2000 requests, but if you didn't wait for the response, some of those failures are due to a backlog on the queue. What is interesting is that some commands are lazier than others with the return. I wonder how many of the 23 that needed 4 just needed more patience.

The abort failure that showed up on your console should have been reflected all the way to the client with a 400 series http code. I am about to take a look at where we may have masked the error return. Async processing can be maddening with that.

rstrouse commented 4 years ago

Well I figured out why the rejection was unhandled. Turns out a simple return statement was missing in the catch for setGeneralAsync. There were 5 other places where the return statement was missing from the rejection. I also boosted the retries to 5 so the 168s will try 6 times before failing. Pull a new njspc and dashPanel. You can now upload your own backgrounds.
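The missing-return bug reduces to a common promise-executor pitfall (this is a simplified reproduction, not setGeneralAsync's real body): `reject()` does not stop execution, so without `return` the code after the catch keeps running, the rejection escapes as an UnhandledPromiseRejectionWarning, and the client can still be told everything succeeded.

```typescript
// Fixed version; the comment marks the one-line fix described above.
function setGeneralAsync(send: () => Promise<void>): Promise<string> {
  return new Promise(async (resolve, reject) => {
    try {
      await send();
    } catch (err) {
      return reject(err); // the fix: without `return`, execution fell through
    }
    resolve('ok'); // only reached when send() actually succeeded
  });
}
```

With the `return` in place, a failed send rejects cleanly and a 400-series response can propagate to the client instead of a spurious 200 OK.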

george-basco commented 4 years ago

Great, thanks!

Btw, my testing sleeps 1 second between requests. For the requests that needed 6 attempts, it took about 5 seconds for those requests to happen.

george-basco commented 4 years ago

Installed the latest and ran into one minor issue. The new code references the directory themes/images instead of themes/Images

rstrouse commented 4 years ago

As you can tell, my world is not in the *nix space. Case-sensitive directory searches were fine when we were limited to 8 ASCII characters; now it's just pretty dumb, considering the locale also ties to the loaded character set. I fixed it and checked it in.

rstrouse commented 4 years ago

Closing this after testing. njspc will now try more times to get the 168 to process for the options.