Schedule Create/Delete is a very long operation (15-20 seconds)

mikemucc commented 3 years ago

TL;DR version: it looks like it's taking between 15-20 seconds for the Create New Schedule ID/Delete Schedule ID calls to successfully complete. I believe that this may be the reason I'm not seeing the ACKs on Create/Delete Schedule.

The Long Version: I'm now keeping a Set() object with the IDs of all of the Schedule Events (daily & run once) that exist on the system so I can quickly check to see if it exists (and return a 404 if somebody tries to Update or Delete an event ID that doesn't exist). I update this set every time I poll the ScreenLogic for new data. That polling frequency is currently 5 seconds.

So I end up with an object that looks kinda like this: { 700, 701, 702, 703, 704 }

So, from the perspective my my REST API:

Start up the API. The API does its initial data load and logs all of the Schedule Event IDs that currently exist on the system (console.log() the Set() object that is schedules.eventIds).
I call the "create New Schedule Event" endpoint.
I watch the API logs make the call back to the ScreenLogic to create a new Schedule Event
My API Polls the ScreenLogic Device immediately after making the CreateSchedule call (Lets cal this T=0) 3a. During that poll, the API prints all of the current Schedule Event IDs on the system again. This is currently still the same as in step 0.
5 seconds later, the API polls again (Call this T=+5 seconds). The set of Schedule Event IDs on the system hasn't changed.
Repeat Step 4 (Call this T+10)
Repeat Step 4 (T+15)
API Polls (T+20). Now the Schedule IDs on the system has incremented by 1 (i.e. the new schedule has actually been created).

So the apparent behavior is that a Create or Delete schedule event takes between 15 and 20 seconds to complete (at least on my system, the backend of which is an EasyTouch). This could be a result of my hardware, maybe the EasyTouch is a little slower than some of the other systems, who knows.

So I thought I remembered @bshep adding a keepalive value to the client. Could this be related to the behavior I'm seeing?

I could modify that line in my local cache to something like this.client.setKeepAlive(true, 25 * 1000); and see if it affects the behavior I'm seeing (if you think that would have any affect).

parnic commented 3 years ago

Yes, please test with different keep alive values.

mikemucc commented 3 years ago

Keepalive values don't seem to make a difference. I've tested with this.client.setKeepAlive(true, 2 * 1000) and this.client.setKeepAlive(true, 30 * 1000). I still don't see the callbacks.

That said...

It appears that the ScreenLogic system actually blocks while it is handling a schedule create/delete, and stops answering events altogether.

I have set up my polling loop to print logs for "onVersion", which it should get every 5 seconds. Here's the logs for a normal loop:

Schedule Poll for runOnce
Getting schedule info...
Polling...
Set(5) { 700, 701, 702, 703, 704 }
Schedule Poll for daily
Getting schedule info...
Got Version �8POOL: 5.2 Build 738.0 Rel

Here's a loop immediately after calling addNewScheduleEvent(0). Every time you see Polling..., 5 seconds have elapsed...

Schedule Poll for runOnce
Getting schedule info...
{ type: '0' }
Logged in...
Sending new event data... -> This is the addNewScheduleEvent(0) call
Polling...
Set(5) { 700, 701, 702, 703, 704 }
Polling...
Set(5) { 700, 701, 702, 703, 704 }
Polling...
Set(5) { 700, 701, 702, 703, 704 }
Polling...
Set(5) { 700, 701, 702, 703, 704 }
Got Version �8POOL: 5.2 Build 738.0 Rel

So I don't get a callback on my onVersion event for almost 20 seconds.

I suppose our next question is, do we see similar behavior when an official app creates/deletes a schedule... I'll test that and post the results in a few.

mikemucc commented 3 years ago

This is what it did when I pulled up my phone app and went to the Schedules > Schedules page, it literally hung the ScreenLogic unit so it wasn't returning ANY callbacks. I'll bet it crashed my running instance upstairs too.

Schedule Poll for runOnce
Getting schedule info...
Polling...
Set(5) { 700, 701, 702, 703, 704 }  ->  Opened App > Schedules > Schedules here...
Polling...
Set(5) { 700, 701, 702, 703, 704 }
Polling...
Set(5) { 700, 701, 702, 703, 704 }
Polling...
Set(5) { 700, 701, 702, 703, 704 }
Polling...
Set(5) { 700, 701, 702, 703, 704 }
... deleted 20 lines ... (~50 seconds passed)
Polling...
Set(5) { 700, 701, 702, 703, 704 }
events.js:292
      throw er; // Unhandled 'error' event
      ^
Error: connect ETIMEDOUT 10.168.10.146:80
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)

mikemucc commented 3 years ago

I've now discovered that any interaction with the outdoor panel (EasyTouch in my case) will cause the ScreenLogic to stop responding to network requests... Perhaps that's how they implemented scheduling updates, they made it block as if it's being interacted with outdoors

Regardless, modifying the client.setKeepAlive() values make no difference, and I'm not surprised if the system is literally blocking when doing Schedule operations.

Another interesting behavior I discovered though is that when the EasyTouch is in Service mode, the Create schedule callback comes back every time. I didn't test delete schedule but it's probably safe to say it comes back quickly too.

So either this embedded device just sucks at multitasking (which is quite possible) or there's something else going on with the state of the device when we send a Schedule CRUD command, or perhaps we need to send it something to tell it "prepare for schedule CRUD" before we start hitting it with schedule messages.

parnic commented 3 years ago

In messing with this tonight, I'm seeing the same random hangs you are describing for add/delete schedule events. Sometimes it completes just fine, sometimes it doesn't. I don't know the rhyme or reason.

mikemucc commented 3 years ago

Perhaps we should look at the Android app and see what it does when setting schedules. ~~Because I see none of this oddity when using the official app.~~ Let me rephrase that, I see this oddity show up in our stuff when I use the official app (as noted above), but it seems the official app either masks the odd behavior or knows something we don't.

That said, I'm adding a poll skip mechanism for this so that I don't stack up a bunch of client.connect()'s and crash the API.

parnic commented 3 years ago

You could use the addClientAPI to not have to connect a bunch of new clients.

mikemucc commented 3 years ago

Yes, that's on my roadmap. I figure it'll probably require at least somewhat of a refactor.

Does it just work like the standard client, i.e. call the same functions? I'm guessing then I would do an initial data load from the screenlogic and close those clients, then open this one and simply update the data object when I see the various events?

Perhaps it won't be so much of a refactor...

parnic commented 3 years ago

Does it just work like the standard client, i.e. call the same functions?

Yes

bshep commented 3 years ago

Sorry for being late to the conversation, a couple of points:

when you call the createScheduleEvent end point the reply from the controller includes the new id so you shouldn't need to poll
beware that you can create more schedules than your control panel supports ( i have an easytouch ), i created a bunch of schedules but on the first 8 worked the rest were listed but didnt execute, seems like the screenlogic has a separate memory for the schedules than the control panel....
its possible that having multiple clients connected affects the functioning of the create schedule process between the screenlogic and the control panel, i have only ever tested it with one connection active and it works instantly ( i had a bug initially and created about 100 events in a couple of seconds before i killed it!)

mikemucc commented 3 years ago

1. when you call the createScheduleEvent end point the reply from the controller includes the new id so you shouldn't need to poll

Yes, I'm aware of this. You pointed it out in the other thread, and I'm trying to use that.

The issue we're describing is that we're not reliably getting the createScheduleEvent back from the ScreenLogic, because the ScreenLogic seems to hang for some reason on these Schedule calls.

I had my API code set up to wait for that event and return the Event ID in the response JSON. Sometimes the call returns instantly, sometimes it does not. Having an official (i.e. phone) client connected seems to cause it to not respond to any events for a while.

I don't poll the ScreenLogic for the schedule data here, I poll the ScreenLogic for all of the data it contains (that the library supports) and massage that into a persistent object, portions of which get returned for the various API calls.

3. its possible that having multiple clients connected affects the functioning of the create schedule process between the screenlogic and the control panel, i have only ever tested it with one connection active and it works instantly ( i had a bug initially and created about 100 events in a couple of seconds before i killed it!)

This is not the behavior we're seeing. Sometimes it immediately returns, and sometimes it hangs.

mikemucc commented 3 years ago

@bshep What was the bug that created 100+ schedules? Was there something in the code that was perhaps accelerating the responses?

mikemucc commented 3 years ago

In messing with this tonight, I'm seeing the same random hangs you are describing for add/delete schedule events. Sometimes it completes just fine, sometimes it doesn't. I don't know the rhyme or reason.

@parnic Are you just using your library straight up, or are you going through my API?

parnic commented 3 years ago

In messing with this tonight, I'm seeing the same random hangs you are describing for add/delete schedule events. Sometimes it completes just fine, sometimes it doesn't. I don't know the rhyme or reason.

@parnic Are you just using your library straight up, or are you going through my API?

Just my library.

mikemucc commented 3 years ago

I'm in the process of converting over to addClient (basically done) and I'll dig back into schedule updates. It's amazing how much faster the API is now. I'm now down to a single streaming "data-in" client pulling data, and will create and destroy secondary clients for update operations (i.e. circuit changes, schedule updates, heat mode changes, etc).

I'm hoping that maybe this will alleviate the update issues we've been seeing. Will update once I have info.

mikemucc commented 3 years ago

So I'm running my refactored async api client and have started messing with this again.

Of note is that the load on the screenlogic box should be significantly lower now, as I'm not polling it every 5 seconds.

Today I'm running the api and trying to add/remove schedules and while I'm still seeing the delays on return of a create/delete schedule, I'm also seeing a message 12501 which resolves to either "Schedule Changed" (page 18) or "Runtime Changed" (page 13) according to CEisenach's document. I'm wondering if we need to do something with that 12501 message?

Now it seems like the first time I send a create new/delete by id message it hangs, then seems to return immediately every subsequent time.

I can now confirm that this return of a 12501 is consistent for the first time the library sends a create new schedule or delete schedule. The subsequent times I get back the expected response.

mikemucc commented 3 years ago

...and back on the message ID 8300 front, I'm seeing that too, but only when the system seems to be "busy". It tends to rear it's ugly head when I try to add or delete several schedules in a row (which is probably something that a normal user would never do via the apps, as the UIs are not conducive to adding or deleting multiple schedules within a short period of time).

The more often I see it the more often I'm beginning to think it really is a "system busy, please wait" message while it "flushes to disk". Especially since I'd assume that it needs to write schedule data to some kind of internal storage.

If we were going to "handle" that message, I'd suggest handle it as an expected response for any create/update/delete operation as a "I didn't catch that the firs time because I'm busy writing stuff and can't multitask, can you wait n seconds and send again?", i.e. expect it and handle it with some retry/back off logic around.

Ah embedded systems... if only we could get Pentair to release the actual spec.

mikemucc commented 3 years ago

Have we seen a message ID 9806 before? Some new messages for the record:


  sl:unit received message of length 12 +10ms
  sl:unit   it's an unknown type: 8300 +0ms
  sl:unit received message of length 8 +5ms
  sl:unit   it's an unknown type: 12501 +0ms
  sl:unit received message of length 12 +5s
  sl:unit   it's an unknown type: 8300 +0ms
  sl:unit received message of length 12 +3s
  sl:unit   it's an unknown type: 9806 +0ms```

parnic commented 3 years ago

Thanks for this info!

9806 looks like it's HLM_WEATHER_FORECASTCHANGED 🤷‍♂️

8300 I don't have any good ideas on handling. Your idea could work, but only if 8300's come back with the same sender id. The library would presumably need to inject a monotonically-increasing schedule id when one isn't specified by the caller in order to support your suggestion. I wouldn't be opposed to it if it behaves consistently. Wanna open a PR? :)

The official app source says 12501 is HLM_POOL_SCHEDULECHANGED. I suppose I should add handling and documentation for that. I appreciate you doing this research!

mikemucc commented 3 years ago

Ah, so then I think we can all agree that there is no reason to handle the 9806, as we can get better weather forecasts from other places. :-) It is interesting that I only started getting the 9806 after I started using the async client, I wonder what other fun behaviors we'll uncover because of it...

parnic commented 2 years ago

...nearly two years later...

@mikemucc I may have just identified what you were seeing here. Maybe. v1.10.0 fixes a nasty bug that's been present for quite some time wherein the library could appear to hang or simply stop handing off messages if multiple messages were received at the same time in the data handling loop. I saw this happen on my Raspberry Pi that runs my MagicMirror with a module that uses this library.

I'm seeing much more reliable behavior now in this scenario, so if you're still using this library and seeing this issue, please check out the new version!

edit: otherwise, if the system hangs when schedules are created, there's not much the library can do to mitigate it.

mikemucc commented 2 years ago

Yay!! I'll pull in the latest version of the library and test out my scheduling calls.

mikemucc commented 2 years ago

Initial testing works! A create and delete schedule operation returns nice and fast! Excellent work. The API routes are already coded, so i just need to do some cleanup on them and I'll push and update to my API code. @bshep We're getting closer... so now, how's your Angular?

parnic / node-screenlogic

Schedule Create/Delete is a very long operation (15-20 seconds) #44