Open robfgreenwood opened 5 years ago
IDE logs are where you can look for more information (ST web page -> "Live Logging".
Nest has been making a lot of server changes lately, I have seen outages for a hour or more over this past weekend.
If you get it going again, you can post some logs (likely a private message in the forum is better for privacy), and if you see problems, I would like to see the logs.
Thanks for the info, apologies I wasn't too clear with my request, there are a number of devices that form the Nest Manager suite, each seem to create logs in the ST IDE logs, is there any one of them specifically that I should be looking at e.g. Nest Manager? You really don't want the full logs from my hub, there's a lot of traffic!
Hi there, just noticed that my setup isn't working as setup - when my heating comes on/off I run a Core piston that uses Echo speaks to announce the state and the temperature. This has stopped announcing updates, when I've looked at the Echo Speaks trace I see this...
647e632f-6e21-4af4-8929-430d1f3ea512 07:51:04: error java.lang.IllegalArgumentException: Failed to complete. Message was: Read timeout to echo-speaks-leabrbozu.herokuapp.com/54.154.117.52:443 of 20000 ms @line 1050 (cookieRefreshResp) 647e632f-6e21-4af4-8929-430d1f3ea512 07:51:04: trace cookieRefreshResp... 647e632f-6e21-4af4-8929-430d1f3ea512 07:50:44: debug wakeUpServer Completed... | Process Time: (103ms) 647e632f-6e21-4af4-8929-430d1f3ea512 07:50:44: trace wakeUpServerResp...
While I appreciate this seems unrelated, it's an interesting coincidence that I thought worth logging.
On issues, with updating nest, usually start by looking at nest manager. It gather the data from nest and will show if it is having communication problems. Many times going into nest manager and hitting done/save you way out ensures things are working (either because of ST platform issues or Nest server issues).
On the data above, that all appears to be echo speaks logs, so you should handle that in echo speaks areas. None of those logs are related to nest manager.
I'm having the same issue. I have had alternate temp sensors setup for a while now but I noticed the house getting really hot this last week. Looked at the thermostat and it hadn't moved all day. I adjusted it down to a reasonable 72F and it hasn't moved since...meaning the alternate sensors don't seem to be working at all. I'll try to find something in the logs as it comes up but I don't think @robfgreenwood is the only one having problems.
Possibly related to the developer program changes recently?
I have been having this issue recently too. Nst manager is becoming increasingly less reliable, failing to update my thermostats twice in the last few days. Every time this has happened i have noticed that the "Data last received" was not up to date and have had to go into the NST manager app and change some setting to force it to update. This fixes it for a bit but then it gets stuck again.
The last time this happened was just now. "Data last received" was 17:58. I had the diagnostic log switched on and looked at the activity around this time (see log text below and note the "6:03:53 PM: warn | Manager : NST Manager (v5.6.0) | finishPoll: Polling BLOCKED | Reason: (Sending Cmd)" message.)
Does this help diagnose the problem? Let me know if there is anything i can do to help fix this. I have been using NST manager for a long time and it has been brilliant. It's only recently that it has started to fail like this.
Thanks for your help!
6:07:08 PM: info | Manager : NST Manager (v5.6.0) | schedNextWorkQ ALREADY PENDING │ queue: 2 │ schedTime: 4 │ recentSendCmd: 3 │ last seconds: 3601 │ cmdDelay: 4 | runInActive: true | command proc: false | Api Limited: null 6:07:08 PM: warn | Manager : NST Manager (v5.6.0) | finishPoll: Polling BLOCKED | Reason: (Sending Cmd) 6:07:07 PM: info | Manager : NST Manager (v5.6.0) | No Device or Structure poll - Devices Last Updated: 356 seconds ago | Structures Last Updated 355 seconds ago 6:03:53 PM: info | Manager : NST Manager (v5.6.0) | schedNextWorkQ ALREADY PENDING │ queue: 2 │ schedTime: 4 │ recentSendCmd: 3 │ last seconds: 3601 │ cmdDelay: 4 | runInActive: true | command proc: false | Api Limited: null 6:03:53 PM: warn | Manager : NST Manager (v5.6.0) | finishPoll: Polling BLOCKED | Reason: (Sending Cmd) 6:03:52 PM: info | Manager : NST Manager (v5.6.0) | No Device or Structure poll - Devices Last Updated: 161 seconds ago | Structures Last Updated 160 seconds ago 6:01:19 PM: info | Manager : NST Manager (v5.6.0) | schedNextWorkQ RUNIN │ queue: 2 │ schedTime: 4 │ recentSendCmd: 3 │ last seconds: 3601 │ cmdDelay: 4 | runInActive: true | command proc: false | Api Limited: null 6:01:19 PM: info | Manager : NST Manager (v5.6.0) | Adding Cmd to Queue 2 (qsize: 1): UWqQvP1f0HH7A1aBCaD8fMYFHiyp6yzJu9ffh8RbAxGPu8ZoxbqaIg, poll, poll, 0, NestWeather | UWqQvP1f0HH7A1aBCaD8fMYFHiyp6yzJu9ffh8RbAxGPu8ZoxbqaIg 6:01:19 PM: debug | Weather : Nest Weather (00000) (5.5.1) | Polling parent... 6:01:12 PM: info | Manager : NST Manager (v5.6.0) | API Device Data HAS Changed (POLL(ASYNC)) 6:01:11 PM: info | Manager : NST Manager (v5.6.0) | Updating Structure Data (Last Updated: 206 seconds ago) | Updating Device Data (Last Updated: 206 seconds ago) (async) 5:58:11 PM: trace | Automation-f4217 : Nest Location Home Watchdog | TEMPERATURE Event | Device: Nest Thermostat - Kitchen | Value: (20.0) with a delay of 35ms 5:57:45 PM: info | Manager : NST Manager (v5.6.0) | API Device Data HAS Changed (POLL(ASYNC)) 5:57:44 PM: info | Manager : NST Manager (v5.6.0) | Updating Structure Data (Last Updated: 163 seconds ago) | Updating Device Data (Last Updated: 163 seconds ago) (async)
Sorry for the trouble
I have been seeing some issues also. In my watching most recently I have seen ST timeout occurring - usually a sign the platform is have "issues".
going into Nest manager and hitting done, usually corrects this.
Tony - do we need ST folks to look at some of the timeouts?
I increased my polling interval to 5 mins. Thought that had done the trick but just noticed it’s stuck at data last received 4 hours ago :(
Stuck again at 6:09:13 BST this morning. This is getting ridiculous
And again at 19:50:44 BST. Seriously, can I help debug this?
I’ve been trying a few things since raising this, polling, logs, nest health monitors. But I’m running out of sensible options. The stall doesn’t appear to be related to when NST manger reports API issues at Nest, when it does this it appears to recover. When it stops working there are no notifications, this is beginning to make me think that NST manager has hung and stopped responding. It hung at 19:45 BST yesterday, this would have been approximately the time that doors would have been triggering the contacts open routines. @MaltZero has offered to help debug this, but right now I don’t know what to be looking for to help.
Just noticed this in the Watchdog logs, you can see that we lost some connection between 19:53 and 22:48, could someone shed some light on why it would report the below?
detailEventHistory[26]: (Map) │ ├── date: (Sat Apr 06 19:53:52 BST 2019) (String) │ ├── displayName: (Nest Thermostat - Hallway) (String) │ ├── name: (temperature) (String) │ ├── unit: (C) (String) │ └── value: (22.0) (String) ├── detailEventHistory[27]: (Map) │ ├── date: (Sat Apr 06 22:48:51 BST 2019) (String) │ ├── displayName: (Hunters Moon) (String) │ ├── name: (mode) (String) │ ├── unit: (null) (unknown) │ └── value: (Night) (String) ├── detailEventHistory[28]: (Map) │ ├── date: (Sun Apr 07 07:08:28 BST 2019) (String) │ ├── displayName: (Hunters Moon) (String) │ ├── name: (mode) (String) │ ├── unit: (null) (unknown) │ └── value: (Home) (String) └── detailEventHistory[29]: (Map) ├── date: (Sun Apr 07 07:49:25 BST 2019) (String) ├── displayName: (Nest Thermostat - Hallway) (String) ├── name: (temperature) (String) ├── unit: (C) (String) └── value: (21.5) (String)
OK what's going on at about 19:50 BST?!
It could be coincidence but i reported a hang at 19:50:44 BST 5 days ago. I had another one at 19:55:22 BST on Friday and @robfgreenwood reported a hang at around 19:53 BST on Saturday.
Thanks for the info, @robfgreenwood. It's good to know that NST manager can recover from Nest API issues but that does seem to indicate problems with NST manager itself or the SmartThings platform.
I don't know when this started to happen but perhaps it coincided with a NST manager update (has the code in this area changed recently?), or an update to the backend.
If i get time to look at this my plan was to add some debug logging to the polling code that seems to hang. Perhaps get it to notify me when it hangs, but I have not looked at the code yet.
Ideally it would handle the error case and force a reset in the same way that going into the automation and pressing "Done" seems to fix it. Like i say, i have not looked at the code so this is just me speculating :)
Minor update:
I added some code to send me a notification when polling is blocked with reason "(Sending Cmd)" and this occurred 3 times yesterday at 19:05, 19:20 and 19:34 BST. Note that there was no hang. It seemed to recover after each event and things are still working. These happened at roughly 15 minute intervals and are suspiciously close to the dreaded 19:50 BST.
I will add some more details to my notification to get more info. Wondering if there is a problem with the command queue logic when multiple commands are blocked. I did notice what looks like a force update but this occurs only under certain conditions:
if( (atomicState?.apiRateLimited && lastDevUpd > 35*60) || (lastDevUpd > 45*60 && (atomicState?.needChildUpd || atomicState?.forceChildUpd) ) )
{
LogAction("finishPoll: ReRunning Updated() | Polling blocked | Last device update ${lastDevUpd} | Rate Limited: ${atomicState?.apiRateLimited}", "warn", true);
runIn(5, "updated", [overwrite: false]) // ensure it does not keep delaying
return
}
So...
If we are API rate-limited and the device hasn't been updated for more than 35 minutes,
or:
The device hasn't been updated in more than 45 minutes and either the child (device?) needs, or is forcing, an update.
Then:
Call the function that gets called when we update the smart app settings (which presumably is what happens when you click "Done") and seems to fix it.
Perhaps in the case where we see the hang these conditions are not being met and we never get the force update?
The core issue is what is "hanging" in the first place - is it ST platform losing a response from nest, or it is nest not returning a response (in this case, ST is supposed to timeout the async command request.
Updated is a "big hammer", I'm concerned on doing this. Tony should look at this and give his view.
It would be good to see the logs at the "magic" times you have noticed.
I'm trying to understand do you have actions running (sending commands to Nest) near that time, or is it "quiet" in that only polling is running and no commands are being sent.
It would be good to catch the hang, and to be able to see if it is data gathering related, or commands being sent also.
As far as i know i don't have any actions running near that time, just polling. One of my thermostats has a temperature heating point set at 20:00 BST but this is done via the Nest schedule, not SmartThings.
The only logging of a hang i have been able to capture is shown above in a previous post. I managed to catch it as it occurred less than 48 hours after turning on the logging option in the NST manager app following a previous hang.
Do you have any suggestions as to how i can catch this in the logs? Can i turn on logging permanently rather than just for 48 hours?
Don't want to tempt fate but things have been working fine for the last few weeks :)
Unfortunately can't say the same - there really doesn't seem to be anything I can point at that is an "event". Like MaltZero it was starting to look stable again, then in the space of the last few days I've had at least three hangs, the only thing I can say about these ones is that we're currently on vacation so definitely nobody manually creating a problem.
@MaltZero interested whether you're still working OK? I'm still seeing multiple hangs, so was wondering if you'd made any changes to polling intervals at all? I've been able to see a bit more of the logging over the weekend and what I saw maybe suggested that Nest were blocking polling calls (rate limits?) and when I went into the NST Application preferences I could see that "Polling" wasn't active. I've changed the Device Polling interval to 10 mins (from 3) to see if rate limits maybe a problem. I would guess NEST rate limit by credential - so maybe why it's not widespread? So was wondering what the polling rate you have are.
Device (10 minutes) Structure (3 minutes) Weather (15 minutes) Forced Poll Refresh (10 seconds)
Secondly I'm experimenting with a Core Piston to monitor NST Manager - it seems we can monitor the lastConnection date and time. So I'm running some experiments to at least get notified if this time stops being updated.
Hey @robfgreenwood.
Here are my settings:
Device (5 minutes) Structure (4 minutes) Weather (15 minutes) Forced Poll Refresh (30 seconds)
I hadn't seen any problems until the evening of Monday 22nd (BST). I can't pinpoint the exact time though i'm afraid.
When i checked, the thermostat devices in the app were reporting a "last data received" date of April 8th! but given that i had not received any debug push notifications since around that time i wonder if i had actually broken it (i noticed no "play" button next to NST Manager in Automations). Out of interest, what happens when you publish a smartapp or device handler that doesn't "compile"?
I see that there has been a recent update to the queue logic so will keep an eye on things. i also removed my debug push notifications as i realised there is an option in the smartapp settings to be notified when things "go wrong".
Your CorePiston to monitor things sounds great but does this do anything in addition to the builtin in notifications? have you had any success with these ("API CMD Failures", "API Rate-Limiting" etc)?
Hi @MaltZero I upgraded to the latest version 5.6.2 on the 9th. It took a little while for the "play" button to re-appear, not sure what happens TBH if it doesn't compile. I would assume that it fails run time...
I've found the NST Manager error notification a little hit and miss TBH, I am thinking that if NST manager hangs then everything is hung and polling stops - this also stops the notifications (@imnotbob could you comment?). There was a point that this was notifying me when the NEST API was playing up, but I don't get anything when we seem to hang.
This led me to experiment with Core - basically an external monitor, it actually seems to be spotting a hang so I've got chance to take a look at logs etc. I'm using three checks the vStat temp hasn't changed for an hour, the $currentEventDate is [null] (when we fail this appears to happen) and finally the lastConnection parameter doesn't change for 15 mins.
Out of interest, do you use automatic Eco modes e.g. I have "All Away" go to Eco, plus when the doors are left open - go to Eco, there are some other scenarios I have. I've seen some odd behaviour coming out of Eco - but can't be 100% sure it's not a red herring.
Sorry for the slow reply, @robfgreenwood.
I don't use auto Eco modes, no. I have home, away and night temperatures, and a core piston that checks if i am "staying away" and only in that case puts the thermostats into Eco mode (which has a much lower temperature than my normal away temperature).
On Sunday evening i noticed that my home had been stuck on "home" temperatures since the previous Thursday. Grrr. So i increased my polling intervals and kicked it back into life. The following day, yesterday, i went out in the afternoon and decided to check what the situation was at home. Sure enough, still on "home" temperatures. Stuck again at 5:20 am (i think) :(
I was wondering. Is this anything to do with having structure, device and weather polling intervals that synchronise every now and then and end up hitting Nest with multiple queries simultaneously?
Surely other people must be having this problem too? Can it be the number of devices? I have 3 Protects and 2 Thermostats. @robfgreenwood, how many devices do you have?
Tbh, I don't think I really need the system to poll Nest at all. I don't use the SmartThings devices to control the thermostats. I just need the ability to set temperatures and heating modes from other automations (e.g. core). I seem to remember there is a lightweight install mode? maybe i'll have a look into what that gives me.
In the meantime i am going to reinstate my debug notification code as you are right, the ones i enabled in the smartapp have not notified me of any of these hangs.
I really don't want to have to spend much more time trying to fix this and am considering abandoning the whole thing. It would be more of a fuss but i could probably achieve most of what i want using IFTTT. The only thing is that doesn't have the level of control that NST manager does (no ability to set Eco mode, for example). I don't want to give up on this but reliability is paramount!
@robfgreenwood, is there any way you can share your core piston monitor?
I don't think the polling interval (if you are using the defaults is the issue). If it were you would see errors during polls in the "Live Logging" in regard to this.
I have more devices than you describe, so I don't think it is the device count (at least at the levels you show).
Sorry for the troubles, if anything shows up in "live logging", that is good to see.
Thanks for the info, @imnotbob. Can you remind me what the defaults are? i will reset my polling intervals to those values
Device and structure is 180sec
Hi All, lots of problems today, I've been able to get some info from live logging. Lots of Java exceptions etc. It'd be really good to get a view on the logs, I really do need to give up on this soon.
NST Errors
462b2885-9f0e-4d86-9394-106392e44ab6 14:39:51: info ||| NST Manager (v5.6.3) | schedNextWorkQ ALREADY PENDING │ queue: 0 │ schedTime: 4 │ recentSendCmd: 2 │ last seconds: 3601 │ cmdDelay: 4 | runInActive: true | command proc: false | Api Limited: null 462b2885-9f0e-4d86-9394-106392e44ab6 14:39:51: info ||| NST Manager (v5.6.3) | Skipping Cmd to Queue 0 (qsize: 2): ABC, poll, poll, 0, NestWeather | ABC 462b2885-9f0e-4d86-9394-106392e44ab6 14:39:51: warn || NST Manager (v5.6.3) | Command Matches last in queue 0 - Skipping 462b2885-9f0e-4d86-9394-106392e44ab6 14:27:02: info ||| NST Manager (v5.6.3) | schedNextWorkQ ALREADY PENDING │ queue: 0 │ schedTime: 4 │ recentSendCmd: 2 │ last seconds: 3601 │ cmdDelay: 4 | runInActive: true | command proc: false | Api Limited: null 462b2885-9f0e-4d86-9394-106392e44ab6 14:27:02: info ||| NST Manager (v5.6.3) | Skipping Cmd to Queue 0 (qsize: 2): ABC, poll, poll, 0, vyMENi8DhEZniTTCIfQ3G9Ji8zgyYa2Cd 462b2885-9f0e-4d86-9394-106392e44ab6 14:27:02: warn || NST Manager (v5.6.3) | Command Matches last in queue 0 - Skipping 462b2885-9f0e-4d86-9394-106392e44ab6 14:27:02: info ||| NST Manager (v5.6.3) | schedNextWorkQ ALREADY PENDING │ queue: 0 │ schedTime: 4 │ recentSendCmd: 2 │ last seconds: 3601 │ cmdDelay: 4 | runInActive: true | command proc: false | Api Limited: null 462b2885-9f0e-4d86-9394-106392e44ab6 14:27:01: info ||| NST Manager (v5.6.3) | Skipping Cmd to Queue 0 (qsize: 2): ABC, poll, poll, 0, yMENi8DhEZniTTCIfQ3G9Ji8zgyYa2Cd 462b2885-9f0e-4d86-9394-106392e44ab6 14:27:01: warn || NST Manager (v5.6.3) | Command Matches last in queue 0 - Skipping
… Hung @ 08:57 BST
462b2885-9f0e-4d86-9394-106392e44ab6 07:40:09: trace | NST Manager (v5.6.3) | workQueue │ Run Queue: 0 | (async) 462b2885-9f0e-4d86-9394-106392e44ab6 07:39:56: info ||| NST Manager (v5.6.3) | schedNextWorkQ ALREADY PENDING │ queue: 0 │ schedTime: 4 │ recentSendCmd: 2 │ last seconds: 3601 │ cmdDelay: 4 | runInActive: true | command proc: false | Api Limited: null 462b2885-9f0e-4d86-9394-106392e44ab6 07:39:56: warn || NST Manager (v5.6.3) | finishPoll: Polling BLOCKED | Reason: (Sending Cmd) 462b2885-9f0e-4d86-9394-106392e44ab6 07:39:56: info ||| NST Manager (v5.6.3) | No Device or Structure poll - Devices Last Updated: 62 seconds ago | Structures Last Updated 190 seconds ago 462b2885-9f0e-4d86-9394-106392e44ab6 07:39:51: info ||| NST Manager (v5.6.3) | schedNextWorkQ RUNIN │ queue: 0 │ schedTime: 4 │ recentSendCmd: 2 │ last seconds: 3601 │ cmdDelay: 4 | runInActive: true | command proc: false | Api Limited: null 462b2885-9f0e-4d86-9394-106392e44ab6 07:39:51: info ||| NST Manager (v5.6.3) | Adding Cmd to Queue 0 (qsize: 1): ABC, poll, poll, 0, NestWeather | ABC 462b2885-9f0e-4d86-9394-106392e44ab6 07:38:53: info ||| NST Manager (v5.6.3) | Updating Device Data (Last Updated: 222 seconds ago) (async) 462b2885-9f0e-4d86-9394-106392e44ab6 07:38:23: error java.util.concur...[TRUNCATED] 462b2885-9f0e-4d86-9394-106392e44ab6 07:38:23: error updateChildData E...[TRUNCATED]
….
462b2885-9f0e-4d86-9394-106392e44ab6 07:25:29: error java.util.concurrent.TimeoutException: Execution time exceeded 20 app execution seconds: 122774389702412 @line -1 (sendExceptionData) 462b2885-9f0e-4d86-9394-106392e44ab6 07:25:29: error updateChildData Exception: java.util.concurrent.TimeoutException: Execution time exceeded 20 app execution seconds: 122774389702412 @line -1 (doCall) 462b2885-9f0e-4d86-9394-106392e44ab6 07:24:52: info ||| NST Manager (v5.6.3) | No Device or Structure poll - Devices Last Updated: 114 seconds ago | Structures Last Updated 68 seconds ago 462b2885-9f0e-4d86-9394-106392e44ab6 07:24:11: error java.util.concurrent.TimeoutException: Execution time exceeded 20 app execution seconds: 122697220703384 @line -1 (sendExceptionData) 462b2885-9f0e-4d86-9394-106392e44ab6 07:24:11: error updateChildData Exception: java.util.concurrent.TimeoutException: Execution time exceeded 20 app execution seconds: 122697220703384 @line -1 (doCall) 462b2885-9f0e-4d86-9394-106392e44ab6 07:23:55: error java.util.concurrent.TimeoutException: Execution time exceeded 20 app execution seconds: 122684980344219 @line -1 (sendExceptionData) 462b2885-9f0e-4d86-9394-106392e44ab6 07:23:55: error updateChildData Exception: java.util.concurrent.TimeoutException: Execution time exceeded 20 app execution seconds: 122684980344219 @line -1 (doCall) 462b2885-9f0e-4d86-9394-106392e44ab6 07:23:43: info ||| NST Manager (v5.6.3) | Updating Structure Data (Last Updated: 152 seconds ago) (async) 462b2885-9f0e-4d86-9394-106392e44ab6 07:23:17: info ||| NST Manager (v5.6.3) | No Device or Structure poll - Devices Last Updated: 19 seconds ago | Structures Last Updated 126 seconds ago
Virtual Thermostat
c600d124-bb6f-4bd4-87cc-1f69a8fa1126 07:37:16: error java.util.concurrent.TimeoutException: Execution time exceeded 20 app execution seconds: 123486254660337 @line -1 (doCall)
Things have not been good recently :(
I updated to the latest version yesterday and polling seems to be completely broken now.
What's with "Polling: (Not Active)"?!
This would be fine as i am still able to change the temperature on my thermostats, but i rely on the Nest weather data for some lighting automation which is not working due to this not updating :(
About Your Setup:
I use the Virtual Thermostat capability to run Zones and Nest Manager to override the default Nest Schedules and temps, I do have Nest in learning mode so it "learns" what Nest Manager is doing.
Over the past few days (not certain how many) but I've found that Nest Manager seems to stop receiving data from Nest - API, Network Status OK etc. I've got all alerts turned on currently but getting no notifications for this state.
The Nest Checked-In and Data Last Received though can be hours out of date.
Initialising the Automations seems to kick start everything and the updates start flowing.
SmartThings is F/W 000.024.00020
I'm at a bit of a loss how to debug this to be honest, because I'm getting to alerts and initialising the Automations resolves the problem I'm making an assumption that the Automation has hung? Is there somehow I can get more info? Will the logs show anything or that it's stopped?
Thanks for any pointers on this, the zoning is massively useful here.