Closed: lux4rd0 closed this issue 1 year ago
I still see my Z-Wave devices go "dead," but a quick ping brings them back. I have an automation that runs to bring them back online, but it doesn't always work. If I manually click "ping" on the devices, they return to "Alive."
Before the last few months I rarely, if ever, had Z-Wave device issues. I believe it's a combination of Z-Wave devices getting temporarily marked dead and the 10-second time-out in the automation configuration.
It looks like there's progress on the integration framework, but I'm happy to replicate it again.
It's frustrating enough that I'm about to try to migrate all 54 of my devices from my husbzb-1 to a new Zooz 800 stick and a Sonoff Zigbee stick. Knowing that there are some known issues with the older sticks, should I wait to see if there's a fix here, or do the work to unpair and re-pair all of my devices?
Thanks again for all of the hard work tracking these items down!
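For context, a "ping dead devices" automation like the one described above typically looks something like the following minimal sketch. The entity IDs here are made up, and it assumes the zwave_js integration exposes a node status sensor and a ping button entity for the device (both may be disabled by default):

```yaml
# Hypothetical sketch of a "revive dead Z-Wave nodes" automation.
# Entity IDs are illustrative; adjust them to your own devices.
alias: Ping dead Z-Wave devices
trigger:
  - platform: state
    entity_id: sensor.living_room_dimmer_node_status
    to: "dead"
action:
  - service: button.press
    target:
      entity_id: button.living_room_dimmer_ping
mode: single
```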
Did an update a couple of days ago, and damn. I now have 4 automations that are running and have frozen... I can't continue like this, as it messes up everything in the house. Anyone know how far I have to roll back?
Confirmed that what we think is the fix will be in the beta. I can appreciate the need to be stable and not create issues elsewhere, but if anyone who's running into this issue would be willing to join the beta and test, it would be much appreciated.
I have the same issue with requiring a ping automation to ping dead Z-Wave devices. I have a Silabs 700 stick. Please reply and let me know if changing to an 800 stick fixes the problem. I read you can't restore and have to redo the entire network though?? :(
I would be willing to test, but is it trivial to go back to the stable release train after trying a beta?
It seems like people here really tried to dig into what was going on, which I appreciate, so I will share the details of the problem/fix to reward you for your efforts (or, if you don't care, feel free to ignore this message):
The problem (we think): TL;DR, it was a timing problem caused by having a distributed environment.
The driver: Z-Wave JS maintains node statuses for devices: alive/dead and awake/asleep. I believe the only transitions a device can make are between those pairs (you can learn more here: https://zwave-js.github.io/node-zwave-js/#/api/node?id=status). When a command is issued to a device that is alive or awake, the driver immediately passes the command along and presumably gets a response from the device. When a command is issued to a device that is asleep or dead, it adds the message to a queue and waits for the node to return to the other status before sending the message along.
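In rough Python terms, the driver-side behavior described above could be sketched like this (an illustrative model only, not the actual Z-Wave JS internals, which are written in TypeScript):

```python
import asyncio
from collections import deque


class Node:
    """Minimal model of the driver behavior described above."""

    def __init__(self, status: str = "alive"):
        self.status = status          # "alive", "dead", "awake", or "asleep"
        self.queue: deque = deque()   # commands held for unreachable nodes

    async def send_command(self, command: str):
        if self.status in ("alive", "awake"):
            # Reachable: pass the command along and return the response.
            return await self._transmit(command)
        # Asleep/dead: queue the message until the node comes back.
        self.queue.append(command)
        return None

    async def _transmit(self, command: str) -> str:
        await asyncio.sleep(0)        # stand-in for the radio round trip
        return f"ack:{command}"
```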
The integration library: When we know a message is immediately sent to the device, we'd like to wait for the response so we know what the result was; this is particularly useful in the case of a failure. When we think a message is queued, we won't wait for the response, because we don't know how long it will take and we don't want to hang indefinitely.
We think what was happening is that the library thinks the device is alive/awake when we send the command, so we wait for the response, but then the device goes asleep/dead, so the driver queues the message. Basically, exactly what we try to avoid with the above logic would happen, and the end result was a hanging service call, because of the removal of the 10-second timeout.
The potential fix: It was fortunately relatively simple: we'll still wait for the response if the device is awake/alive, but if the device changes status, the library will know and will stop waiting. The message is still queued on the driver side, so presumably it will get sent at some point, but we aren't going to wait for that. That's effectively what the 10-second timeout was doing: we were in a state where we were waiting for more than ten seconds, so HA cancelled the call, but the message was still queued.
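A hedged sketch of that approach (illustrative asyncio code under assumed names, not the actual zwave-js-server-python API): race the device's response against a status-change notification, and stop waiting if the status flips first.

```python
import asyncio


async def call_and_wait(response: "asyncio.Future[str]",
                        status_changed: asyncio.Event):
    """Wait for the device's response, but stop waiting (leaving the
    driver-side queued message alone) if the node's status changes to
    asleep/dead while we're waiting."""
    response_task = asyncio.ensure_future(response)
    status_task = asyncio.create_task(status_changed.wait())
    done, pending = await asyncio.wait(
        {response_task, status_task},
        return_when=asyncio.FIRST_COMPLETED,
    )
    for task in pending:
        task.cancel()
    if response_task in done:
        return response_task.result()
    # The status changed first: the message is queued on the driver,
    # so give up waiting rather than hang the service call.
    return None
```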
Assuming this is the fix, thank you again to those who shared all of the details that helped us reach this conclusion, which wasn't obvious without having that level of detail.
Thanks @raman325 for that update and the work spent on this. That zwave instability issue sure does cause some pain across the board. :( Though adding more robustness like this sounds like a good thing in the long run regardless.
> When a command is issued to a device that is alive or awake, the driver immediately passes the command along and presumably gets a response from the device. When a command is issued to a device that is asleep or dead, then it adds the message to a queue and waits for the node to get back to the other statuses before sending the message along.
Would you happen to know what other integrations operate this way thereby making them susceptible to the bug?
For example, ZHA has been reported to cause the 'stuck' automation problem so does it use the same "queueing when asleep/dead" mechanism you described?
No idea, but it wouldn't surprise me if it's a similar class of issue. These mesh networks are designed around unreliability, so things probably do get queued, and I think the general approach I described (when to wait versus when not to wait) isn't specific to Z-Wave in an environment where you want service calls to return reasonably fast.
I would like to be part of the beta test for this as I am also facing this issue daily.
Kindly let me know how to join the beta release/test.
Thx
go to https://www.home-assistant.io/ and type "beta" in the upper-right search bar. There will be multiple results, one for each installation method. Pick the one that you use, and that should get you sorted.
I’m seeing the same issue with the 2023.9 beta that was released today. I didn’t have debug logging on but I’ll try again in the morning; I’ve got a script that turns a ton of zwave devices off that seems to hit this issue most times it’s triggered.
Edit: an improvement is that I seem to be able to cancel the script even though it was stuck
I'm running with Home Assistant 2023.8.4 and have a trace. I'm on the beta channel, so I hope to have the latest.
zwavejs_2023-08-31.log trace automation.zonsondergang 2023-08-31T05_53_18.854585+00_00.json.txt
And I don't see that the script is cancelled.
> I’m seeing the same issue with the 2023.9 beta that was released today. I didn’t have debug logging on but I’ll try again in the morning; I’ve got a script that turns a ton of zwave devices off that seems to hit this issue most times it’s triggered.
> Edit: an improvement is that I seem to be able to cancel the script even though it was stuck
are these battery devices or mains powered? Please do share the logging when you have a chance!
Hi, not sure if this is the right place, but I have a simple automation that uses an Ikea 2-button remote to turn a light (Ikea Tradfri plug) on/off. What is strange is that the 2-button remote fires its triggers in ZHA events for the device AND the same triggers fire in the automation (blink blue), but the automation does not complete the actions, nor is there a trace of the triggers being triggered. It's really strange. If I edit the automation (like add a space and delete a space) and save it, it works again.
It's not an issue with the devices. They are online and working normally.
Similar comment here https://github.com/home-assistant/core/issues/98073#issuecomment-1673669128
Hey there @home-assistant/z-wave, mind taking a look at this issue as it has been labeled with an integration (zwave_js) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
zwave_js documentation zwave_js source (message by IssueLinks)
Hey there @dmulcahey, @adminiuga, @puddly, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
zha documentation zha source (message by IssueLinks)
@Anto79-ops zha right? just want to make sure I am tagging the right folks
yes, ZHA. Started happening in 2023.8.x. I'll see if I can get a screen video or something.
rather than providing videos and screenshots, I would recommend pulling whatever logs you can and indicate where in the logs your automation started
ok thanks, I'm in beta now and so just restarted HA for b1, but now the automation works again. I will post back here when it stops working, thanks @raman325
OK here's logs. Home assistant version 2023.9.0b0 (docker, raspberrypi4-homeassistant), zwave-js 11.13.0. The script is cancellable but marked "still running" after last night. home-assistant_zwave_js_2023-09-01.headingup.log trace script.1633675778464 2023-09-01T04 57 29.221692+00 00.json.txt
Edit: here are zwavejs logs for the same time period
Hi, latest versions running: Home Assistant 2023.9.0b3, Supervisor 2023.08.3, Operating System 10.5, Frontend version: 20230901.0 - latest
After turning off a Z-Wave device automation got stuck on turning off Sonoff devices.
Triggered by the state of input_boolean.lightson on 4 September 2023 at 06:56:50
If: else action executed
Kandelaars Voordeur uitzetten (switch.kandelaars_voordeur) turned off
(light.ledstrip_garagedeur) turned off
Ledstrip Overkapping uitzetten (light.ledstrip_overkapping) turned off
Still running
zwavejs_2023-09-04.log trace automation.zonsondergang 2023-09-04T04_56_50.552131+00_00.json.txt
those of you in the beta, please update to b5 and see if that resolves the issue
Haven't had a chance to grab the beta yet, but I've been building a new instance of HA since I'm migrating to a Sonoff Zigbee stick and a Zooz 800 Z-Wave stick to see about alleviating the much-discussed issues with my husbzb-1. This time I'm only using my own Docker containers for HA (instead of HAOS), Zigbee2MQTT (instead of ZHA), and Z-Wave JS UI (instead of Z-Wave JS). I've moved all of my automations back from "restart" to "single", and I've not had a single stuck automation the entire time I've been testing as I migrate my 100+ devices.
I'm also experiencing this issue. The last update did not solve it. If required I can also share my logs (although I'm not fully sure what logs I should get and where)
It's specifically a Z-Wave device that's the issue? Please set the addon/server log level to debug, as well as the integration and the lib. We will need to see the debug logs from the moment the automation starts to a point where it's clear the automation won't finish.
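For anyone unsure where to turn this on, debug logging for the integration and the library can also be enabled in configuration.yaml; a sketch (the logger names below are what the zwave_js integration and the zwave-js-server-python library log under, to the best of my knowledge; verify against the integration docs):

```yaml
# configuration.yaml sketch: enable debug logs for the zwave_js
# integration and its underlying library.
logger:
  default: info
  logs:
    homeassistant.components.zwave_js: debug
    zwave_js_server: debug
```

The addon/server log level is set separately, in the Z-Wave JS addon configuration.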
Hi, running the latest version and having more problems than before. See the attached 4 automation traces and Z-Wave log. It even looks like a scene is not working anymore. I have to restart HA to get things working again.
Home Assistant 2023.9.0, Supervisor 2023.08.3, Operating System 10.5, Frontend version: 20230906.1 - latest
trace automation.bedlamp_avond_aan 2023-09-07T19_30_00.153936+00_00.json.txt
trace automation.buitenlampen_tuin_uit 2023-09-07T18_30_00.396035+00_00..json.txt
trace automation.woonkamer_wakeup_aan 2023-09-07T18_09_31.909739+00_00.json.txt
trace automation.zonsondergang 2023-09-07T18_09_31.909058+00_00.json.txt
can you set the addon log level to debug and reupload if/when this happens again?
Yes, Debug was on. I only had a filter to log only a few Z-Wave nodes which I have removed now.
thanks, please share with the filter off if/when the issue happens again. I am having a hard time parsing the first logs you uploaded.
@raman325, running now for 3 days with no problems on: Home Assistant 2023.9.1, Supervisor 2023.09.0, Operating System 10.5, Frontend version: 20230908.0 - latest
We think this is solved for Z-Wave now. I'll close this now to make it easier to track whether further work is needed.
If you still have a problem, please open a new issue and describe what device and what integration are affected, with as much data as possible (debug logs, etc.), so we can triage the problem appropriately.
The problem
Automations are getting stuck in versions of HA core 2023.7.0 through 2023.8.1 after upgrading from 2023.6.3.
What version of Home Assistant Core has the issue?
core-2023.7.0, core-2023.7.1, core-2023.7.2, core-2023.7.3, core-2023.8.0, core-2023.8.1
What was the last working version of Home Assistant Core?
core-2023.6.3
What type of installation are you running?
Home Assistant OS
Integration causing the issue
Z-wave and Zigbee
Link to integration documentation on our website
No response
Diagnostics information
home-assistant.log.2023.6.3.zip
home-assistant.log.2023.7.0.zip
Example YAML snippet
Anything in the logs that might be useful for us?
No response
Additional information
Watching several other issues that have been opened and closed:
https://github.com/home-assistant/core/issues/97965
https://github.com/home-assistant/core/issues/97768
https://github.com/home-assistant/core/issues/97721
https://github.com/home-assistant/core/issues/97662
https://github.com/home-assistant/core/issues/97581
https://community.home-assistant.io/t/already-running-new-automation-bug/596654/16