home-assistant / core

🏡 Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Automations marked as "Still Running" After Upgrade to 2023.7 & 2023.8 #98073

Closed lux4rd0 closed 1 year ago

lux4rd0 commented 1 year ago

The problem

Automations are getting stuck in versions of HA core 2023.7.0 through 2023.8.1 after upgrading from 2023.6.3.

trace1

trace2

trace3

What version of Home Assistant Core has the issue?

core-2023.7.0, core-2023.7.1, core-2023.7.2, core-2023.7.3, core-2023.8.0, core-2023.8.1

What was the last working version of Home Assistant Core?

core-2023.6.3

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Z-wave and Zigbee

Link to integration documentation on our website

No response

Diagnostics information

home-assistant.log.2023.6.3.zip

home-assistant.log.2023.7.0.zip

Example YAML snippet

alias: Office Day On
description: ""
trigger:
  - platform: state
    entity_id: binary_sensor.office_motion_motion
    to: "on"
condition:
  - condition: or
    conditions:
      - condition: state
        entity_id: input_select.mode
        state: Home
action:
  - type: turn_on
    device_id: 6a7dd3564b68685d3dfbfbe0d2429237
    entity_id: light.office_floor_lamp
    domain: light
    brightness_pct: 100
  - type: turn_on
    device_id: 6b3ff3b7d4b89d95e2120316881e62ca
    entity_id: light.office_overhead
    domain: light
    brightness_pct: 100
  - service: zwave_js.bulk_set_partial_config_parameters
    data:
      parameter: "16"
      value: 33884694
    target:
      entity_id:
        - light.office_floor_lamp
        - light.office_overhead
    enabled: false
  - service: light.turn_on
    data:
      effect: Twinkle
    target:
      entity_id:
        - light.esp07_light_1
        - light.esp07_light_2
        - light.esp07_light_3
        - light.esp07_light_4
mode: single

Anything in the logs that might be useful for us?

No response

Additional information

Watching several other issues that have been opened and closed:

- https://github.com/home-assistant/core/issues/97965
- https://github.com/home-assistant/core/issues/97768
- https://github.com/home-assistant/core/issues/97721
- https://github.com/home-assistant/core/issues/97662
- https://github.com/home-assistant/core/issues/97581
- https://community.home-assistant.io/t/already-running-new-automation-bug/596654/16

lux4rd0 commented 1 year ago

I still see my Z-Wave devices go "dead," but a quick ping brings them back. I have an automation that runs to bring them back online, but it doesn't always work. If I manually click "ping" on the devices, they return to "Alive."

Before the last few months - I rarely, if ever, had Z-wave device issues. I believe it combines both Z-wave devices getting temporarily marked dead and the automation configuration of the 10-second time-out.

It looks like there's progress on the integration framework, but happy to replicate it again.

It's frustrating enough that I'm about to try and migrate all 54 of my devices from my husbzb-1 to a new Zooz 800 stick and Sonoff Zigbee stick. Knowing that there are some known issues with the older sticks, should I wait to see if there's a fix here or do the work to unpair and re-pair all of my devices?

Thanks again for all of the hard work tracking these items down!

kimmilde commented 1 year ago

I did an update a couple of days ago, and damn. I now have 4 automations that are running and frozen... I can't continue like this as it messes up everything in the house. Does anyone know how far I have to roll back?

raman325 commented 1 year ago

Confirmed that what we think is the fix will be in the beta. I can appreciate the need to be stable and to not create issues elsewhere, but if anyone who's running into this issue would be willing to join the beta and test, it would be much appreciated

mike240se commented 1 year ago

> I still see my Z-wave devices go "dead," but a quick ping brings them back. I have automation that runs to bring them back online, but they don't always work. If I manually click "ping" on the devices, they return to "Alive."
>
> Before the last few months - I rarely, if ever, had Z-wave device issues. I believe it combines both Z-wave devices getting temporarily marked dead and the automation configuration of the 10-second time-out.
>
> It looks like there's progress on the integration framework, but happy to replicate it again.
>
> It's frustrating enough that I'm about to try and migrate all 54 of my devices from my husbzb-1 to a new Zooz 800 stick and Sonoff Zigbee stick. Knowing that there are some known issues with the older sticks - should I wait to see if there's a fix here or do the work to unpair and repair all of my devices?
>
> Thanks again for all of the hard work tracking these items down!

I have the same issue: I need a ping automation to ping dead Z-Wave devices. I have a Silabs 700 stick. Please reply and let me know if changing to an 800 stick fixes the problem. I read that you can't restore and have to redo the entire network, though?? :(

mike240se commented 1 year ago

> Confirmed that what we think is the fix will be in the beta. I can appreciate the need to be stable and to not create issues elsewhere, but if anyone who's running into this issue would be willing to join the beta and test, it would be much appreciated

I would be willing to test but is it trivial to go back on the stable release train after trying a beta?

raman325 commented 1 year ago

It seems like people here really tried to dig into what was going on, which I appreciate, and so I will share the details of the problem/fix to reward you for your efforts (or if you don't care, feel free to ignore this message):

The problem (we think): TLDR it was a timing problem that was caused by having a distributed environment.

The driver: Z-Wave JS maintains node statuses for devices: alive/dead and awake/asleep. I believe the only transitions a device can make are within those pairs (you can learn more here: https://zwave-js.github.io/node-zwave-js/#/api/node?id=status). When a command is issued to a device that is alive or awake, the driver immediately passes the command along and presumably gets a response from the device. When a command is issued to a device that is asleep or dead, the driver adds the message to a queue and waits for the node to return to the alive or awake status before sending the message along.
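
To make the queueing behavior concrete, here is a minimal Python sketch of the idea. It is not the actual Z-Wave JS driver code (which is not written in Python); the `Node` class, its `send` and `set_status` methods, and the `READY` set are hypothetical names used only for illustration.

```python
from collections import deque
from enum import Enum


class NodeStatus(Enum):
    ALIVE = "alive"
    DEAD = "dead"
    AWAKE = "awake"
    ASLEEP = "asleep"


# Statuses in which a command is forwarded immediately (assumption for this sketch).
READY = {NodeStatus.ALIVE, NodeStatus.AWAKE}


class Node:
    """Hypothetical model of a node with a pending-command queue."""

    def __init__(self, status):
        self.status = status
        self.queue = deque()  # commands waiting for the node to become ready
        self.sent = []        # commands that were actually passed along

    def send(self, command):
        """Forward the command if the node is ready, otherwise queue it."""
        if self.status in READY:
            self.sent.append(command)
            return "sent"
        self.queue.append(command)
        return "queued"

    def set_status(self, status):
        """Update the status and flush queued commands once the node is ready."""
        self.status = status
        while self.status in READY and self.queue:
            self.sent.append(self.queue.popleft())
```

In this sketch, a command sent to a dead node sits in `queue` until `set_status` moves the node back to alive, at which point it is passed along in order.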

The integration library: When we know a message is immediately sent to the device, we'd like to wait for the response so we know what the result was; this is obviously particularly useful in the case of a failure. When we think a message is queued, we won't wait for the response because we don't know how long it's going to take and we don't want to hang indefinitely.

We think what is happening is that the library thinks the device is alive/awake when we send the command, so we wait for the response, but then the device goes to asleep/dead, so the driver queues the message. Basically, what we try to avoid with the above logic would happen, and the end result would be a hanging service call because of the removal of the 10 second timeout.

The potential fix: It was fortunately relatively simple - we'll still wait for the response if the device is awake/alive, but if the device changes statuses, the library will know and will stop waiting. The message is queued on the driver side still so presumably it will get sent at some point, but we aren't going to wait for that. That's effectively what we think the 10 second timeout was doing - we were in a state where we were waiting for more than ten seconds, so HA cancelled the call, but the message was still queued.
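
The "stop waiting when the status changes" idea can be sketched with plain `asyncio`. This is only an illustration under stated assumptions, not the library's real implementation: `send_and_wait`, the `response` future, and the `status_changed` event are hypothetical stand-ins for whatever the library uses internally.

```python
import asyncio


async def send_and_wait(response: asyncio.Future, status_changed: asyncio.Event) -> str:
    """Wait for the response, but give up as soon as the node changes status."""
    status_task = asyncio.ensure_future(status_changed.wait())
    done, _pending = await asyncio.wait(
        {response, status_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if response in done:
        # The device answered while it was still alive/awake.
        status_task.cancel()
        return response.result()
    # The node went asleep/dead: the command stays queued on the driver side,
    # but the caller is no longer blocked waiting for it.
    return "queued"


async def demo() -> list:
    loop = asyncio.get_running_loop()

    # Case 1: the device answers before any status change.
    response = loop.create_future()
    status_changed = asyncio.Event()
    loop.call_later(0.01, response.set_result, "ok")
    first = await send_and_wait(response, status_changed)

    # Case 2: the device changes status before answering.
    response = loop.create_future()
    status_changed = asyncio.Event()
    loop.call_later(0.01, status_changed.set)
    second = await send_and_wait(response, status_changed)

    return [first, second]


results = asyncio.run(demo())
```

Without the `status_changed` branch, the second case would wait on `response` indefinitely, which matches the hanging service call described above once the 10 second timeout was gone.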

Assuming this is the fix, thank you again to those who shared all of the details that helped us reach this conclusion, which wasn't obvious without having that level of detail.

kramttocs commented 1 year ago

Thanks @raman325 for that update and the work spent on this. That zwave instability issue sure does cause some pain across the board. :( Though adding more robustness like this sounds like a good thing in the long run regardless.

tdejneka commented 1 year ago

> When a command is issued to a device that is alive or awake, the driver immediately passes the command along and presumably gets a response from the device. When a command is issued to a device that is asleep or dead, then it adds the message to a queue and waits for the node to get back to the other statuses before sending the message along.

Would you happen to know what other integrations operate this way thereby making them susceptible to the bug?

For example, ZHA has been reported to cause the 'stuck' automation problem so does it use the same "queueing when asleep/dead" mechanism you described?

raman325 commented 1 year ago

> > When a command is issued to a device that is alive or awake, the driver immediately passes the command along and presumably gets a response from the device. When a command is issued to a device that is asleep or dead, then it adds the message to a queue and waits for the node to get back to the other statuses before sending the message along.
>
> Would you happen to know what other integrations operate this way thereby making them susceptible to the bug?
>
> For example, ZHA has been reported to cause the 'stuck' automation problem so does it use the same "queueing when asleep/dead" mechanism you described?

No idea, but it wouldn't surprise me if it's in a similar class of issues. These mesh networks are designed with unreliability in mind, so things probably do get queued. I think the general approach I described (when to wait versus when not to wait) is not specific to Z-Wave; it applies to any environment where you want service calls to return reasonably fast.

ravimohta commented 1 year ago

> Confirmed that what we think is the fix will be in the beta. I can appreciate the need to be stable and to not create issues elsewhere, but if anyone who's running into this issue would be willing to join the beta and test, it would be much appreciated

I would like to be part of the beta test for this as I am also facing this issue daily.

Kindly let me know how to join the beta release/test.

Thx

raman325 commented 1 year ago

> Thanks @raman325 for that update and the work spent on this. That zwave instability issue sure does cause some pain across the board. :( Though adding more robustness like this sounds like a good thing in the long run regardless.

  1. It wasn't me, it was a collection of people who figured it out, I just voted myself in as spokesperson 🙂
  2. Yes it's unfortunate that this happened but your last statement is 100% correct, this is ultimately better

raman325 commented 1 year ago

> Kindly let me know how to join the beta release/test.

Go to https://www.home-assistant.io/ and type "beta" in the search bar in the upper right. There will be multiple results, one for each installation method. Pick the one that you use and that should get you sorted.

emoses commented 1 year ago

I’m seeing the same issue with the 2023.9 beta that was released today. I didn’t have debug logging on but I’ll try again in the morning; I’ve got a script that turns a ton of zwave devices off that seems to hit this issue most times it’s triggered.

Edit: an improvement is that I seem to be able to cancel the script even though it was stuck

ridderr commented 1 year ago

I'm running Home Assistant 2023.8.4 and have a trace. I'm on the beta channel, so I hope to have the latest.

zwavejs_2023-08-31.log

trace automation.zonsondergang 2023-08-31T05_53_18.854585+00_00.json.txt

And I don't see that the script is cancelled.

raman325 commented 1 year ago

> I’m seeing the same issue with the 2023.9 beta that was released today. I didn’t have debug logging on but I’ll try again in the morning; I’ve got a script that turns a ton of zwave devices off that seems to hit this issue most times it’s triggered.
>
> Edit: an improvement is that I seem to be able to cancel the script even though it was stuck

are these battery devices or mains powered? Please do share the logging when you have a chance!

Anto79-ops commented 1 year ago

Hi, not sure if this is the right place, but I have a simple automation that uses an Ikea 2-button remote to turn a light (Ikea Tradfri plug) off and on. What is strange is that the 2-button remote fires ZHA events for the device AND the same triggers in the automation also fire (blink blue), but the automation does not complete the actions, nor is there a trace of the triggers firing. It's really strange. If I edit the automation (like adding a space and deleting it) and save it, it works again.

It's not an issue with the devices. They are online and working normally.

Similar comment here https://github.com/home-assistant/core/issues/98073#issuecomment-1673669128

home-assistant[bot] commented 1 year ago

Hey there @home-assistant/z-wave, mind taking a look at this issue as it has been labeled with an integration (zwave_js) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of `zwave_js` can trigger bot actions by commenting:

- `@home-assistant close` Closes the issue.
- `@home-assistant rename Awesome new title` Renames the issue.
- `@home-assistant reopen` Reopen the issue.
- `@home-assistant unassign zwave_js` Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


zwave_js documentation zwave_js source (message by IssueLinks)

home-assistant[bot] commented 1 year ago

Hey there @dmulcahey, @adminiuga, @puddly, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of `zha` can trigger bot actions by commenting:

- `@home-assistant close` Closes the issue.
- `@home-assistant rename Awesome new title` Renames the issue.
- `@home-assistant reopen` Reopen the issue.
- `@home-assistant unassign zha` Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


zha documentation zha source (message by IssueLinks)

raman325 commented 1 year ago

@Anto79-ops zha right? just want to make sure I am tagging the right folks

Anto79-ops commented 1 year ago

Yes, ZHA. It started happening in 2023.8.x. I'll see if I can get a screen video or something.

raman325 commented 1 year ago

rather than providing videos and screenshots, I would recommend pulling whatever logs you can and indicate where in the logs your automation started

Anto79-ops commented 1 year ago

ok thanks, I'm in beta now and so just restarted HA for b1, but now the automation works again. I will post back here when it stops working, thanks @raman325

emoses commented 1 year ago

OK, here are the logs. Home Assistant version 2023.9.0b0 (docker, raspberrypi4-homeassistant), zwave-js 11.13.0. The script is cancellable but marked "still running" after last night.

home-assistant_zwave_js_2023-09-01.headingup.log

trace script.1633675778464 2023-09-01T04 57 29.221692+00 00.json.txt

Edit: here are zwavejs logs for the same time period

zwaveui-2023-08-31.headingup.log

ridderr commented 1 year ago

Hi, latest versions running:

- Home Assistant 2023.9.0b3
- Supervisor 2023.08.3
- Operating System 10.5
- Frontend version: 20230901.0 - latest

After turning off a Z-Wave device, the automation got stuck turning off Sonoff devices.

- Triggered by the state of input_boolean.lightson at 4 September 2023 at 06:56:50
- If: else action executed
- Turn off Kandelaars Voordeur (switch.kandelaars_voordeur) turned off
- (light.ledstrip_garagedeur) turned off
- Turn off Ledstrip Overkapping (light.ledstrip_overkapping) turned off
- Still running

zwavejs_2023-09-04.log trace automation.zonsondergang 2023-09-04T04_56_50.552131+00_00.json.txt

raman325 commented 1 year ago

those of you in the beta, please update to b5 and see if that resolves the issue

lux4rd0 commented 1 year ago

Haven't had a chance to grab the beta yet - but I've been building a new instance of HA since I'm migrating to a SONOFF Zigbee and a Zooz 800 Z-Wave Stick to see about alleviating the talked about issues of my husbzb-1. This time I'm only using my own docker containers for HA (instead of HAOS), Zigbee2MQTT (instead of ZHA), and Z-Wave JS UI (instead of Z-Wave JS). I've moved all of my automations back from "restart" to "single" and I've not had a single stuck automation the entire time I've been testing as I migrate my 100+ devices.

deam0n commented 1 year ago

I'm also experiencing this issue. The last update did not solve it. If required I can also share my logs (although I'm not fully sure what logs I should get and where)

raman325 commented 1 year ago

> I'm also experiencing this issue.
>
> The last update did not solve it. If required I can also share my logs (although I'm not fully sure what logs I should get and where)

Is it specifically a Z-Wave device that's the issue? Please set the add-on/server log level to debug, as well as the log level for the integration and the library. We will need to see the debug logs from the moment the automation starts to a point where it's clear the automation won't finish.

ridderr commented 1 year ago

Hi, I'm running the latest version and have more problems than before. See the 4 attached automation traces and the Z-Wave log. It looks like even a scene no longer works. I have to restart HA to get things working again.

- Home Assistant 2023.9.0
- Supervisor 2023.08.3
- Operating System 10.5
- Frontend version: 20230906.1 - latest

trace automation.bedlamp_avond_aan 2023-09-07T19_30_00.153936+00_00.json.txt

trace automation.buitenlampen_tuin_uit 2023-09-07T18_30_00.396035+00_00..json.txt

trace automation.woonkamer_wakeup_aan 2023-09-07T18_09_31.909739+00_00.json.txt

trace automation.zonsondergang 2023-09-07T18_09_31.909058+00_00.json.txt

zwavejs_2023-09-07.log

raman325 commented 1 year ago

> zwavejs_2023-09-07.log

can you set the addon log level to debug and reupload if/when this happens again?

ridderr commented 1 year ago

Yes, debug was on. I only had a filter that limited logging to a few Z-Wave nodes, which I have now removed.

raman325 commented 1 year ago

thanks, please share with the filter off if/when the issue happens again. I am having a hard time parsing the first logs you uploaded.

ridderr commented 1 year ago

@raman325, I have now been running for 3 days with no problems on:

- Home Assistant 2023.9.1
- Supervisor 2023.09.0
- Operating System 10.5
- Frontend version: 20230908.0 - latest

MartinHjelmare commented 1 year ago

We think this is solved for Z-Wave now. I'll close this issue to make it easier to track whether further work is needed.

If you still have a problem please open a new issue and describe what device for what integration is affected, with as much data as possible, debug logs etc, so we can triage the problem appropriately.