home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Automations marked as "Still Running" After Upgrade to 2023.7 & 2023.8 #98073

Closed lux4rd0 closed 1 year ago

lux4rd0 commented 1 year ago

The problem

Automations are getting stuck in versions of HA core 2023.7.0 through 2023.8.1 after upgrading from 2023.6.3.

Trace screenshots attached: trace1, trace2, trace3.

What version of Home Assistant Core has the issue?

core-2023.7.0, core-2023.7.1, core-2023.7.2, core-2023.7.3, core-2023.8.0, core-2023.8.1

What was the last working version of Home Assistant Core?

core-2023.6.3

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Z-wave and Zigbee

Link to integration documentation on our website

No response

Diagnostics information

home-assistant.log.2023.6.3.zip

home-assistant.log.2023.7.0.zip

Example YAML snippet

alias: Office Day On
description: ""
trigger:
  - platform: state
    entity_id: binary_sensor.office_motion_motion
    to: "on"
condition:
  - condition: or
    conditions:
      - condition: state
        entity_id: input_select.mode
        state: Home
action:
  - type: turn_on
    device_id: 6a7dd3564b68685d3dfbfbe0d2429237
    entity_id: light.office_floor_lamp
    domain: light
    brightness_pct: 100
  - type: turn_on
    device_id: 6b3ff3b7d4b89d95e2120316881e62ca
    entity_id: light.office_overhead
    domain: light
    brightness_pct: 100
  - service: zwave_js.bulk_set_partial_config_parameters
    data:
      parameter: "16"
      value: 33884694
    target:
      entity_id:
        - light.office_floor_lamp
        - light.office_overhead
    enabled: false
  - service: light.turn_on
    data:
      effect: Twinkle
    target:
      entity_id:
        - light.esp07_light_1
        - light.esp07_light_2
        - light.esp07_light_3
        - light.esp07_light_4
mode: single

Anything in the logs that might be useful for us?

No response

Additional information

Watching several other issues that have been opened and closed:

https://github.com/home-assistant/core/issues/97965
https://github.com/home-assistant/core/issues/97768
https://github.com/home-assistant/core/issues/97721
https://github.com/home-assistant/core/issues/97662
https://github.com/home-assistant/core/issues/97581
https://community.home-assistant.io/t/already-running-new-automation-bug/596654/16

home-assistant[bot] commented 1 year ago

Hey there @home-assistant/core, mind taking a look at this issue as it has been labeled with an integration (automation) you are listed as a code owner for? Thanks!

Code owner commands: code owners of `automation` can trigger bot actions by commenting:

- `@home-assistant close` Closes the issue.
- `@home-assistant rename Awesome new title` Renames the issue.
- `@home-assistant reopen` Reopens the issue.
- `@home-assistant unassign automation` Removes the current integration label and assignees on the issue; add the integration domain after the command.

(message by CodeOwnersMention)


automation documentation, automation source (message by IssueLinks)

jvd-nl commented 1 year ago

Might have the same issue here. When mode=restart, the automation is never restarted (no new trace, no updated timestamp), making a Home Assistant restart the only way to fix this. There is no relevant logging.

criticallimit commented 1 year ago

Don't know if it is the same issue, but sometimes my automations are not triggered even though the triggering device shows the state change in the logbook. All Zigbee devices. The automations run without faults when triggered manually.

jvd-nl commented 1 year ago

Manually running doesn't work here either.

lux4rd0 commented 1 year ago

Upgraded to 2023.8.3 to try things out... still happening. Providing some trace files:

trace automation.sunroom_day_on 2023-08-12T17_28_39.484164+00_00.json.zip trace automation.butlers_pantry_day_off 2023-08-12T17_34_35.287346+00_00.json.zip

These are two traces that get shown as "still running".

The difference this time around is that I can manually disable the affected automation without the automation system becoming unresponsive.

I had four automations total get stuck in a "still running" state within about two hours of my upgrade. Each could be disabled and re-enabled without having to restart HA Core.

raym777 commented 1 year ago

I have also started seeing this issue. I haven't found an obvious reason in the logs, but my automations are now hanging regularly. It's not the same ones; it could be any of them, at random times, but for me it also appears related to switching Z-Wave devices.

asayler commented 1 year ago

Seeing the same issue. Commented on #97721.

allenporter commented 1 year ago

@lux4rd0 In your trace examples that aren't finished, is there another trace that happened around the same time? That is, my impression is that the mode behavior is about when an automation is triggered multiple times in parallel, so I'm wondering if there was another action somewhere else that fired around the same time, or if it's just that this stuck action is the one preventing others from happening.

HVR88 commented 1 year ago

I can get the "still running" even if the mode is set to queued.

I'm seeing other automations fail or partially fail without the "still running" too (only since 2023.7.x), but I'll have to track that down for a different report.

lux4rd0 commented 1 year ago

@lux4rd0 In your trace examples that aren't finished, is there another trace that happened around the same time? That is, my impression is that the mode behavior is about when an automation is triggered multiple times in parallel, so I'm wondering if there was another action somewhere else that fired around the same time, or if it's just that this stuck action is the one preventing others from happening.

If one automation gets stuck, it's just for that one automation. Eventually, I'll get a second, third, or fourth one. But there's no dependency that I can determine. It's also not the same automation, or the same stacked automation order, every time.

When an automation gets stuck, the "already running" message is a symptom, not the problem itself.

I will say that 2023.8.2 has been nice because I can disable and enable the "stuck" automation without rebooting HA. That was not the case in the previous 2023.7 / 2023.8.0/1 versions.
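The same disable/re-enable can also be scripted as service calls; a rough sketch (automation.sunroom_day_on is borrowed from the traces above, and my understanding is that stop_actions: true, the default, also aborts the stuck run):

- service: automation.turn_off
  data:
    stop_actions: true
  target:
    entity_id: automation.sunroom_day_on
- service: automation.turn_on
  target:
    entity_id: automation.sunroom_day_on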

Screenshot (2023-08-14 at 5:52:59 PM)

Using this Dashboard:

views:
  - title: Home
    cards:
      - type: custom:auto-entities
        card:
          show_name: true
          show_icon: false
          show_state: false
          type: glance
          title: Running Automations
          columns: 1
        filter:
          include:
            - domain: automation
          exclude:
            - attributes:
                current: 0
        sort:
          method: last_triggered
        show_empty: true
title: Running Automations

I can click on "running automation" and disable and re-enable it from the panel. I might get about an hour or two before something gets stuck again...

trace automation.sunroom_day_on 2023-08-14T22_34_38.501316+00_00.json.zip

That's the trace for the "Sunroom Day On" that's marked as "Still Running"

allenporter commented 1 year ago

So far it seems like device actions for light with brightness_pct set seem to be present in most reports of this. Perhaps, though, it's just common to have an automation do that.

I might suggest trying variations on the automations, e.g. non-device actions, or other device actions that don't change a light.
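As a sketch, the device action from the snippet above could be swapped for the equivalent service call (same entity as the earlier example, no device_id):

action:
  - service: light.turn_on
    target:
      entity_id: light.office_floor_lamp
    data:
      brightness_pct: 100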

lux4rd0 commented 1 year ago

So far it seems like device actions for light with brightness_pct set seem to be present in most reports of this. Perhaps, though, it's just common to have an automation do that.

I might suggest trying variations on the automations, e.g. non-device actions, or other device actions that don't change a light.

I've had automations get stuck that only turn off lights, so that's not 100% the case. However, all of my turn-on actions have brightness percentages because I have both "Day" and "Night" automations (100% and 10% brightness).

kramttocs commented 1 year ago

So far it seems like device actions for light with brightness_pct set seem to be present in most reports of this. Perhaps, though, it's just common to have an automation do that.

I might suggest trying variations on the automations, e.g. non-device actions, or other device actions that don't change a light.

Mine just turns a group (a helper group of two non-dimming Z-Wave light switches) off, so no explicit percentages are specified. Assuming off != 0%.

alray31 commented 1 year ago

So far it seems like device actions for light with brightness_pct set seem to be present in most reports of this. Perhaps, though, it's just common to have an automation do that. I might suggest trying variations on the automations, e.g. non-device actions, or other device actions that don't change a light.

I've had automations get stuck that only turn off lights, so that's not 100% the case.

Same here. One of my automations that gets stuck in the "still running" state only does, as its last action, a service call to turn off a Z-Wave switch, preceded by a service call to turn off many lights.

I can also tell you that the switch is one that is often affected by the dead-node issue. Before, that automation would simply fail and log an error about the node being unresponsive. Now it just hangs forever in the "still running" state until I reboot HA or reboot Z-Wave JS UI.


lux4rd0 commented 1 year ago

I've made a test automation that simply uses a service call to send my phone a notification instead of calling the Z-Wave service. That particular one hasn't hung yet... I'll keep monitoring.
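Roughly like this, sketched with a hypothetical notify service name (the motion sensor entity is from the original snippet):

alias: Z-Wave-free test
trigger:
  - platform: state
    entity_id: binary_sensor.office_motion_motion
    to: "on"
action:
  - service: notify.mobile_app_phone
    data:
      message: Test automation fired
mode: single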

allenporter commented 1 year ago

Perhaps you can confirm whether one of the symptoms is really related to flaky devices.

I am reading some of the changes, and it seems like the previous behavior was that the service would wait for a timeout and then proceed anyway, even if the call timed out. I think what we should do instead is time out explicitly and fail, and allow use of continue_on_error to continue anyway.
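For reference, continue_on_error is set per action; a minimal sketch with a hypothetical entity id:

action:
  - service: light.turn_off
    continue_on_error: true
    target:
      entity_id: light.flaky_zwave_switch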

asayler commented 1 year ago

I managed to catch one of the traces that gets stuck running:

{
  "trace": {
    "last_step": "action/0",
    "run_id": "81ef4eb99807a72c7fb2d588625899eb",
    "state": "running",
    "script_execution": null,
    "timestamp": {
      "start": "2023-08-15T12:07:11.510689+00:00",
      "finish": null
    },
    "domain": "automation",
    "item_id": "1687041579731",
    "trigger": "state of input_select.tod_mode",
    "trace": {
      "trigger/0": [
        {
          "path": "trigger/0",
          "timestamp": "2023-08-15T12:07:11.510815+00:00",
          "changed_variables": {
            "this": {
              "entity_id": "automation.outdoor_lights_off_for_day",
              "state": "on",
              "attributes": {
                "id": "1687041579731",
                "last_triggered": "2023-08-14T12:06:13.259008+00:00",
                "mode": "single",
                "current": 0,
                "friendly_name": "Outdoor Lights Off for Day"
              },
              "last_changed": "2023-08-14T04:38:49.568511+00:00",
              "last_updated": "2023-08-14T12:06:15.172513+00:00",
              "context": {
                "id": "01H7SZ110AG2F3SX5X8YTSGZT6",
                "parent_id": "01H7SZ10ZW7MXMM3RGQSQW1C06",
                "user_id": null
              }
            },
            "trigger": {
              "id": "0",
              "idx": "0",
              "alias": null,
              "platform": "state",
              "entity_id": "input_select.tod_mode",
              "from_state": {
                "entity_id": "input_select.tod_mode",
                "state": "Overnight",
                "attributes": {
                  "options": [
                    "Day",
                    "Evening",
                    "Overnight"
                  ],
                  "editable": true,
                  "icon": "mdi:theme-light-dark",
                  "friendly_name": "ToD Mode"
                },
                "last_changed": "2023-08-15T06:00:00.358548+00:00",
                "last_updated": "2023-08-15T06:00:00.358548+00:00",
                "context": {
                  "id": "01H7VWF6337XMQ8MATNX46FNGY",
                  "parent_id": null,
                  "user_id": null
                }
              },
              "to_state": {
                "entity_id": "input_select.tod_mode",
                "state": "Day",
                "attributes": {
                  "options": [
                    "Day",
                    "Evening",
                    "Overnight"
                  ],
                  "editable": true,
                  "icon": "mdi:theme-light-dark",
                  "friendly_name": "ToD Mode"
                },
                "last_changed": "2023-08-15T12:07:11.510254+00:00",
                "last_updated": "2023-08-15T12:07:11.510254+00:00",
                "context": {
                  "id": "01H7WHFGWK3VJCDY59DT4DSXQE",
                  "parent_id": null,
                  "user_id": null
                }
              },
              "for": null,
              "attribute": null,
              "description": "state of input_select.tod_mode"
            }
          }
        }
      ],
      "action/0": [
        {
          "path": "action/0",
          "timestamp": "2023-08-15T12:07:11.511408+00:00",
          "changed_variables": {
            "context": {
              "id": "01H7WHFGWPSQN0X99QKDNWQNNA",
              "parent_id": "01H7WHFGWK3VJCDY59DT4DSXQE",
              "user_id": null
            }
          },
          "result": {
            "params": {
              "domain": "scene",
              "service": "turn_on",
              "service_data": {},
              "target": {
                "entity_id": [
                  "scene.outdoor_day"
                ]
              }
            },
            "running_script": false
          }
        }
      ]
    },
    "config": {
      "id": "1687041579731",
      "alias": "Outdoor Lights Off for Day",
      "description": "",
      "trigger": [
        {
          "platform": "state",
          "entity_id": [
            "input_select.tod_mode"
          ],
          "to": "Day"
        }
      ],
      "condition": [],
      "action": [
        {
          "service": "scene.turn_on",
          "target": {
            "entity_id": "scene.outdoor_day"
          },
          "metadata": {}
        }
      ],
      "mode": "single"
    },
    "blueprint_inputs": null,
    "context": {
      "id": "01H7WHFGWPSQN0X99QKDNWQNNA",
      "parent_id": "01H7WHFGWK3VJCDY59DT4DSXQE",
      "user_id": null
    }
  },
  "logbookEntries": [
    {
      "name": "Outdoor Lights Off for Day",
      "message": "triggered by state of input_select.tod_mode",
      "source": "state of input_select.tod_mode",
      "entity_id": "automation.outdoor_lights_off_for_day",
      "context_id": "01H7WHFGWPSQN0X99QKDNWQNNA",
      "when": 1692101231.510949,
      "domain": "automation"
    },
    {
      "when": 1692101231.512982,
      "state": "2023-08-15T12:07:11.512857+00:00",
      "entity_id": "scene.outdoor_day",
      "icon": "mdi:lightbulb-group-off",
      "context_event_type": "automation_triggered",
      "context_domain": "automation",
      "context_name": "Outdoor Lights Off for Day",
      "context_message": "triggered by state of input_select.tod_mode",
      "context_source": "state of input_select.tod_mode",
      "context_entity_id": "automation.outdoor_lights_off_for_day"
    },
    {
      "when": 1692101231.606344,
      "state": "off",
      "entity_id": "light.back_porch_light",
      "context_event_type": "automation_triggered",
      "context_domain": "automation",
      "context_name": "Outdoor Lights Off for Day",
      "context_message": "triggered by state of input_select.tod_mode",
      "context_source": "state of input_select.tod_mode",
      "context_entity_id": "automation.outdoor_lights_off_for_day"
    },
    {
      "when": 1692101231.752126,
      "state": "off",
      "entity_id": "light.front_porch_light",
      "context_event_type": "automation_triggered",
      "context_domain": "automation",
      "context_name": "Outdoor Lights Off for Day",
      "context_message": "triggered by state of input_select.tod_mode",
      "context_source": "state of input_select.tod_mode",
      "context_entity_id": "automation.outdoor_lights_off_for_day"
    },
    {
      "when": 1692101232.626016,
      "state": "off",
      "entity_id": "light.deck_down_lights_white",
      "context_event_type": "automation_triggered",
      "context_domain": "automation",
      "context_name": "Outdoor Lights Off for Day",
      "context_message": "triggered by state of input_select.tod_mode",
      "context_source": "state of input_select.tod_mode",
      "context_entity_id": "automation.outdoor_lights_off_for_day"
    },
    {
      "when": 1692101232.819127,
      "state": "unavailable",
      "entity_id": "light.deck_down_lights_white",
      "context_event_type": "automation_triggered",
      "context_domain": "automation",
      "context_name": "Outdoor Lights Off for Day",
      "context_message": "triggered by state of input_select.tod_mode",
      "context_source": "state of input_select.tod_mode",
      "context_entity_id": "automation.outdoor_lights_off_for_day"
    },
    {
      "when": 1692101234.525657,
      "state": "off",
      "entity_id": "light.driveway_light",
      "context_event_type": "automation_triggered",
      "context_domain": "automation",
      "context_name": "Outdoor Lights Off for Day",
      "context_message": "triggered by state of input_select.tod_mode",
      "context_source": "state of input_select.tod_mode",
      "context_entity_id": "automation.outdoor_lights_off_for_day"
    },
    {
      "when": 1692101234.534319,
      "state": "off",
      "entity_id": "light.patio_door_light",
      "context_event_type": "automation_triggered",
      "context_domain": "automation",
      "context_name": "Outdoor Lights Off for Day",
      "context_message": "triggered by state of input_select.tod_mode",
      "context_source": "state of input_select.tod_mode",
      "context_entity_id": "automation.outdoor_lights_off_for_day"
    },
    {
      "when": 1692101235.525013,
      "state": "off",
      "entity_id": "light.deck_up_lights_white",
      "context_event_type": "automation_triggered",
      "context_domain": "automation",
      "context_name": "Outdoor Lights Off for Day",
      "context_message": "triggered by state of input_select.tod_mode",
      "context_source": "state of input_select.tod_mode",
      "context_entity_id": "automation.outdoor_lights_off_for_day"
    }
  ]
}

The automation it corresponds to is as follows:

alias: Outdoor Lights Off for Day
description: ""
trigger:
  - platform: state
    entity_id:
      - input_select.tod_mode
    to: Day
condition: []
action:
  - service: scene.turn_on
    target:
      entity_id: scene.outdoor_day
    metadata: {}
mode: single

The scene in question controls a set of Z-wave devices. It looks like one of them failed to trigger (and got marked as unavailable by Z-wave JS) during the execution.

I agree that in cases like this, it would be better for the automation to time out and log an error rather than run indefinitely.

steve28 commented 1 year ago

I started seeing this sometime in the 2023.7 series as well. I see it in two automations that are simply motion lights. I.e., turn on with motion, turn off with no-motion for X min. Both are zigbee motion sensors triggering z-wave dimmer switches. One of them issues turn_on/turn_off, and the other issues turn_on with brightness parameters depending on time of day.
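The "turn off after no motion for X min" half of that pattern looks roughly like this (a sketch, hypothetical entity ids):

trigger:
  - platform: state
    entity_id: binary_sensor.hall_motion
    to: "off"
    for: "00:05:00"
action:
  - service: light.turn_off
    target:
      entity_id: light.hall_dimmer
mode: single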

lux4rd0 commented 1 year ago

I just had two different automations, both using the same Zigbee motion sensor, trigger a Z-Wave light and an SMTP notification. The Z-Wave action got stuck; the SMTP notification worked fine.

Clearly something is hanging up with Z-Wave devices. All of my Z-Wave devices are online.

iDontWantAUsername commented 1 year ago

Given that the linked PR (https://github.com/home-assistant/core/pull/98501) has been rejected, does this mean the issue is one for the owners of light/switch turn_off to fix?

For some context, my case (https://github.com/home-assistant/core/issues/98491) seems to be that Z-Wave (and probably Zigbee) devices go to a state other than Off, such as Unavailable, after the turn_off command is sent.

At least with scripts we can monitor whether one has been running for more than a certain amount of time and then call the script.turn_off service (see the sketch below). With automations that get stuck running, it is impossible to see and kill them without going in and editing the automation or reloading all automations.
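A sketch of that script watchdog (script.outdoor_scene is a hypothetical entity id; a script entity reports state "on" while it is running):

alias: Stop stuck script
trigger:
  - platform: state
    entity_id: script.outdoor_scene
    to: "on"
    for: "00:02:00"
action:
  - service: script.turn_off
    target:
      entity_id: script.outdoor_scene
mode: single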

allenporter commented 1 year ago

Thanks for following up, apologies for the delay.

The decision is that we'll only fix the root causes of the integration bugs. (The timeout was meant to be a stopgap, but it doesn't fix the underlying issue, so we'll switch focus to this.)

@lux4rd0 I think we should split this into one issue for Z-Wave and one for Zigbee, and we'll need to get the integration-specific part of the issue spelled out. (It may be already; I have not re-reviewed this issue.)

allenporter commented 1 year ago

@iDontWantAUsername I reopened #98491, updated it, and assigned it to zwave.

lux4rd0 commented 1 year ago

Thanks for following up, apologies for the delay.

The decision is that we'll only fix the root causes of the integration bugs. (The timeout was meant to be a stopgap, but it doesn't fix the underlying issue, so we'll switch focus to this.)

@lux4rd0 I think we should split this into one issue for Z-Wave and one for Zigbee, and we'll need to get the integration-specific part of the issue spelled out. (It may be already; I have not re-reviewed this issue.)

@allenporter After continuing to try different approaches to the automation issues, it does seem to involve only Z-Wave. Automations that only use Zigbee or other service calls aren't caught up in the "Still Running" issue. I keep a list of node statuses for my Z-Wave devices (Dead, Asleep, Alive, and Unavailable); the hang-ups happen even when all devices show as Asleep or Alive. However, hang-ups in the Z-Wave add-on will eventually snag a few dead nodes. A restart of the Z-Wave add-on brings all of the dead nodes back online. However, every restart of the Z-Wave add-on results in a full restart of HA Core. Even some of my changes to Z-Wave devices will cause HA Core to restart.

Happy to assist in any way useful... Just need to know how to help. Thanks!

alray31 commented 1 year ago

Thanks for following up, apologies for the delay. The decision is that we'll only fix the root causes of the integration bugs. (The timeout was meant to be a stopgap, but it doesn't fix the underlying issue, so we'll switch focus to this.) @lux4rd0 I think we should split this into one issue for Z-Wave and one for Zigbee, and we'll need to get the integration-specific part of the issue spelled out. (It may be already; I have not re-reviewed this issue.)

@allenporter After continuing to try different approaches to the automation issues, it does seem to involve only Z-Wave. Automations that only use Zigbee or other service calls aren't caught up in the "Still Running" issue. I keep a list of node statuses for my Z-Wave devices (Dead, Asleep, Alive, and Unavailable); the hang-ups happen even when all devices show as Asleep or Alive. However, hang-ups in the Z-Wave add-on will eventually snag a few dead nodes. A restart of the Z-Wave add-on brings all of the dead nodes back online. However, every restart of the Z-Wave add-on results in a full restart of HA Core. Even some of my changes to Z-Wave devices will cause HA Core to restart.

Happy to assist in any way useful... Just need to know how to help. Thanks!

Could it be related to the infamous dead-node issue when using Z-Wave JS UI with a 700 series coordinator? Just saying, because I'm under the impression that the automations might hang because of nodes that are temporarily marked as "dead". I can see in my log some errors like "the node did not acknowledge the command" followed by "the node is dead" for the same node. Sometimes it's also "the node could not decode the message" followed by "the node is dead". And I'm also getting the "invalid payload" error every now and then. It looks like these errors match the trigger times of the problematic automations...

steve28 commented 1 year ago

I have a 500 series controller and see it, so it's not limited to the 700 series.

ravimohta commented 1 year ago

Hi, I am having the same problem with automations "still running", and this is happening with Z-Wave lights, or Z-Wave nodes in other words.

I am also trying out various things to see how to fix this issue.

Yesterday I decided to try one such automation to turn on the light when presence is detected in a specific area, and what I did was to set the automation mode to parallel. This did start turning the Z-Wave light on and off correctly, but later on I found there were 2 instances of the automation still running (with a max limit of 10).

Today I tried to find out if there is a way to terminate the running instances of an automation, and apparently, to my knowledge so far, there isn't one.

I did, however, find that instead of turning the lights on within the automation, I can write a script which turns the Z-Wave lights on/off and then call this script from the automation.

The difference is that a script that has been running for a long period of time can apparently be terminated/killed, whereas an automation can't be.

So I plan to try this soon.
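A sketch of that split, with hypothetical names; if I understand the docs correctly, script.turn_on starts the script without waiting for it to finish, whereas calling script.<name> directly as a service waits:

script:
  presence_light_on:
    sequence:
      - service: light.turn_on
        target:
          entity_id: light.zwave_area_light

automation:
  - alias: Presence light on
    trigger:
      - platform: state
        entity_id: binary_sensor.area_presence
        to: "on"
    action:
      - service: script.turn_on
        target:
          entity_id: script.presence_light_on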

This does not resolve the underlying issue with Z-Wave nodes switching between unavailable/dead and other statuses and potentially causing these issues.

Thanks.

allenporter commented 1 year ago

@ravimohta and @steve28 I would suggest filing new issues about the underlying Z-Wave node issue you are seeing. I suggest a new issue because the symptoms may be the same while the root cause may be specific to your setup. I would suggest filling in the details in the new issue template, which includes diagnostics and errors from the logs, or even putting the integration in debug mode to get additional data.

HVR88 commented 1 year ago

I would suggest filing new issues about the underlying issue with zwave nodes

In my experience, filing bugs on Z-Wave leads to some frustration when the reports go stale. HA's Z-Wave is sort of like a bus without a driver. Hopefully other commenters have better luck.

allenporter commented 1 year ago

@HVR88 see https://community.home-assistant.io/t/2023-8-update-breaks-automations-with-mode-restart/598675/162: we have a Z-Wave developer going very far out of their way to provide assistance and asking for issue reports. We've been having better success getting to the bottom of these reports when people actually file their own issues; otherwise we end up conflating totally unrelated symptoms and not getting anywhere.

lux4rd0 commented 1 year ago

@allenporter I am willing to provide any necessary logs. Would you prefer I upload them here or in the community post you directed me to? The community moderators have requested that issues be opened in the same post, so I am unsure about the most effective way to ensure that the right people review the relevant information.

lux4rd0 commented 1 year ago

I updated several of my automations to now include:

continue_on_error: true

alias: Tyle 2nd Floor Day On
description: ""
trigger:
  - platform: state
    entity_id:
      - binary_sensor.tyle_2nd_floor_motion
    to: "on"
condition:
  - condition: state
    state: Home
    entity_id: input_select.mode
action:
  - continue_on_error: true
    service: light.turn_on
    data:
      brightness: 255
    target:
      device_id:
        - ba4b4344b950d8da72baee7faebfda4f
        - 9d7a88f43933517530bcaad51f4277b5
        - be516f24ba9259e72f46d05411da33e8
        - 63f8e9c813765c081ea8a3c6e70ea612
mode: single

Thinking I'd just keep things going while we work on solving the underlying issues. But it actually doesn't address the "Still running" message.

I have Z-Wave debug turned on, but I'm not seeing the specific logs I should be looking for. I also don't have any dead Z-Wave nodes.

steve28 commented 1 year ago

OK, I caught one happening this morning. I have an automation that turns off some lights at sunrise. This morning when I left the house, I noticed they were still on; I examined the automation, and it shows still running.

I had the Z-Wave debug logs running, but not HA's, and there are no errors/warnings in the HA logs at this time. But the Z-Wave logs and automation trace are attached.

Here's what I think is the relevant section (Node 59 is the relay I use for these lights); below are the only two interactions with this node in the log:

2023-08-21T13:17:28.982Z DRIVER » [Node 059] [REQ] [SendData]
                                  │ transmit options: 0x25
                                  │ callback id:      170
                                  └─[MultiChannelCCCommandEncapsulation]
                                    │ source:      0
                                    │ destination: 1
                                    └─[BinarySwitchCCSet]
                                        target value: false
2023-08-21T13:17:28.984Z SERIAL « [ACK]                                                                   (0x06)
2023-08-21T13:17:28.988Z SERIAL « 0x0104011301e8                                                       (6 bytes)
2023-08-21T13:17:28.989Z SERIAL » [ACK]                                                                   (0x06)
2023-08-21T13:17:28.989Z DRIVER « [RES] [SendData]
                                    was sent: true
2023-08-21T13:17:36.407Z SERIAL « 0x01180013aa0102e6007f7f7f7f7f01010700000000020c0e3bf8              (26 bytes)
2023-08-21T13:17:36.408Z SERIAL » [ACK]                                                                   (0x06)
2023-08-21T13:17:36.409Z DRIVER « [REQ] [SendData]
                                    callback id:            170
                                    transmit status:        NoAck, took 7420 ms
                                    routing attempts:       12
                                    protocol & route speed: Z-Wave, 40 kbit/s
                                    TX channel no.:         1
                                    route failed here:      14 -> 59
2023-08-21T13:17:36.413Z SERIAL » 0x010e00133b07600d000125010025aa19                                  (16 bytes)
2023-08-21T13:17:36.413Z DRIVER » [Node 059] [REQ] [SendData]
                                  │ transmit options: 0x25
                                  │ callback id:      170
                                  └─[MultiChannelCCCommandEncapsulation]
                                    │ source:      0
                                    │ destination: 1
                                    └─[BinarySwitchCCSet]
                                        target value: false
2023-08-21T13:17:36.416Z SERIAL « [ACK]                                                                   (0x06)
2023-08-21T13:17:36.420Z SERIAL « 0x0104011301e8                                                       (6 bytes)
2023-08-21T13:17:36.420Z SERIAL » [ACK]                                                                   (0x06)
2023-08-21T13:17:36.420Z DRIVER « [RES] [SendData]
                                    was sent: true
2023-08-21T13:17:39.489Z SERIAL « 0x01180013aa00013302be7f7f7f7f0101053c1000000205103bd5              (26 bytes)
2023-08-21T13:17:39.489Z SERIAL » [ACK]                                                                   (0x06)
2023-08-21T13:17:39.490Z DRIVER « [REQ] [SendData]
                                    callback id:            170
                                    transmit status:        OK, took 3070 ms
                                    repeater node IDs:      60, 16
                                    routing attempts:       5
                                    protocol & route speed: Z-Wave, 40 kbit/s
                                    ACK RSSI:               -66 dBm
                                    ACK RSSI on repeaters:  N/A, N/A
                                    ACK channel no.:        1
                                    TX channel no.:         1
                                    route failed here:      16 -> 59
2023-08-21T13:17:39.491Z DRIVER   all queues idle
2023-08-21T13:17:44.491Z DRIVER   all queues busy
2023-08-21T13:17:44.492Z SERIAL » 0x0103003bc7                                                         (5 bytes)
2023-08-21T13:17:44.492Z DRIVER » [REQ] [GetBackgroundRSSI]
2023-08-21T13:17:44.494Z SERIAL « [ACK]                                                                   (0x06)
2023-08-21T13:17:44.495Z SERIAL « 0x0105013ba7a2c5                                                     (7 bytes)
2023-08-21T13:17:44.495Z SERIAL » [ACK]                                                                   (0x06)
2023-08-21T13:17:44.496Z DRIVER « [RES] [GetBackgroundRSSI]
                                    channel 0: -89 dBm
                                    channel 1: -94 dBm
2023-08-21T13:17:44.497Z DRIVER   all queues idle
2023-08-21T13:18:03.553Z DRIVER   all queues busy
2023-08-21T13:18:03.554Z SERIAL » 0x010f00134d086c019204260100ff25ab0b                                (17 bytes)

It looks like the first Z-Wave attempt failed (NoAck after 7.4 s) and the retry succeeded after another 3 s, so the command took over 10 s in total. Note, I can manually command the lights on/off with no issue.

The automation trace and Z-Wave log are attached: zwavejs_2023-08-21.log trace automation.fence_lights 2023-08-21T13_17_28.975420+00_00.json.log

HVR88 commented 1 year ago

I've removed all Z-Wave entities from the two groups (all lights and all switches) I'm using in one particular automation that shows this issue. I haven't been able to reproduce it with only Zigbee and Shelly devices. Testing includes media player commands to multiple entities, and the order of the actions has also not had an effect (whereas it did when Z-Wave entities were present).

However, the light group still takes some 5+ seconds to execute either ON or OFF, making the entire automation take about 5.9 seconds.

Contrast this with triggering a mass OFF from HomeKit, targeting the same entities individually (no HA groups) plus Z-Wave entities: that's pretty much instantaneous, taking a maximum of about 1 second for up to 2-3x as many entities.

lux4rd0 commented 1 year ago

With 80+ automations for my 50+ Z-wave lights, I feel your pain on how long it takes to turn more than one device on or off with HA. I even tried using Multicast, and it was just as slow. And it's all over the place on timing. Sometimes it's 1.5 seconds, others it might be 9 seconds. Other times it's instant. Hard to tell. However, on this issue, I've decided to just switch every configuration from:

mode: single

to

mode: restart

Sounds like the changes in 2023.7 and 2023.8 with the integration timeouts are now trending in the right direction. Due to the Supervisor timeouts, I still get full reboots of HA every time I restart the Z-Wave JS container. But I don't have to babysit my automations as much by going in and stopping and starting each one in the UI. Is it a Band-Aid? Sure. Are others having the same issues with "restart"? Yes, but it's less of a hassle.

As mentioned before, since I'm the original issue owner, I still have not been asked for debugging or log assistance on this. Happy to recreate the issue with logs, but need some specific direction.

lux4rd0 commented 1 year ago

Two more traces.

"Still running"

trace automation.office_night_on 2023-08-22T18_20_23.428243+00_00.json.zip

And the same automation after it "restarted" and then finished (no change in Z-Wave node status; same automation, same devices):

Stopped because of unknown reason "null" at August 22, 2023 at 1:28:24 PM

trace automation.office_night_on 2023-08-22T18_28_24.393077+00_00.json.zip

allenporter commented 1 year ago

As mentioned before, since I'm the original issue owner, I still have not been asked for debugging or log assistance on this. Happy to recreate the issue with logs, but need some specific direction.

@lux4rd0 I believe you said above that the issue seemed to be for Z-Wave devices? Yes, having logs related to the specific devices that are failing is the right next step. That could include turning up debug logging for Z-Wave, and/or for parts of HA Core such as the script engine, as in this issue, which lets us see exactly what is happening at each automation step and got us to the root cause quickly.

alray31 commented 1 year ago

So 3 days ago, I updated HA Core from 2023.8.2 to 2023.8.3, Z-Wave JS UI from 1.15.5 to 1.15.6, and HA OS from 10.4 to 10.5. For the last 3 days, after these updates were done, I have not had a single Z-Wave automation stuck in the "still running" state again, when before at least two of them were having the problem many times per day. Looks like something changed that solved the problem (at least on my side).

Now I'm afraid to update Core to 2023.8.4 and Z-Wave JS UI to 1.15.7 ;)

And no more "the node did not ack the command" or "the node was unable to decode the message" errors. No changes were made other than the updates...

abustin commented 1 year ago

Just to pile on: I've never had a problem with my HA automations hanging indefinitely, but recent HA/Z-Wave updates have started hanging my automations that run in "single" mode. The automation simply triggers a single scene (that updates a lot of Z-Wave devices). It makes my ZEN32 scene switch useless.

allenporter commented 1 year ago

@abustin @alray31 appreciate the reports, though I'd prefer if you filed a report with full details from the logs, diagnostics, etc., as I was requesting above from the other reporter. There is not enough detail to work with here, so not much is changing yet.

emoses commented 1 year ago

I've been seeing this issue regularly since the .8 upgrade. I tried moving back to .7 and am still seeing the issue, but I gather that the cause is the same. I have over 60 Z-Wave devices and many automations involving them. I caught one of them demonstrating this problem last night in the Z-Wave debug logs.

HA version: 2023.7.3
zwave-js-ui (container) version: 8.22.3

trace script.1633675778464 2023-08-26T06 14 45.303129+00 00.json.txt

ZWaveJS debug logs from same time: zwavejsui-debug-log-2023-08-25T231406.log

allenporter commented 1 year ago

@emoses do you also have errors/logs from the Home Assistant side? I have a hunch that capturing debug logging on the Home Assistant side would also be helpful to see what Home Assistant is doing when communicating with zwavejs.

emoses commented 1 year ago

@allenporter Unfortunately I don't have debug logging enabled and there's nothing interesting in the logs from around that time. Is there a particular logger I should set to debug so I don't have a giant log-flood?

allenporter commented 1 year ago

@emoses You can enable debug logging for specific integrations in the UI which is described in more detail here https://www.home-assistant.io/docs/configuration/troubleshooting/#debug-logs-and-diagnostics
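The configuration.yaml equivalent, scoped so it doesn't flood the log (the logger names below are my best guess at the relevant components):

logger:
  default: warning
  logs:
    homeassistant.components.zwave_js: debug
    homeassistant.components.automation: debug
    homeassistant.components.script: debug
    zwave_js_server: debug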

iDontWantAUsername commented 1 year ago

FYI, I have done some more detailed error reports in the ticket I logged. If this issue is also just Z-Wave, then the tickets could now be merged: https://github.com/home-assistant/core/issues/98491#issuecomment-1694204344

raman325 commented 1 year ago

we have a theory on what's going on: https://github.com/home-assistant-libs/zwave-js-server-python/pull/729

I need to review these comments, there's a lot to go through, but I am hopeful this theory is right

raman325 commented 1 year ago

In my experience, filing bugs on Z-Wave leads to some frustration when the reports go stale. HA's Z-Wave is sort of like a bus without a driver. Hopefully other commenters have better luck.

I hijacked the other thread on the forums to help, but I am going to hijack this one as well for a sec.

I'd call it more like a fleet of buses having a single driver but also an awesome set of dispatchers. There are some really helpful people in #zwave and who respond to bugs here, and they basically help me get to the root of the issue so I can tackle it. But with the volume of stuff, it's hard to filter through the noise (there are a lot of "me too" comments, improperly formatted comments that expect me to decipher pages of information, and theories with no basis), and I have a life outside of HA 🤷🏾‍♂️. I do feel bad that this feeling is out there and wish I could resolve it, but without more people contributing to the integration or the other components, I'm not sure what else can be expected.

This is the exact sentiment that led openzwave to be abandoned. It's unfortunate that people haven't learned the lesson from that experience. But I also get it, people rely on these things and when they break like this, it can be extremely frustrating, especially if you don't feel heard.

Regarding reports going stale, please do comment on them to keep them active. Just to give you some perspective on what happens when an issue like this comes up, here's my inbox right now:

[screenshot of a full notification inbox]

Doesn't include conversations on Discord, doesn't include monitoring the forums. So please, if you can't help with the development, a little more patience and persistence would be helpful, thanks!

PS: There are a ton of people who are involved in zwave, and I don't want to discount any of their efforts. It's just that I tend to be the one to fix the things when they come up because others either don't have the bandwidth or the comfort to do so.

raman325 commented 1 year ago

@iDontWantAUsername it's not just Z-Wave; there are other integrations affected. I'll see if we can find a way to merge these threads into integration-specific ones, because without that it's been hard to follow all of the details and decipher what's Z-Wave-specific and what's not.

mike240se commented 1 year ago

Since this affects so many people and makes HA nearly unusable in certain cases, have the devs considered the ability to turn the 10-second timeout back on manually as a workaround until all the integrations can be fixed?

I had to start using continue_on_error in 2023.7 and didn't know why. In 2023.8 I have the multiple-instances-of-automations issue, but I do NOT use restart; all my automations are single. My issue is caused by a flaky basement Z-Wave device, so it's not easy for me to fix, but if I am running the latest zwave-js, should it be fixed right now with the latest Docker release?

raman325 commented 1 year ago

Since this affects so many people and makes HA nearly unusable in certain cases, have the devs considered the ability to turn the 10-second timeout back on manually as a workaround until all the integrations can be fixed?

I had to start using continue_on_error in 2023.7 and didn't know why. In 2023.8 I have the multiple-instances-of-automations issue, but I do NOT use restart; all my automations are single. My issue is caused by a flaky basement Z-Wave device, so it's not easy for me to fix, but if I am running the latest zwave-js, should it be fixed right now with the latest Docker release?

It's been considered and rejected. The good news is we have a potential fix that will be a part of the beta release on Wednesday

emoses commented 1 year ago

I was able to reproduce it, this time with debug logging on the zwave_js integration: z-ui_2023-08-27.log

zwave_debug_log_2023-08-27.log trace script.1633675778464 2023-08-28T04 39 32.257087+00 00.json.log