home-assistant / core

🏡 Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

ZHA Watchdog heartbeat timeout after upgrading to latest HA, making ZHA invalid and all devices unavailable #115927

Closed by HomeAssistantPim 2 months ago

HomeAssistantPim commented 5 months ago

The problem

I upgraded to the latest version and now ZHA crashes frequently. I'm not sure what changed in the upgrade, but this is a serious regression and I'm even considering a rollback. The log shows:

2024-04-21 06:49:02.345 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: TimeoutError()
2024-04-21 06:49:05.557 ERROR (bellows.thread_0) [bellows.uart] Lost serial connection: ConnectionResetError('Failed to transmit ASH frame after 4 retries')
2024-04-21 06:49:05.564 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart
2024-04-21 06:49:07.324 WARNING (bellows.thread_0) [homeassistant.util.executor] Thread[SyncWorker_0] is still running at shutdown: File "/usr/local/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 673, in flush
    termios.tcdrain(self.fd)
2024-04-21 06:49:08.150 WARNING (bellows.thread_0) [homeassistant.util.executor] Thread[SyncWorker_0] is still running at shutdown: File "/usr/local/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 673, in flush
    termios.tcdrain(self.fd)

Unfortunately, the only fix I have found is to restart HA. I am looking for a way to reload ZHA through an automation but haven't found one yet. I will keep searching, because this bug makes all my Zigbee devices, and therefore HA, unusable.
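For anyone looking for the same workaround, here is a minimal sketch of reloading a single config entry without restarting HA. It assumes the `homeassistant.reload_config_entry` service and the REST API are available on your install. The base URL, token, and entry ID are placeholders, and `build_reload_request` and `reload_entry` are hypothetical helper names. This only automates the workaround; it does not address the watchdog timeout itself.

```python
import json
import urllib.request


def build_reload_request(base_url: str, token: str, entry_id: str) -> urllib.request.Request:
    """Build (but do not send) a REST call to homeassistant.reload_config_entry.

    Assumptions: base_url is the HA address, token is a long-lived access
    token, and entry_id is the ID of the ZHA config entry.
    """
    url = f"{base_url.rstrip('/')}/api/services/homeassistant/reload_config_entry"
    body = json.dumps({"entry_id": entry_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def reload_entry(base_url: str, token: str, entry_id: str, timeout: float = 30.0) -> int:
    """Send the reload request and return the HTTP status code."""
    req = build_reload_request(base_url, token, entry_id)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling `reload_entry("http://homeassistant.local:8123", "<token>", "<zha entry id>")` with real values would trigger the reload. Building the request separately makes it easy to inspect before anything is sent.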

What version of Home Assistant Core has the issue?

2024.4.3

What was the last working version of Home Assistant Core?

2024.3.3

What type of installation are you running?

Home Assistant Supervised

Integration causing the issue

Zigbee Home Automation

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Same log excerpt as under "The problem" above.

Additional information

No response

home-assistant[bot] commented 5 months ago

Hey there @dmulcahey, @adminiuga, @puddly, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

puddly commented 5 months ago

Please fill out the issue template completely. Include diagnostics information and a full debug log.

HomeAssistantPim commented 5 months ago

@puddly I'm not sure what you mean by diagnostics information, but attached here is the debug log from initialisation. What stands out to me are the occurrences of: 2024-04-21 20:15:02.104 DEBUG (MainThread) [zigpy.quirks] Fail because input cluster mismatch on at least one endpoint

home-assistant.log
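To put numbers on how often each of these messages shows up in a log like the one attached, a rough sketch such as the following could help. It assumes the record format seen in the excerpts in this thread (timestamp, level, thread, logger, message). `count_events` and the keys in `PATTERNS` are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Record header as seen in this thread, e.g.
# 2024-04-21 06:49:02.345 WARNING (MainThread) [bellows.zigbee.application] message
RECORD = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) \((?P<thread>[^)]*)\) \[(?P<logger>[^\]]+)\] (?P<msg>.*)$"
)

# Substrings taken from the log lines quoted in this issue.
PATTERNS = {
    "watchdog_timeout": "Watchdog heartbeat timeout",
    "serial_lost": "Lost serial connection",
    "ncp_failed": "NCP entered failed state",
    "quirk_mismatch": "input cluster mismatch",
}


def count_events(lines):
    """Count log records whose message contains one of the known substrings."""
    counts = Counter()
    for line in lines:
        match = RECORD.match(line)
        if match is None:
            continue  # traceback continuation lines have no record header
        for name, needle in PATTERNS.items():
            if needle in match.group("msg"):
                counts[name] += 1
    return counts
```

It can be run over a file with `count_events(open("home-assistant.log", encoding="utf-8"))`. Comparing the counts between a 2024.3.3 run and a 2024.4.3 run would show whether the mismatch messages appear only on the newer version.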

HomeAssistantPim commented 5 months ago

I downgraded to 2024.3.3 and so far so good. Will keep you posted.

HomeAssistantPim commented 5 months ago

@puddly my HA is still running well since the downgrade. The log shows no watchdog timeouts and no initialisation problems. I noticed a pull request targeting 2024.4.0 that would set new IDs for all Zigbee devices. It seems related, since the log I posted earlier mentions a mismatch in endpoint IDs. Could this change have caused this?

dmulcahey commented 5 months ago

What PR are you talking about?

HomeAssistantPim commented 5 months ago

This one, although I'm not sure whether it was actually merged into any 2024.4.x release: https://github.com/home-assistant/core/pull/112459

dmulcahey commented 5 months ago

Not merged and unrelated

HomeAssistantPim commented 5 months ago

OK, so I'm currently on 2024.3.3 after the downgrade, and ZHA hasn't had any issues since: no watchdog timeouts and none of the endpoint mismatch messages shown in the log I shared from 2024.4.3. The only thing left I could do is upgrade to 2024.4.3 again to see if the issues recur.

mdeletr2 commented 5 months ago

I also have the same problem after the last update. ZHA stops and keeps trying to reconfigure. If I restart HA it works again. This has happened at least 3 times since the last update.

halukanlar commented 5 months ago

I also had a problem after updating. Restart solved it for now. We'll see if it lasts...

issue-triage-workflows[bot] commented 2 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.