RuntimeError: dictionary changed size during iteration

ropg commented 1 week ago

The problem

My Home Assistant (Core 2024.5.1, Supervisor 2024.04.4, Operating System 12.2, Frontend 20240501.0) reports:

Logger: bellows.ezsp
Source: /usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py:496
First occurred: 15:46:44 (1 occurrences)
Last logged: 15:46:44

Exception running handler
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 494, in handle_callback
    handler(*args)
  File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 626, in ezsp_callback_handler
    self.connection_lost(args[0])
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 688, in connection_lost
    self.listener_event("connection_lost", exc)
  File "/usr/local/lib/python3.12/site-packages/zigpy/util.py", line 50, in listener_event
    for listener, include_context in self._listeners.values():
RuntimeError: dictionary changed size during iteration

about once a day now, leading to a loss of all my zigbee lights and switches until I reboot, while causing (or being caused by, IDK) my Zigbee Home Automation integration to report "failed setup, will retry". (It periodically retries and fails, only reboot helps.)

What version of Home Assistant Core has the issue?

2024.5.1

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

maybe ZHA, but error shows up in Core logs

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

home-assistant[bot] commented 1 week ago

Hey there @dmulcahey, @adminiuga, @puddly, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of `zha` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign zha` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

_{^{(message by CodeOwnersMention)}}

zha documentation zha source _{^{(message by IssueLinks)}}

TheJulianJES commented 1 week ago

Thanks. Currently tracking that issue here:

https://github.com/zigpy/zigpy/issues/1393

puddly commented 1 week ago

I think this may be happening during disconnect because HA tasks now eagerly execute, so create_task() with synchronous operations will actually run instantly. This will require some changes in zigpy. We don't expect event listeners to modify the list of listeners like this.

ropg commented 1 week ago

Naturally this needs to be fixed, but as a workaround, the following automation reboots my instance if a particular Zigbee light becomes unavailable for 3 minutes.

alias: Reboot when zigbee goes away
description: ""
trigger:
  - platform: state
    entity_id:
      - light.n1_light
    to: unavailable
    for:
      hours: 0
      minutes: 3
      seconds: 0
condition: []
action:
  - service: hassio.host_reboot
    data: {}
mode: single

(I see it executing about 1-2 times a day at unpredictable and seemingly random times when nothing particular is happening.)

TheJulianJES commented 1 week ago

The error you're seeing shouldn't cause any noticeable issues. You're likely only seeing that because your dongle is disconnecting or crashing.

ropg commented 1 week ago

Whatever happens, I need a host reboot for it to clear, not just a restart of HA. So well possible my dongle crashes, indeed. It's a Nabu Casa SkyConnect:

DEVLINKS: >-
  /dev/serial/by-id/usb-Nabu_Casa_SkyConnect_v1.0_680ca9e8b496ed11aad0c998a7669f5d-if00-port0
  /dev/serial/by-path/platform-fd500000.pcie-pci-0000:01:00.0-usb-0:1.3:1.0-port0
  /dev/serial/by-path/platform-fd500000.pcie-pci-0000:01:00.0-usbv2-0:1.3:1.0-port0
DEVNAME: /dev/ttyUSB0
DEVPATH: >-
  /devices/platform/scb/fd500000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/usb1/1-1/1-1.3/1-1.3:1.0/ttyUSB0/tty/ttyUSB0
ID_BUS: usb
ID_MODEL: SkyConnect_v1.0
ID_MODEL_ENC: SkyConnect\x20v1.0
ID_MODEL_ID: ea60
ID_PATH: platform-fd500000.pcie-pci-0000:01:00.0-usb-0:1.3:1.0
ID_PATH_TAG: platform-fd500000_pcie-pci-0000_01_00_0-usb-0_1_3_1_0
ID_PATH_WITH_USB_REVISION: platform-fd500000.pcie-pci-0000:01:00.0-usbv2-0:1.3:1.0
ID_REVISION: '0100'
ID_SERIAL: Nabu_Casa_SkyConnect_v1.0_680ca9e8b496ed11aad0c998a7669f5d
ID_SERIAL_SHORT: 680ca9e8b496ed11aad0c998a7669f5d
ID_TYPE: generic
ID_USB_DRIVER: cp210x
ID_USB_INTERFACES: ':ff0000:'
ID_USB_INTERFACE_NUM: '00'
ID_USB_MODEL: SkyConnect_v1.0
ID_USB_MODEL_ENC: SkyConnect\x20v1.0
ID_USB_MODEL_ID: ea60
ID_USB_REVISION: '0100'
ID_USB_SERIAL: Nabu_Casa_SkyConnect_v1.0_680ca9e8b496ed11aad0c998a7669f5d
ID_USB_SERIAL_SHORT: 680ca9e8b496ed11aad0c998a7669f5d
ID_USB_TYPE: generic
ID_USB_VENDOR: Nabu_Casa
ID_USB_VENDOR_ENC: Nabu\x20Casa
ID_USB_VENDOR_ID: 10c4
ID_VENDOR: Nabu_Casa
ID_VENDOR_ENC: Nabu\x20Casa
ID_VENDOR_ID: 10c4
MAJOR: '188'
MINOR: '0'
SUBSYSTEM: tty
TAGS: ':systemd:'
USEC_INITIALIZED: '3329682'

puddly commented 1 week ago

Are you running it with normal ZigBee firmware or with multiprotocol?

ropg commented 1 week ago

Multiprotocol. Have no thread devices though, so could turn that off.

puddly commented 1 week ago

Normal ZigBee firmware doesn't crash like this so you'd benefit from migrating off of multi-PAN: https://skyconnect.home-assistant.io/procedures/disable-multiprotocol/

ropg commented 1 week ago

Turned it off, will let you know if this resolves my issue. Still curious why this started happening out of the blue ~ a week ago.

evakq8r commented 4 days ago

Just to add to this, I'm also having the identical issue reported here (also with a Skyconnect) but only flashed for Zigbee duties (not MultiPAN). This has been happening for me for several weeks and is only cleared with a server reboot (HA restart doesn't fix it, VM reboot doesn't fix it, needs a complete server reboot that HA is hosted on).

Sometimes happens 3-4 times a day, sometimes 3-4 times a week. I'll raise a new issue with other logs but it does appear to occur with just the normal Zigbee firmware as well.

puddly commented 4 days ago

HA restart doesn't fix it, VM reboot doesn't fix it, needs a complete server reboot that HA is hosted on

This is really odd. For anything other than multi-PAN, any state is completely reset when you reload just the integration, let alone anything else. Requiring a physical reset of the hardware means that the stick needs to be reset as well. This isn't a known issue with anything other than multi-PAN.

Can you record a debug log of this happening? Could you also include in the log what happens when you reload the integration, when the stick is in this state?

evakq8r commented 4 days ago

Can you record a debug log of this happening?

Zigbee crashed again about 20 minutes ago. Initiated a host restart and now HA is no longer accessible (sigh). Here's the same batch of logs from a few days ago:

First log:

Logger: bellows.uart
Source: /usr/local/lib/python3.12/site-packages/bellows/uart.py:252
First occurred: 01:40:53 (1 occurrences)
Last logged: 01:40:53

Lost serial connection: ConnectionResetError('Failed to transmit ASH frame after 4 retries')

Second log:

Logger: bellows.ezsp
Source: /usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py:496
First occurred: 01:40:53 (1 occurrences)
Last logged: 01:40:53

Exception running handler
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 494, in handle_callback
    handler(*args)
  File "/usr/local/lib/python312/site-packages/bellows/zigbee/application.py", line 626, in ezsp_callback_handler
    self.connection_lost(args[0])
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 688, in connection_lost
    self.listener_event("connection_lost", exc)
  File "/usr/local/lib/python3.12/site-packages/zigpy/util.py", line 50, in listener_event
    for listener, include_context in self._listeners.values():
RuntimeError: dictionary changed size during iteration

Third log:

Logger: homeassistant.util.executor
Source: util/executor.py:30
First occurred: 01:40:54 (2 occurrences)
Last logged: 01:40:55

Thread[SyncWorker_0] is still running at shutdown: File "/usr/local/lib/python3.12/threading.py", line 1030, in _bootstrap self._bootstrap_inner() File "/usr/local/lib/python3.12/threading.py", line 1073, in _bootstrap_inner self.run() File "/usr/local/lib/python3.12/threading.py", line 1010, in run self._target(*self._args, **self._kwargs) File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker work_item.run() File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 673, in flush termios.tcdrain(self.fd)

I'm not near the server at the moment, but will try and get more logs when able. That said, the above sequence of logs occurs every time Zigbee crashes.

ropg commented 4 days ago

So far, my problem seems to have disappeared after flashing my SkyConnect back to Zigbee-only. Not sure whether that means I should close this as it does have others that have remaining issues on the stick with Zigbee-only in the conversation. I'll leave it open for now, but feel free to close.

home-assistant / core