Closed jose1711 closed 4 months ago
Hey there @dmulcahey, @adminiuga, @puddly, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha
) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
zha documentation zha source (message by IssueLinks)
How exactly are you running Home Assistant? What other software (addons, custom components, official integrations) are you running on the device? The specific error you're seeing happens if you run out of file descriptors, which usually suggests some sort of network-connected code exhausting them by leaking descriptors.
Home Assistant is running within a Docker container on a Raspbian system, alongside a few other (smaller) containers. There are no addons (container version does not support its installations) but I am using some custom components installed manually, namely phonetrack
, smartthing_sensors
and zha_toolkit
. The total number of enabled integrations is 26 - notable ones are Bluetooth, Wiz, Google Cast, SmartThings and Google Assistant SDK.
Today I collected another debug log - the error signature is the same as in the original report. Initially zigpy.backups
reports the addition of a new backup, followed by a reset of ASH and subsequent stream of errors.
This time I was able to locate a situation in the same log where HA successfully recover from a similar situation although after a considerable delay:
2024-03-14 15:53:41.161 DEBUG (MainThread) [zigpy.backups] Adding a new backup NetworkBackup(version=1, backup_time=datetime.datetime(2024, 3, 14, 13, 12, 56, 578933, tzinfo=datetime.timezone.utc), network_info=NetworkInfo(..
2024-03-14 15:53:41.428 DEBUG (bellows.thread_0) [zigpy.serial] Opening a serial connection to '/dev/ttyUSB0' (115200 baudrate)
2024-03-14 15:53:41.550 DEBUG (MainThread) [bellows.ezsp] Resetting EZSP
2024-03-14 15:53:41.552 DEBUG (bellows.thread_0) [bellows.uart] Resetting ASH
2024-03-14 15:53:41.553 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'1ac038bc7e'
2024-03-14 15:53:42.655 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
File "/usr/local/lib/python3.12/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.12/site-packages/serial_asyncio_fast/__init__.py", line 132, in _read_ready
data = self._serial.read(self._max_read_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 565, in read
ready, _, _ = select.select([self.fd, self.pipe_abort_read_r], [], [], timeout.time_left())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
(a very long stream of errors)
..
..
2024-03-14 16:22:57.419 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
File "/usr/local/lib/python3.12/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.12/site-packages/serial_asyncio_fast/__init__.py", line 132, in _read_ready
data = self._serial.read(self._max_read_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 565, in read
ready, _, _ = select.select([self.fd, self.pipe_abort_read_r], [], [], timeout.time_left())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
2024-03-14 16:22:57.429 DEBUG (Thread-65) [aiosqlite] executing functools.partial(<built-in method execute of sqlite3.Connection object at 0x52170028>, 'PRAGMA wal_checkpoint;', [])
2024-03-14 16:22:57.430 DEBUG (Thread-65) [aiosqlite] operation functools.partial(<built-in method execute of sqlite3.Connection object at 0x52170028>, 'PRAGMA wal_checkpoint;', []) completed
2024-03-14 16:22:57.433 DEBUG (Thread-65) [aiosqlite] executing functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x4bda7208>)
2024-03-14 16:22:57.434 DEBUG (Thread-65) [aiosqlite] operation functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x4bda7208>) completed
2024-03-14 16:22:57.441 DEBUG (Thread-65) [aiosqlite] executing functools.partial(<built-in method close of sqlite3.Connection object at 0x52170028>)
2024-03-14 16:22:57.458 DEBUG (Thread-65) [aiosqlite] operation functools.partial(<built-in method close of sqlite3.Connection object at 0x52170028>) completed
2024-03-14 16:22:57.476 DEBUG (bellows.thread_0) [bellows.uart] Connection lost: None
2024-03-14 16:22:57.477 DEBUG (bellows.thread_0) [bellows.uart] Closed serial connection
2024-03-14 16:22:57.469 DEBUG (MainThread) [homeassistant.components.zha] Failed to set up ZHA
Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 152, in async_setup_entry
zha_gateway = await ZHAGateway.async_from_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 196, in async_from_config
await instance.async_initialize()
File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 214, in async_initialize
await app.startup(auto_form=True)
File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 233, in startup
await self.connect()
File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 148, in connect
await ezsp.startup_reset()
File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 125, in startup_reset
await self.reset()
File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 151, in reset
await self._gw.reset()
TimeoutError
2024-03-14 16:22:57.483 INFO (MainThread) [universal_silabs_flasher.flasher] Probing ApplicationType.GECKO_BOOTLOADER at 115200 baud
2024-03-14 16:22:57.484 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to '/dev/ttyUSB0' (115200 baudrate)
2024-03-14 16:22:57.500 DEBUG (MainThread) [universal_silabs_flasher.common] Connection made: SerialTransport(<_UnixSelectorEventLoop running=True closed=False debug=False>, <universal_silabs_flasher.gecko_bootloader.GeckoBootloaderProtocol object at 0x5727bbe8>, Serial<id=0x5169abe0, open=True>(port='/dev/ttyUSB0', baudrate=115200, bytesize=8, parity='N', stopbits=1, timeout=0, xonxoff=False, rtscts=False, dsrdtr=False))
2024-03-14 16:22:57.504 DEBUG (MainThread) [universal_silabs_flasher.common] Sending data b'\n'
2024-03-14 16:22:57.505 DEBUG (MainThread) [universal_silabs_flasher.common] Sending data b'3'
2024-03-14 16:22:59.508 INFO (MainThread) [universal_silabs_flasher.flasher] Probing ApplicationType.CPC at 460800 baud
2024-03-14 16:22:59.509 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to '/dev/ttyUSB0' (460800 baudrate)
During this episode HA logs showed that one of the Zigbee devices was seen as unavailable between 15:53:33 and 16:24:50. I will test if waiting at least 30 minutes resolves the problem consistently.
After a restart of HA, the number of netstat -an
entries initially rised but then it stabilized at around 2300-2400 entries. At this moment I don't see any link between a number of open sockets and the failure.
What is a bit strange in the netstat
output however is a number of established UDP connections to one of the Wiz bulbs.
udp 0 0 192.168.1.101:51625 192.168.1.105:38899 ESTABLISHED
udp 0 0 192.168.1.101:44460 192.168.1.105:38899 ESTABLISHED
udp 0 0 192.168.1.101:49581 192.168.1.105:38899 ESTABLISHED
udp 0 0 192.168.1.101:40880 192.168.1.105:38899 ESTABLISHED
udp 0 0 192.168.1.101:53683 192.168.1.105:38899 ESTABLISHED
udp 0 0 192.168.1.101:51635 192.168.1.105:38899 ESTABLISHED
udp 0 0 192.168.1.101:56245 192.168.1.105:38899 ESTABLISHED
udp 0 0 192.168.1.101:51127 192.168.1.105:38899 ESTABLISHED
udp 0 0 192.168.1.101:34242 192.168.1.105:38899 ESTABLISHED
udp 0 0 192.168.1.101:52674 192.168.1.105:38899 ESTABLISHED
udp 0 0 192.168.1.101:51142 192.168.1.105:38899 ESTABLISHED
netstat -an | grep -c 192.168.1.105:
207
I will try to temporarily uninstall the lightbulb to see if the situation improves.
I will test if waiting at least 30 minutes resolves the problem consistently.
Got another failure and it did not recover even after 3 hours so looks like I was just lucky.
I was able to reproduce the issue inside the running HA container with the following code:
#!/usr/bin/env python
import os
import select
portstr = '/dev/ttyUSB0'
fd = os.open(portstr, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
for i in range(4000):
print(i)
pipe_abort_read_r, pipe_abort_read_w = os.pipe()
ready, _, _ = select.select([fd, pipe_abort_read_r], [], [], 0)
The execution always stops at
Traceback (most recent call last):
File "//./test.py", line 11, in <module>
ready, _, _ = select.select([fd, pipe_abort_read_r], [], [], 0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
indicating that the process is unable to open more than 1024 file descriptors. This seems to be in line with the following SO answer https://stackoverflow.com/a/7329415/5674914.
What hardware are you running this on?
What hardware are you running this on?
Raspberry Pi 2, model B rev 1.1 (ARM v7), 1 GB RAM, SD card-backed storage
I don’t believe RPi 2 is supported
Other than this particular problem I am not seeing any issues (including performance problems). What makes you think this could be platform-specific?
Edit: Also despite what HA website is saying, HA OS images for RPi2 are still being generated (https://github.com/home-assistant/operating-system/releases/tag/12.1)
You'll have to figure out what's eating up the file descriptors. Try lsof
, for example.
It looks like there is something fishy going on inside the Wiz integration. The number of sockets (established) for two devices is permanently increasing:
# bulb1
$ grep -c 192.168.1.105: lsof_details_march_19*
lsof_details_march_19_0800.log:70
lsof_details_march_19_1930.log:1116
lsof_details_march_19_2300.log:1420
# bulb2
$ grep -c 192.168.1.102: lsof_details_march_19*
lsof_details_march_19_0800.log:14
lsof_details_march_19_1930.log:315
lsof_details_march_19_2300.log:434
Normally I get a single socket per each bulb so something's not right for those two.
Ever since I disconnected the two devices causing the issue, ZHA integration is no longer causing an issue.
The problem
I am seeing a frequent (once/twice a day) failures with ZHA integration. It says "initializing" but fails to finish. After encountering the issue, I collected a debug log, and just prior to the first relevant error in the log, there is a message regarding a "zigpy" backup. Not sure if it's related but I will enable debug log again to see if the same behaviour recurs.
Related log part (some parts manually redacted)
Full debug log is 300 MB.
What version of Home Assistant Core has the issue?
core-2024.3.0
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant Container
Integration causing the issue
ZHA
Link to integration documentation on our website
https://www.home-assistant.io/integrations/zha/
Diagnostics information
No response
Example YAML snippet
No response
Anything in the logs that might be useful for us?
I have found that
NetworkBackup()
is executed several times prior to failure, here is one which did not cause any problem.Additional information
HW: Raspberry Pi 2, model B rev 1.1 (kernel 6.1.21-v7+) and SkyConnect dongle with latest firmware (7.3.2.0 build 212)