home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
71.14k stars 29.81k forks source link

ZHA integration and SkyConnect - ValueError: filedescriptor out of range in select() #113244

Closed jose1711 closed 4 months ago

jose1711 commented 6 months ago

The problem

I am seeing a frequent (once/twice a day) failures with ZHA integration. It says "initializing" but fails to finish. After encountering the issue, I collected a debug log, and just prior to the first relevant error in the log, there is a message regarding a "zigpy" backup. Not sure if it's related but I will enable debug log again to see if the same behaviour recurs.

Related log part (some parts manually redacted)

..
..
2024-03-13 17:33:23.399 DEBUG (Thread-24) [aiosqlite] operation functools.partial(<built-in method fetchmany of sqlite3.Cursor object at 0x54184628>, 64) completed
2024-03-13 17:33:23.403 DEBUG (Thread-24) [aiosqlite] executing functools.partial(<built-in method fetchmany of sqlite3.Cursor object at 0x54184628>, 64)
2024-03-13 17:33:23.404 DEBUG (Thread-24) [aiosqlite] operation functools.partial(<built-in method fetchmany of sqlite3.Cursor object at 0x54184628>, 64) completed
2024-03-13 17:33:23.408 DEBUG (Thread-24) [aiosqlite] executing functools.partial(<built-in method close of sqlite3.Cursor object at 0x54184628>)
2024-03-13 17:33:23.409 DEBUG (Thread-24) [aiosqlite] operation functools.partial(<built-in method close of sqlite3.Cursor object at 0x54184628>) completed
2024-03-13 17:33:23.412 DEBUG (MainThread) [zigpy.backups] Adding a new backup NetworkBackup(version=1, backup_time=datetime.datetime(2024, 3, 13, 13, 45, 27, 805582, tzinfo=datetime.timezone.utc), network_info=NetworkInfo(extended_pan_id=REDACTED, pan_id=0x1551, nwk_update_id=0, nwk_manager_id=0x0000, channel=20, channel_mask=<Channels.ALL_CHANNELS: 134215680>, security_level=5, network_key=Key(key=REDACTED tx_counter=1867791, rx_counter=0, seq=0, partner_ieee=ff:ff:ff:ff:ff:ff:ff:ff), tc_link_key=Key(key=REDACTED, tx_counter=266240, rx_counter=0, seq=0, partner_ieee=REDACTED), key_table=[], children=[], nwk_addresses={}, stack_specific={'ezsp': {'hashed_tclk': 'REDACTED'}}, metadata={'ezsp': {'stack_version': 12, 'can_burn_userdata_custom_eui64': True, 'can_rewrite_custom_eui64': True}}, source='bellows@0.38.1'), node_info=NodeInfo(nwk=0x0000, ieee=REDACTED, logical_type=<LogicalType.Coordinator: 0>, model='SkyConnect v1.0', manufacturer='Nabu Casa', version='7.3.2.0 build 212'))
2024-03-13 17:33:23.427 DEBUG (bellows.thread_0) [zigpy.serial] Opening a serial connection to '/dev/ttyUSB0' (115200 baudrate)
2024-03-13 17:33:23.479 DEBUG (MainThread) [bellows.ezsp] Resetting EZSP
2024-03-13 17:33:23.491 DEBUG (bellows.thread_0) [bellows.uart] Resetting ASH
2024-03-13 17:33:23.492 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'1ac038bc7e'
2024-03-13 17:33:24.593 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.12/site-packages/serial_asyncio_fast/__init__.py", line 132, in _read_ready
    data = self._serial.read(self._max_read_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 565, in read
    ready, _, _ = select.select([self.fd, self.pipe_abort_read_r], [], [], timeout.time_left())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
2024-03-13 17:33:24.644 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.12/site-packages/serial_asyncio_fast/__init__.py", line 132, in _read_ready
    data = self._serial.read(self._max_read_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 565, in read
    ready, _, _ = select.select([self.fd, self.pipe_abort_read_r], [], [], timeout.time_left())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
# same error repeats many times

Full debug log is 300 MB.

What version of Home Assistant Core has the issue?

core-2024.3.0

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

ZHA

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

I have found that NetworkBackup() is executed several times prior to failure, here is one which did not cause any problem.

2024-03-12 17:35:47.538 DEBUG (Thread-11) [aiosqlite] executing functools.partial(<built-in method close of sqlite3.Cursor object at 0x5a6401e8>)
2024-03-12 17:35:47.538 DEBUG (Thread-11) [aiosqlite] operation functools.partial(<built-in method close of sqlite3.Cursor object at 0x5a6401e8>) completed
2024-03-12 17:35:47.542 DEBUG (MainThread) [zigpy.backups] Adding a new backup NetworkBackup(version=1, backup_time=datetime.datetime(2024, 3, 12, 14, 25, 4, 192084, tzinfo=datetime.timezone.utc), network_info=NetworkInfo(extended_pan_id=02:0a:c5:8e:1a:fd:4b:cf, pan_id=0x1551, nwk_update_id=0, nwk_manager_id=0x0000, channel=20, channel_mask=<Channels.ALL_CHANNELS: 134215680>, security_level=5, network_key=Key(key=REDACTED, tx_counter=1810452, rx_counter=0, seq=0, partner_ieee=ff:ff:ff:ff:ff:ff:ff:ff), tc_link_key=Key(key=REDACTED, tx_counter=245760, rx_counter=0, seq=0, partner_ieee=REDACTED), key_table=[], children=[], nwk_addresses={}, stack_specific={'ezsp': {'hashed_tclk': 'REDACTED'}}, metadata={'ezsp': {'stack_version': 12, 'can_burn_userdata_custom_eui64': True, 'can_rewrite_custom_eui64': True}}, source='bellows@0.38.1'), node_info=NodeInfo(nwk=0x0000, ieee=REDACTED, logical_type=<LogicalType.Coordinator: 0>, model='SkyConnect v1.0', manufacturer='Nabu Casa', version='7.3.2.0 build 212'))
2024-03-12 17:35:47.562 DEBUG (bellows.thread_0) [zigpy.serial] Opening a serial connection to '/dev/ttyUSB0' (115200 baudrate)
2024-03-12 17:35:47.596 DEBUG (MainThread) [bellows.ezsp] Resetting EZSP
2024-03-12 17:35:47.601 DEBUG (bellows.thread_0) [bellows.uart] Resetting ASH
2024-03-12 17:35:47.602 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'1ac038bc7e'
2024-03-12 17:35:48.703 DEBUG (bellows.thread_0) [bellows.uart] RSTACK Version: 2 Reason: RESET_SOFTWARE frame: b'c1020b0a527e'
2024-03-12 17:35:48.706 DEBUG (MainThread) [bellows.ezsp] Switching to EZSP protocol version 4
2024-03-12 17:35:48.707 DEBUG (MainThread) [bellows.ezsp.protocol] Send command version: (4,)
2024-03-12 17:35:48.709 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'004221a850ed2c7e'
2024-03-12 17:35:48.717 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'0142a1a8582835c1abb17e'

Additional information

HW: Raspberry Pi 2, model B rev 1.1 (kernel 6.1.21-v7+) and SkyConnect dongle with latest firmware (7.3.2.0 build 212)

home-assistant[bot] commented 6 months ago

Hey there @dmulcahey, @adminiuga, @puddly, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `zha` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign zha` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


zha documentation zha source (message by IssueLinks)

puddly commented 6 months ago

How exactly are you running Home Assistant? What other software (addons, custom components, official integrations) are you running on the device? The specific error you're seeing happens if you run out of file descriptors, which usually suggests some sort of network-connected code exhausting them by leaking descriptors.

jose1711 commented 6 months ago

Home Assistant is running within a Docker container on a Raspbian system, alongside a few other (smaller) containers. There are no addons (container version does not support its installations) but I am using some custom components installed manually, namely phonetrack, smartthing_sensors and zha_toolkit. The total number of enabled integrations is 26 - notable ones are Bluetooth, Wiz, Google Cast, SmartThings and Google Assistant SDK.

Today I collected another debug log - the error signature is the same as in the original report. Initially zigpy.backups reports the addition of a new backup, followed by a reset of ASH and subsequent stream of errors.

This time I was able to locate a situation in the same log where HA successfully recover from a similar situation although after a considerable delay:

2024-03-14 15:53:41.161 DEBUG (MainThread) [zigpy.backups] Adding a new backup NetworkBackup(version=1, backup_time=datetime.datetime(2024, 3, 14, 13, 12, 56, 578933, tzinfo=datetime.timezone.utc), network_info=NetworkInfo(..
2024-03-14 15:53:41.428 DEBUG (bellows.thread_0) [zigpy.serial] Opening a serial connection to '/dev/ttyUSB0' (115200 baudrate)
2024-03-14 15:53:41.550 DEBUG (MainThread) [bellows.ezsp] Resetting EZSP
2024-03-14 15:53:41.552 DEBUG (bellows.thread_0) [bellows.uart] Resetting ASH
2024-03-14 15:53:41.553 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'1ac038bc7e'
2024-03-14 15:53:42.655 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.12/site-packages/serial_asyncio_fast/__init__.py", line 132, in _read_ready
    data = self._serial.read(self._max_read_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 565, in read
    ready, _, _ = select.select([self.fd, self.pipe_abort_read_r], [], [], timeout.time_left())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()

(a very long stream of errors)
..
..
2024-03-14 16:22:57.419 ERROR (bellows.thread_0) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.12/site-packages/serial_asyncio_fast/__init__.py", line 132, in _read_ready
    data = self._serial.read(self._max_read_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/serial/serialposix.py", line 565, in read
    ready, _, _ = select.select([self.fd, self.pipe_abort_read_r], [], [], timeout.time_left())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
2024-03-14 16:22:57.429 DEBUG (Thread-65) [aiosqlite] executing functools.partial(<built-in method execute of sqlite3.Connection object at 0x52170028>, 'PRAGMA wal_checkpoint;', [])
2024-03-14 16:22:57.430 DEBUG (Thread-65) [aiosqlite] operation functools.partial(<built-in method execute of sqlite3.Connection object at 0x52170028>, 'PRAGMA wal_checkpoint;', []) completed
2024-03-14 16:22:57.433 DEBUG (Thread-65) [aiosqlite] executing functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x4bda7208>)
2024-03-14 16:22:57.434 DEBUG (Thread-65) [aiosqlite] operation functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x4bda7208>) completed
2024-03-14 16:22:57.441 DEBUG (Thread-65) [aiosqlite] executing functools.partial(<built-in method close of sqlite3.Connection object at 0x52170028>)
2024-03-14 16:22:57.458 DEBUG (Thread-65) [aiosqlite] operation functools.partial(<built-in method close of sqlite3.Connection object at 0x52170028>) completed
2024-03-14 16:22:57.476 DEBUG (bellows.thread_0) [bellows.uart] Connection lost: None
2024-03-14 16:22:57.477 DEBUG (bellows.thread_0) [bellows.uart] Closed serial connection
2024-03-14 16:22:57.469 DEBUG (MainThread) [homeassistant.components.zha] Failed to set up ZHA
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 152, in async_setup_entry
    zha_gateway = await ZHAGateway.async_from_config(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 196, in async_from_config
    await instance.async_initialize()
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 214, in async_initialize
    await app.startup(auto_form=True)
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 233, in startup
    await self.connect()
  File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 148, in connect
    await ezsp.startup_reset()
  File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 125, in startup_reset
    await self.reset()
  File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 151, in reset
    await self._gw.reset()
TimeoutError
2024-03-14 16:22:57.483 INFO (MainThread) [universal_silabs_flasher.flasher] Probing ApplicationType.GECKO_BOOTLOADER at 115200 baud
2024-03-14 16:22:57.484 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to '/dev/ttyUSB0' (115200 baudrate)
2024-03-14 16:22:57.500 DEBUG (MainThread) [universal_silabs_flasher.common] Connection made: SerialTransport(<_UnixSelectorEventLoop running=True closed=False debug=False>, <universal_silabs_flasher.gecko_bootloader.GeckoBootloaderProtocol object at 0x5727bbe8>, Serial<id=0x5169abe0, open=True>(port='/dev/ttyUSB0', baudrate=115200, bytesize=8, parity='N', stopbits=1, timeout=0, xonxoff=False, rtscts=False, dsrdtr=False))
2024-03-14 16:22:57.504 DEBUG (MainThread) [universal_silabs_flasher.common] Sending data b'\n'
2024-03-14 16:22:57.505 DEBUG (MainThread) [universal_silabs_flasher.common] Sending data b'3'
2024-03-14 16:22:59.508 INFO (MainThread) [universal_silabs_flasher.flasher] Probing ApplicationType.CPC at 460800 baud
2024-03-14 16:22:59.509 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to '/dev/ttyUSB0' (460800 baudrate)

During this episode HA logs showed that one of the Zigbee devices was seen as unavailable between 15:53:33 and 16:24:50. I will test if waiting at least 30 minutes resolves the problem consistently.

After a restart of HA, the number of netstat -an entries initially rised but then it stabilized at around 2300-2400 entries. At this moment I don't see any link between a number of open sockets and the failure.

What is a bit strange in the netstat output however is a number of established UDP connections to one of the Wiz bulbs.

udp        0      0 192.168.1.101:51625     192.168.1.105:38899     ESTABLISHED
udp        0      0 192.168.1.101:44460     192.168.1.105:38899     ESTABLISHED
udp        0      0 192.168.1.101:49581     192.168.1.105:38899     ESTABLISHED
udp        0      0 192.168.1.101:40880     192.168.1.105:38899     ESTABLISHED
udp        0      0 192.168.1.101:53683     192.168.1.105:38899     ESTABLISHED
udp        0      0 192.168.1.101:51635     192.168.1.105:38899     ESTABLISHED
udp        0      0 192.168.1.101:56245     192.168.1.105:38899     ESTABLISHED
udp        0      0 192.168.1.101:51127     192.168.1.105:38899     ESTABLISHED
udp        0      0 192.168.1.101:34242     192.168.1.105:38899     ESTABLISHED
udp        0      0 192.168.1.101:52674     192.168.1.105:38899     ESTABLISHED
udp        0      0 192.168.1.101:51142     192.168.1.105:38899     ESTABLISHED
netstat -an | grep -c 192.168.1.105:
207

I will try to temporarily uninstall the lightbulb to see if the situation improves.

jose1711 commented 6 months ago

I will test if waiting at least 30 minutes resolves the problem consistently.

Got another failure and it did not recover even after 3 hours so looks like I was just lucky.

jose1711 commented 6 months ago

I was able to reproduce the issue inside the running HA container with the following code:

#!/usr/bin/env python
import os
import select
portstr = '/dev/ttyUSB0'

fd = os.open(portstr, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)

for i in range(4000):
    print(i)
    pipe_abort_read_r, pipe_abort_read_w = os.pipe()
    ready, _, _ = select.select([fd, pipe_abort_read_r], [], [], 0)

The execution always stops at

Traceback (most recent call last):
  File "//./test.py", line 11, in <module>
    ready, _, _ = select.select([fd, pipe_abort_read_r], [], [], 0)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()

indicating that the process is unable to open more than 1024 file descriptors. This seems to be in line with the following SO answer https://stackoverflow.com/a/7329415/5674914.

dmulcahey commented 6 months ago

What hardware are you running this on?

jose1711 commented 6 months ago

What hardware are you running this on?

Raspberry Pi 2, model B rev 1.1 (ARM v7), 1 GB RAM, SD card-backed storage

dmulcahey commented 6 months ago

I don’t believe RPi 2 is supported

jose1711 commented 6 months ago

Other than this particular problem I am not seeing any issues (including performance problems). What makes you think this could be platform-specific?

Edit: Also despite what HA website is saying, HA OS images for RPi2 are still being generated (https://github.com/home-assistant/operating-system/releases/tag/12.1)

puddly commented 6 months ago

You'll have to figure out what's eating up the file descriptors. Try lsof, for example.

jose1711 commented 5 months ago

It looks like there is something fishy going on inside the Wiz integration. The number of sockets (established) for two devices is permanently increasing:

# bulb1
$ grep -c 192.168.1.105: lsof_details_march_19*
lsof_details_march_19_0800.log:70
lsof_details_march_19_1930.log:1116
lsof_details_march_19_2300.log:1420

# bulb2
$ grep -c 192.168.1.102: lsof_details_march_19*
lsof_details_march_19_0800.log:14
lsof_details_march_19_1930.log:315
lsof_details_march_19_2300.log:434

Normally I get a single socket per each bulb so something's not right for those two.

jose1711 commented 4 months ago

Ever since I disconnected the two devices causing the issue, ZHA integration is no longer causing an issue.