home-assistant / operating-system

:beginner: Home Assistant Operating System
Apache License 2.0
4.85k stars 967 forks source link

Dialout group mapping mismatch HAOS-Core possibly causing Conbee II unable to connect #1878

Closed remygithub closed 2 years ago

remygithub commented 2 years ago

Describe the issue you are experiencing

Neither deCONZ nor ZHA are able to connect to the usb-serial port (Conbee II , /dev/ttyACM0). The numerical GID ownership of the device is 18. Logs show that the Conbee is correctly detected by HAOS but ttyd keeps segfaulting.

Possible wrong GID of dialout on HAOS (below, Ubuntu is my laptop on 20.04): System dialout audio
HAOS 18 29
Core 20 18
Ubuntu 20 29

The usb-serial port (/dev/ttyACM0) is owned by audio on core and dialout on HAOS. Access to the port via group membership (typical method on linux) is denied leading to deCONZ/ZHA not being able to connect (my situation and possibly related to other issues).

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

7.6

Did you upgrade the Operating System.

Yes

Steps to reproduce the issue

starting without deCONZ or ZHA after plugging in the Conbee II stick and the system has correctly identified it

  1. Add integration: deCONZ (or ZHA)
  2. Select the Conbee protocol
  3. Select the associtated usb-serial port

Result: Unable to connect, logs below: The logs show that the USB device is properly discovered and loaded but HA times out trying to connect to it:

[cont-init.d] firmware.sh: executing... [20:51:44] INFO: GCFFlasher V3_17 (c) dresden elektronik ingenieurtechnik gmbh Path | Vendor | Product | Serial | Type -----------------+--------+---------+------------+------- /dev/ttyACM0 | 0x1CF1 | 0x0030 | DE22XXXX | ConBee II ..... 20:52:00:811 DEV no DDF for 0xXXXXXXXXXXX, modelId: lumi.weather 20:52:00:812 DEV create on-the-fly DDF for 0xXXXXXXXXXXX 20:52:00:840 DEV Tick.Init: booted after 8000 seconds 20:52:00:853 dlg action: Read binding table 20:52:00:926 New websocket 127.0.0.1:48642 (state: 3) 20:52:00:928 New websocket 172.30.32.1:53698 (state: 3) [20:52:01] INFO: Successfully send discovery information to Home Assistant. 20:52:01:509 Announced to internet https://phoscon.de/discover 20:52:01:638 COM: /dev/serial/by-id/usb-dresden_elektronik_ingenieurtechnik_GmbH_ConBee_II_DE22XXXX-if00 / serialno: DE22XXXXX, ConBee II 20:52:01:759 Skip idle timer callback, too early: elapsed 946 msec .... 20:52:03:317 device state timeout ignored in state 2 20:52:03:477 device state timeout ignored in state 2 20:52:03:637 device state timeout ignored in state 2 20:52:03:796 device state timeout ignored in state 2 20:52:03:956 device state timeout ignored in state 2 20:52:04:116 device state timeout ignored in state 2 .... 21:29:58:737 start reconnect to network 21:30:03:737 failed to reconnect to network try=1 21:30:05:775 COM: /dev/serial/by-id/usb-dresden_elektronik_ingenieurtechnik_GmbH_ConBee_II_DE22XXXX-if00 / serialno: DE22XXXX, ConBee II 21:30:06:736 Skip idle timer callback, too early: elapsed 941 msec 21:30:07:316 device state timeout ignored in state 2 21:30:07:477 device state timeout ignored in state 2 21:30:07:637 device state timeout ignored in state 2 ...

Anything in the Supervisor logs that might be useful for us?

The logs show that the USB device is properly discovered and loaded but HA times out trying to connect to it:

[cont-init.d] firmware.sh: executing... 
[20:51:44] INFO: GCFFlasher V3_17 (c) dresden elektronik ingenieurtechnik gmbh
Path             | Vendor | Product | Serial     | Type
-----------------+--------+---------+------------+-------
/dev/ttyACM0     | 0x1CF1 | 0x0030  | DE22XXXX  | ConBee II 
.....
20:52:00:811 DEV no DDF for 0xXXXXXXXXXXX, modelId: lumi.weather
20:52:00:812 DEV create on-the-fly DDF for 0xXXXXXXXXXXX
20:52:00:840 DEV Tick.Init: booted after 8000 seconds
20:52:00:853 dlg action: Read binding table
20:52:00:926 New websocket 127.0.0.1:48642 (state: 3) 
20:52:00:928 New websocket 172.30.32.1:53698 (state: 3) 
[20:52:01] INFO: Successfully send discovery information to Home Assistant.
20:52:01:509 Announced to internet https://phoscon.de/discover
20:52:01:638 COM: /dev/serial/by-id/usb-dresden_elektronik_ingenieurtechnik_GmbH_ConBee_II_DE22XXXX-if00 / serialno: DE22XXXXX, ConBee II
20:52:01:759 Skip idle timer callback, too early: elapsed 946 msec
....
20:52:03:317 device state timeout ignored in state 2
20:52:03:477 device state timeout ignored in state 2
20:52:03:637 device state timeout ignored in state 2
20:52:03:796 device state timeout ignored in state 2
20:52:03:956 device state timeout ignored in state 2
20:52:04:116 device state timeout ignored in state 2
....
21:29:58:737 start reconnect to network
21:30:03:737 failed to reconnect to network try=1
21:30:05:775 COM: /dev/serial/by-id/usb-dresden_elektronik_ingenieurtechnik_GmbH_ConBee_II_DE22XXXX-if00 / serialno: DE22XXXX, ConBee II
21:30:06:736 Skip idle timer callback, too early: elapsed 941 msec
21:30:07:316 device state timeout ignored in state 2
21:30:07:477 device state timeout ignored in state 2
21:30:07:637 device state timeout ignored in state 2

Anything in the Host logs that might be useful for us?

Logs also show that the Conbee is correctly detected by HAOS but ttyd keeps segfaulting (screenshot below)

[  387.375767] usb 5-1: New USB device found, idVendor=1cf1, idProduct=0030, bcdDevice= 1.00
[  387.375773] usb 5-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  387.375777] usb 5-1: Product: ConBee II
[  387.375780] usb 5-1: Manufacturer: dresden elektronik ingenieurtechnik GmbH
[  387.375783] usb 5-1: SerialNumber: DEXXXXXX
[  387.393143] cdc_acm 5-1:1.0: ttyACM0: USB ACM device
[  387.396050] usbcore: registered new interface driver cdc_acm
[  387.396053] cdc_acm: USB Abstract Control Model driver for USB modems and ISDN adapters
[ 1111.486762] show_signal_msg: 4 callbacks suppressed
[ 1111.486769] ttyd[5691]: segfault at 7fb8fa297464 ip 000055993d8663a2 sp 00007ffdae39d298 error 4 in ttyd[55993d859000+2b000]
[ 1111.486789] Code: 48 8b 3c 24 e8 fe fb ff ff 48 8b 4c 24 28 64 48 2b 0c 25 28 00 00 00 74 05 e8 7a 35 ff ff 48 83 c4 38 c3 e9 79 ff ff ff 31 c0 <66> 83 bf 44 04 00 00 1f 74 67 f6 87 5d 04 00 00 10 75 5e 48 83 ec
[ 1111.486853] audit: type=1701 audit(1651340411.304:192): auid=4294967295 uid=0 gid=0 ses=4294967295 subj==docker-default (enforce) pid=5691 comm="ttyd" exe="/usr/bin/ttyd" sig=11 res=1

System Health information

System Health

version core-2022.4.7
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.9.9
os_name Linux
os_version 5.10.108
arch x86_64
timezone America/New_York
Home Assistant Community Store GitHub API | ok -- | -- Github API Calls Remaining | 4986 Installed Version | 1.22.0 Stage | running Available Repositories | 1026 Downloaded Repositories | 2
Home Assistant Cloud logged_in | true -- | -- subscription_expiration | May 17, 2022, 8:00 PM relayer_connected | true remote_enabled | true remote_connected | true alexa_enabled | false google_enabled | false remote_server | us-east-1-3.ui.nabu.casa can_reach_cert_server | ok can_reach_cloud_auth | ok can_reach_cloud | ok
Home Assistant Supervisor host_os | Home Assistant OS 7.6 -- | -- update_channel | stable supervisor_version | supervisor-2022.04.0 docker_version | 20.10.9 disk_total | 14.0 GB disk_used | 7.3 GB healthy | true supported | true board | generic-x86-64 supervisor_api | ok version_api | ok installed_addons | File editor (5.3.3), ESPHome (2022.3.1), Signal Messenger (0.59.0), Terminal & SSH (9.3.0), Samba share (9.6.1), InfluxDB (4.4.1), Grafana (7.5.2)
Dashboards dashboards | 1 -- | -- resources | 0 views | 5 mode | storage

Additional information

No response

agners commented 2 years ago

The numerical GID ownership of the device is 18. Logs show that the Conbee is correctly detected by HAOS but ttyd keeps segfaulting.

ttyd is not related to the Conbee at all, its a daemon part of the SSH Add-on which allows to access the shell via browser. I've noticed the segfaults a while ago as well. It seems to be occurring when ttyd exits (e.g. a tab with the Web Terminal open is being closed). The issue is tracked here: https://github.com/home-assistant/addons/issues/2426

Possible wrong GID of dialout on HAOS (below, Ubuntu is my laptop on 20.04):

Agreed, the gid's are mixed between containers and the base system. It is not straight forward to solve the problem as different add-ons use different base OS (Alpine, Debian etc), which come with different user/group setup.

That said, currently it doesn't really matter as almost all add-ons run as root inside the container. So permission shouldn't matter.

To me it looks like opening the serial port succeeds without permission issue etc. The "device state timeout" doesn't seem to be a permission issue to me.

20:52:01:509 Announced to internet https://phoscon.de/discover
20:52:01:638 COM: /dev/serial/by-id/usb-dresden_elektronik_ingenieurtechnik_GmbH_ConBee_II_DE22XXXX-if00 / serialno: DE22XXXXX, ConBee II
20:52:01:759 Skip idle timer callback, too early: elapsed 946 msec
...
20:52:03:796 device state timeout ignored in state 2
remygithub commented 2 years ago

Agreed, the gid's are mixed between containers and the base system. It is not straight forward to solve the problem as different add-ons use different base OS (Alpine, Debian etc), which come with different user/group setup.

Wouldn't an appropriately modified /etc/group in the overlay for buildroot solve the UID/GID mismatch between the containers and the base system? Just a thought.

That said, currently it doesn't really matter as almost all add-ons run as root inside the container. So permission shouldn't matter.

That sounds quite scary if the container itself runs as root on the host... But thank you for finally clarifying that! BTW, using deconz in a regular Ubuntu system requires dialout membership.

To me it looks like opening the serial port succeeds without permission issue etc. The "device state timeout" doesn't seem to be a permission issue to me.

Any ideas of what to try next? I'm thinking of installing deconz on a regular system to check if the Conbee II device is the problem. Up to know I was convinced the device was fine since, as you also noticed, it was discovered and opened properly by HAOS.

agners commented 2 years ago

Wouldn't an appropriately modified /etc/group in the overlay for buildroot solve the UID/GID mismatch between the containers and the base system? Just a thought.

Agreed, for dialout it works for at least those two containers. As your list shows, different containers can use different IDs..

That sounds quite scary if the container itself runs as root on the host...

It doesn't has "root" full capabilities, and the container are isolated on various levels. But yeah it is user id 0, which has implications.

In the end I think of HAOS more like a router: It's a single purpose machine, and it needs certain rights to get the job done. But yeah, more isolation helps to improve security... Patches welcome :)

Any ideas of what to try next? I'm thinking of installing deconz on a regular system to check if the Conbee II device is the problem. Up to know I was convinced the device was fine since, as you also noticed, it was discovered and opened properly by HAOS.

Yeah not sure what that could be. Did that used to work on older HAOS version? Can you try 7.4?

ha os update --version=7.4
remygithub commented 2 years ago

Just tried it and having the same issue. Logs below.

Core Log

After reboot. Somehow ZHA is semi-configured because, normally, it does not appear in the integrations. After reboot, it pops up in the list of integrations suggesting adding a new device.

2022-05-02 19:56:23 WARNING (MainThread) [zigpy_deconz.api] No response to 'Command.device_state' command with seq id '0x02'
2022-05-02 19:56:25 WARNING (MainThread) [zigpy_zigate.api] No response to command 0x0002
2022-05-02 19:56:25 WARNING (MainThread) [zigpy_zigate.api] Retry command 0x0002
2022-05-02 19:56:27 WARNING (MainThread) [zigpy_xbee.api] at: No response to AP command

...after trying to add ZHA integration...

2022-05-02 20:09:00 WARNING (MainThread) [zigpy_deconz.api] No response to 'Command.device_state' command with seq id '0x02'
2022-05-02 20:09:01 WARNING (MainThread) [zigpy_zigate.api] No response to command 0x0002
2022-05-02 20:09:01 WARNING (MainThread) [zigpy_zigate.api] Retry command 0x0002
2022-05-02 20:09:04 WARNING (MainThread) [zigpy_xbee.api] at: No response to AP command
2022-05-02 20:09:43 WARNING (MainThread) [zigpy_deconz.api] No response to 'Command.device_state' command with seq id '0x02'

Host Log

Relevant lines

[    1.690607] usb 5-1: New USB device found, idVendor=1cf1, idProduct=0030, bcdDevice= 1.00
[    1.693037] usb 5-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    1.695436] usb 5-1: Product: ConBee II
[    1.697822] usb 5-1: Manufacturer: dresden elektronik ingenieurtechnik GmbH
[    1.700251] usb 5-1: SerialNumber: DEXXXXX
...
[    2.631704] cdc_acm 5-1:1.0: ttyACM0: USB ACM device
[    2.640181] usbcore: registered new interface driver cdc_acm
[    2.642541] cdc_acm: USB Abstract Control Model driver for USB modems and ISDN adapters
agners commented 2 years ago

Wait I am confused, are you using the deCONZ Add-on or ZHA? You cannot use both at the same time, those two integrations conflict.

remygithub commented 2 years ago

I disabled deCONZ before trying to install ZHA for the first time. After a few unsuccessful tries to install ZHA I uninstalled deCONZ, so it was not on the system (I hope/think -- see below) when I ran the test yesterday. I was never able to install ZHA completely since it always errors out trying to connect to the port. So, normally, it doesn't appear in my list of integrations, just on reboots, I think.

Not sure how to uninstall whatever ZHA bits are on the system at this point. Or how to check that deCONZ is not there for sure.

I looked again at my backups and I have a partial of core_2022.3.8. I am pretty sure that zigbee was fine with that version of the core and deCONZ. I tried to restore it but it errors out: 22-05-03 21:53:29 ERROR (MainThread) [supervisor.backups.manager] 299bb7a9 is only a partial backup!

Are partials diffs so I would have to restore the full sequence? Or are they diffs to the nominal state of the system as installed from the repo? Could not find info on that.

Or do you think I should just try ha core update --version 2022.3.8

agners commented 2 years ago

22-05-03 21:53:29 ERROR (MainThread) [supervisor.backups.manager] 299bb7a9 is only a partial backup!

How do you restore? Through the UI? It seems that the Supervisor tries to restore it as a full backup when it fact its a partial backup.

I am lost what the problem is now. When you uinstalled deCONZ, all the configuration/settings will be lost, so a non working Zigbee is kinda expected....

However, if you have ZHA properly disabled (and restarted Core after that, just to make sure its really disabled), then deCONZ should work as before.

Did you try a clean deCONZ install, configure the add-on to use the deCONZ stick, start the add-on and pair a new device?

The logs above (specifically device state timeout ignored in state 2) doesn't look like a OS issue to me. It seems deCONZ runs and can communicate with the USB stick. I'd recommend checking with the deCONZ community if they have an idea what could be wrong.

remygithub commented 2 years ago

However, if you have ZHA properly disabled (and restarted Core after that, just to make sure its really disabled), then deCONZ should work as before.

I see no way to disable ZHA in the UI! I did try to add it several times but it always timed out trying to talk to the Conbee II. So maybe it is partially installed?!? It only appears when restarting core and says "new devices discovered". The core log shows that it always loads the component on startup: 2022-05-04 20:21:52 INFO (SyncWorker_4) [homeassistant.loader] Loaded zha from homeassistant.components.zha Again, I uninstalled the deconz add-on before doing these tests.

The logs above (specifically device state timeout ignored in state 2) doesn't look like a OS issue to me. It seems deCONZ runs and can communicate with the USB stick. I'd recommend checking with the deCONZ community if they have an idea what could be wrong.

I uninstalled the deCONZ add-on prior to testing this (see above).

I submitted a bug on core almost 3wks ago regarding my Conbee/deconz access problems but no solution yet. Will try again in parallel to this.

I tend to agree that, since the permissions are not an issue, it's unlikely that this is an OS issue. That said, I was wondering if I could get your feedback on the logs resulting from my tries below -- that's what you get for being responsive and helpful :smiley: :1st_place_medal:

I tried the following yesterday:

  1. downgraded core to 3.8
  2. restored the core_3.8 partial backup I had
  3. rebuild core
  4. enabled all related debug I could think of in config (pydeconz, homeassistant.components.deconz, homeassistant.components.zha, zigpy, zigpy_deconz.zigbee.application, zigpy_deconz.api, zhaquirks)
  5. Unplugged the Conbee, shutdown the host and plugged the Conbee back in a different USB3 port for more power
  6. start the host
  7. when ZHA presented "New devices found => Configure" card in the Devices and Integrations panel of the UI I clicked configure then submit,

Here is the resulting relevant part of the core log:

2022-05-04 22:40:16 DEBUG (MainThread) [zigpy_deconz.api] Command Command.device_state (0, 0, 0)
2022-05-04 22:40:18 WARNING (MainThread) [zigpy_deconz.api] No response to 'Command.device_state' command with seq id '0x02'
2022-05-04 22:40:18 DEBUG (MainThread) [zigpy_deconz.api] Unsuccessful radio probe of '/dev/serial/by-id/usb-dresden_elektronik_ingenieurtechnik_GmbH_ConBee_II_DE2217157-if00' port
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/zigpy_deconz/api.py", line 399, in probe
    await asyncio.wait_for(api._probe(), timeout=PROBE_TIMEOUT)
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.9/site-packages/zigpy_deconz/api.py", line 415, in _probe
    await self.device_state()
  File "/usr/local/lib/python3.9/site-packages/zigpy_deconz/api.py", line 313, in _command
    return await asyncio.wait_for(fut, timeout=COMMAND_TIMEOUT)
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-05-04 22:40:20 WARNING (MainThread) [zigpy_zigate.api] No response to command 0x0002
2022-05-04 22:40:20 WARNING (MainThread) [zigpy_zigate.api] Retry command 0x0002
2022-05-04 22:40:22 WARNING (MainThread) [zigpy_xbee.api] at: No response to AP command
...
2022-05-04 23:00:51 DEBUG (MainThread) [zigpy.appdb] SQLite version for <module 'sqlite3' from '/usr/local/lib/python3.9/sqlite3/__init__.py'>: 3.34.1
...
2022-05-05 02:32:22 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [140083354729248] Received invalid command: subscribe_entities

NB: I still have the zigbee related entities in home assistant (aqara temp humidty sensors) hoping to reuse that when reconnecting via zigbee. I hope that's not a problem. Maybe that's the last ERROR in the log shown above.

Found some mentions in the forums about Unsuccessful radio probe being solved by reflashing the firmware so I am going to try that next. If you see anything else that grabs your attention please let me know.

remygithub commented 2 years ago

I looks like my device might be broken.

@agners Thanks a lot for your help so far -- definitely learned a lot about HA!