home-assistant / operating-system

:beginner: Home Assistant Operating System
Apache License 2.0
4.8k stars 959 forks source link

Unsupported system - Systemd-Resolved issues on a fresh OdroidM1 install #2578

Closed chielbos closed 1 year ago

chielbos commented 1 year ago

Describe the issue you are experiencing

As already described in https://github.com/home-assistant/core/issues/92429 Petro on discord suggested opening a issue in here;

After fresh install with https://github.com/home-assistant/operating-system/releases/download/10.2/haos_odroid-m1-10.2.img.xz all works fine, until i reboot the system.

After the first reboot i get stuck with "Unsupported system - Systemd-Resolved issues"

What operating system image do you use?

odroid-m1 (Hardkernel ODROID-M1)

What version of Home Assistant Operating System is installed?

core-2023.5.4

Did you upgrade the Operating System.

No

Steps to reproduce the issue

-

Anything in the Supervisor logs that might be useful for us?

See other post in the mentioned link above.

Anything in the Host logs that might be useful for us?

.

System information

System Information

version core-2023.5.4
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.10.11
os_name Linux
os_version 6.1.29
arch aarch64
timezone Europe/Amsterdam
config_dir /config
Home Assistant Cloud logged_in | false -- | -- can_reach_cert_server | failed to load: unreachable can_reach_cloud_auth | failed to load: unreachable can_reach_cloud | failed to load: unreachable
Home Assistant Supervisor host_os | Home Assistant OS 10.2 -- | -- update_channel | stable supervisor_version | supervisor-2023.06.1 agent_version | 1.5.1 docker_version | 23.0.6 disk_total | 13.8 GB disk_used | 4.5 GB healthy | true supported | failed to load: Unsupported board | odroid-m1 supervisor_api | ok version_api | failed to load: unreachable installed_addons | Z-Wave JS (0.1.83), Advanced SSH & Web Terminal (15.0.2)
Dashboards dashboards | 1 -- | -- resources | 0 mode | auto-gen
Recorder oldest_recorder_run | 5 juni 2023 om 17:21 -- | -- current_recorder_run | 6 juni 2023 om 20:48 estimated_db_size | 0.77 MiB database_engine | sqlite database_version | 3.40.1

Additional information

No response

agners commented 1 year ago

@lukassadovsky it seems that Network Manager has issues getting a DHCP lease:

Jun 21 20:18:18 homeassistant NetworkManager[558]: <info>  [1687378698.0321] device (eth0): carrier: link connected
Jun 21 20:18:18 homeassistant NetworkManager[558]: <info>  [1687378698.0329] device (eth0): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
Jun 21 20:18:18 homeassistant NetworkManager[558]: <info>  [1687378698.0349] policy: auto-activating connection 'Wired connection 1' (2ed5f039-a578-34e7-98c5-2ddcf44db399)
Jun 21 20:18:18 homeassistant NetworkManager[558]: <info>  [1687378698.0364] device (eth0): Activation: starting connection 'Wired connection 1' (2ed5f039-a578-34e7-98c5-2ddcf44db399)
Jun 21 20:18:18 homeassistant NetworkManager[558]: <info>  [1687378698.0367] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 20:18:18 homeassistant NetworkManager[558]: <info>  [1687378698.0375] manager: NetworkManager state is now CONNECTING
Jun 21 20:18:18 homeassistant NetworkManager[558]: <info>  [1687378698.0381] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 20:18:18 homeassistant NetworkManager[558]: <info>  [1687378698.0396] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 20:18:18 homeassistant NetworkManager[558]: <info>  [1687378698.0412] dhcp4 (eth0): activation: beginning transaction (timeout in 45 seconds)
Jun 21 20:18:18 homeassistant systemd-resolved[556]: eth0: Bus client set MulticastDNS setting: yes

This is a very suspicious error:

Jun 21 20:18:18 homeassistant kernel: RTL8211F Gigabit Ethernet stmmac-0:00: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!
Jun 21 20:18:18 homeassistant kernel: rk_gmac-dwmac fe2a0000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx

Can you try a different Ethernet cable?

It kinda supports what I've said previously in https://github.com/home-assistant/operating-system/issues/2578#issuecomment-1597185152: This issue is board dependent. A not correctly working Ethernet port would explain the problem.

I wonder if the OS Hardkernel provides (Debian) also shows such problems :thinking:

lukassadovsky commented 1 year ago

@agners I'll try another cable. But if I connect a laptop to the same cable, it works at 1Gbps.

JirikP commented 1 year ago

I can give it a go with usb to RJ45 adaptor. I have couple of them lying around. Will HAOS recognize them out of the box?

agners commented 1 year ago

I can give it a go with usb to RJ45 adaptor. I have couple of them lying around. Will HAOS recognize them out of the box?

Depends on the exact model, I remember I've enabled some in the past. YMMV, definitely wroth trying.

JirikP commented 1 year ago

I can give it a go with usb to RJ45 adaptor. I have couple of them lying around. Will HAOS recognize them out of the box?

Depends on the exact model, I remember I've enabled some in the past. YMMV, definitely wroth trying.

Still the same bug. The dongle works, but that is it.

lukassadovsky commented 1 year ago

Can you try a different Ethernet cable?

Different Ethernet cable works at 1Gbps

regan-a commented 1 year ago

Any theories why the mdns plugin is preventing coredns from starting properly?

gordio commented 1 year ago

Version: haos_odroid-m1-10.3.img.xz I found interesting behaviour. If I boot using nvme (with sd card to start kernel only) (check this method https://community.home-assistant.io/t/add-support-for-odroid-m1/408257/63?u=gordio) bug with dns happened rarely. If use just sd card - always. Is it something related to race? like running dns container before dbus/mounting fs or something like that…? Log of "broken boot": https://dpaste.org/MewXA

chielbos commented 1 year ago

Sigh..... 😮‍💨 - The saga appears to continue over here.

Today, our electrician came over to apply some change's (which required poweroff).

Upon restarting 🪄 the issue came back.

Home Assistant 2023.6.3 Supervisor 2023.06.4 Operating System 11.0.dev20230613 Frontend-versie: 20230608.0 - latest

Any actions i can perform to resolve the issue on my end, or help identifying the problem?

Update; After rebooting 3 times (via ui) out of pure desperation, it works again. It feels like black magic 🧙🏽‍♀️

agners commented 1 year ago

I've made some progress here, this gets a bit technical and is more meant as a note to myself for the records :sweat_smile:

Currently I am booting into an emergency shell with cmdline.txt set to systemd.log_level=debug systemd.unit=emergency.target. Up to this shell my system and the system with problems boot identically. From the logs I've found that udisks2.service was slow to start. Further investigation show that it is dbus-broker.service which initially seems to fail. A very minimal start sequence emulates the problematic start:

systemctl start systemd-journald.socket
systemctl start dbus-broker.service

The second command takes 90s+ to respond on a problematic board, which triggers a systemd timeout:

Jul 04 14:53:53 homeassistant systemd[1]: dbus-broker.service: Main process exited, code=killed, status=9/KILL
Jul 04 14:53:53 homeassistant systemd[1]: dbus-broker.service: Failed with result 'timeout'.

Checking the current syscall of the offending process reveals:

~ # ps -p $(pidof dbus-broker-launch) -ocmd,stat,wchan
CMD                         STAT WCHAN
/usr/bin/dbus-broker-launch Ss   wait_for_random_bytes

The dbus-broker.service gets restarted again and succeeds, but from what I understand this causes udisks2.service to fail and subsequently other services to not correctly start.

I am not clear why my board doesn't show the same problem. My first thought was that #2575 is the culprit, but since this was already a problem with 10.2, and PR #2575 came in with 10.3, this can't really be (unless we trace two independent problems here :thinking: ).

agners commented 1 year ago

Meanwhile I've proven that the entropy pool is the problem indeed. On the problematic board I can see that the entropy pool as reported by the kernel starts around ~50 and grows very slowly:

watch -n 1 cat /proc/sys/kernel/random/entropy_avail

On the working board the entropy pool ramps up much quicker. It still takes about 5-10s to reach 256. At 256 the service dbus-broker.service then launches successful.

Currently it is not clear to me why that is the case. The kernel adds various identifiers (e.g. of SD cards and USB devices) to the entropy pool via add_device_randomness. It could be that the SD card's identifier used doesn't/does not provide as much entropy. However, this is a rather wild guess.

In any case, I am currently working on integrating a driver for the hardware random number generator. With that there will be enough entropy available very quickly, which seems to solve the boot problem.

It might also be sensible to improve the service ordering: In particular, udisks2.service should depend on the dbus.socket. However, it seems that the socket and the dbus-broker.service are not directly linked, so this alone won't resolve the issue.

I was wondering why dbus-broker requires randomness. It seem that the XML library libexpat is at fault. I am not entirly sure why quality entropy is need for an XML parser, but it seems the library really insists on it (see https://github.com/libexpat/libexpat/blob/R_2_5_0/expat/lib/xmlparse.c#L128).

agners commented 1 year ago

For the ordering problem I've opened an issue in the dbus-broker project https://github.com/bus1/dbus-broker/issues/319. Maybe there is a better way to express ordering for this issue.

agners commented 1 year ago

With #2578 and #2633 there are now two fixes for this issue. They will be part of the next 10.4 release. The latest dev version 11.0.dev20230705 already has the first fix (see https://os-builds.home-assistant.io/).

chielbos commented 1 year ago

Hi Angers,

Running the new version for a while now.

Works like a charm! Thanks for putting in the effort!