home-assistant / supervised-installer

Installer for a generic Linux system
Apache License 2.0
1.73k stars 588 forks source link

Bug Report: Supervised system is reported as Unhealthy system - not privileged every time after supervisor upgrade. #342

Closed Lockszmith-GH closed 4 months ago

Lockszmith-GH commented 11 months ago

OS Version

Debian GNU/Linux 12 (bookworm)

System Information

Linux ha 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux

What happened?

Whenever the supervisor is updated, the Unhealthy system - Not privileged. Resolving this is easy, all I need to do is to restart the supervisor service from the OS prompt - I actually have a cron job that does it one a day.

I've been reading the threads, and I was not able to pinpointed the root cause of this issue.

This bug report has 2 goals:

  1. If there is a known root cause, one that I can test for, idneitfy it and remove it - I would love to know.
  2. I would really want a sensor stating whether the system is healthy or not, then I can automate log collection, and probably workaround the issue, also be able to pinpoint the exact set of curcumstances this is happening.

The only method I have right now is:

docker logs hassio_supervisor 2>&1 \
| grep -q "System is running in an unhealthy state and needs manual intervention!" \
&& sudo systemctl restart hassio-supervisor.service

It works, but very crude. I put it in a cron job as this has become really annoying.

Machine Type

generic-x86-64

Installer output

N/A

Relevant log output

Dec 19 10:05:54 ha hassio_supervisor[621]: 23-12-19 10:05:54 CRITICAL (MainThread) [supervisor.core] System is running in an unhealthy state and needs manual intervention!

ADR

Code of Conduct

ikifar2012 commented 11 months ago

The main issue this is being tracked under is https://github.com/home-assistant/supervisor/issues/4381 it appears to happen only under bookworm and no root cause has been identified yet, the only thing I can say for certain is it seems to happen whenever the container is recreated and resolves itself with a restart, until the next supervisor update when the container is recreated yet again.

Lockszmith-GH commented 10 months ago

thanks for confirming @ikifar2012.

So it seems, at least at this point, I would ask the maintainers for a system state that will report this unhealthy state - something that I can send an alert on, and track.

donaldguy commented 9 months ago

See home-assistant/supervisor#4839 for my investigation of what concerning-things seems to be happening here

donaldguy commented 9 months ago

The "short" version of that is

after a reboot with an existing hassio_supervisor container:

$ uptime
 18:44:13 up 1 min,  1 user,  load average: 1.65, 0.75, 0.28
$ docker logs hassio_supervisor 2>&1 | tail -2
24-01-25 18:44:04 INFO (MainThread) [supervisor.resolution.fixup] System autofix complete
24-01-25 18:44:04 INFO (MainThread) [supervisor.host.manager] Host information reload completed
$ ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: true
supported: true
version: 2023.12.1
version_latest: 2023.12.1
$ docker inspect hassio_supervisor -f '{{.HostConfig.SecurityOpt}}'
[apparmor=hassio-supervisor label=disable]

big "but"

docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
    xargs pstree -p |  grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
    xargs sudo ps Z

(or if you prefer / don't want to install pstree, the single-depth: docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid

gives

LABEL                               PID TTY      STAT   TIME COMMAND
unconfined                         2503 ?        Ss     0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
unconfined                         2638 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
unconfined                         2641 ?        Ss     0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
unconfined                         2666 ?        S      0:00 s6-supervise s6rc-fdholder
unconfined                         2667 ?        S      0:00 s6-supervise s6rc-oneshot-runner
unconfined                         2680 ?        Ss     0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
unconfined                         2722 ?        S      0:00 s6-supervise supervisor
unconfined                         2724 ?        S      0:00 s6-supervise watchdog
unconfined                         2726 ?        Ss     0:00 bash /usr/bin/bashio ./run watchdog
unconfined                         2727 ?        Ssl    0:06 python3 -m supervisor
unconfined                         7424 ?        S      0:00 sleep 300

that is the apparmor profile is loaded BUT NOT APPLIED to any supervisor process

but after supervisor upgrade

(or any other event which causes recreation of hassio_supervisor container on a running system)

odroidn2:~:# docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
  xargs pstree -p |  grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
  xargs ps Z
LABEL                               PID TTY      STAT   TIME COMMAND
hassio-supervisor (enforce)        9035 ?        Ss     0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
hassio-supervisor (enforce)        9077 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
hassio-supervisor (enforce)        9079 ?        Ss     0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
hassio-supervisor (enforce)        9086 ?        S      0:00 s6-supervise s6rc-fdholder
hassio-supervisor (enforce)        9087 ?        S      0:00 s6-supervise s6rc-oneshot-runner
hassio-supervisor (enforce)        9095 ?        Ss     0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
hassio-supervisor (enforce)        9135 ?        S      0:00 s6-supervise supervisor
hassio-supervisor (enforce)        9137 ?        S      0:00 s6-supervise watchdog
hassio-supervisor (enforce)        9138 ?        Ssl    0:03 python3 -m supervisor
hassio-supervisor (enforce)        9142 ?        Ss     0:00 bash /usr/bin/bashio ./run watchdog
hassio-supervisor (enforce)        9773 ?        S      0:00 sleep 30

odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: false
supported: true
version: 2023.12.1
version_latest: 2023.12.1

odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3
24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed
24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged

odroidn2:~:# docker logs hassio_supervisor 2>&1 | grep 'CRITICAL\|ERROR\|WARNING'
24-01-25 20:07:42 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.rauc] Host has no rauc support. OTA updates have been disabled.
24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.agent] No OS-Agent support on the host. Some Host functions have been disabled.
24-01-25 20:07:52 CRITICAL (MainThread) [supervisor.core] System is running in an unhealthy state and needs manual intervention!

AppArmor profile now is in effect and not sufficient for (socket (2) as socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT) inside of) udev_monitor_new_from_netlink (3)

donaldguy commented 9 months ago

Thus there appear to be two important issues:

  1. why is the apparmor profile as written not sufficient for this udev socket open on Debian 12

    • (while remaining sufficient on HAOS ... one assumes/hopes).
    • My best guess (which I should test, but haven't yet, is that its that deny network raw o nline 8? ; that or the policy needs to be being applied to the actual pyc in pyudev and/or libudev.so not just the python interpretor? or there is a new capability needed?)
  2. why is the apparmor profile not being applied to the container's proccesses on first-start after boot? (also no longer, for me, on siblings of ns pid1 spawned by docker exec)

    • (is this a docker/containerd.bug? a kernel bug? a systemd bug?

    • (or is it the initialization chain of hassio-apparmor being race-y? As I demonstrated in this comment it does at least seem like containerd will not start the container when the profile is not loaded - but I'm thinking maybe there is a race between allocation of the name and restore of its contents? [and that apparmor modules short circuit application of a profile if its found empty at the time?])

donaldguy commented 9 months ago

I think I ended up leaving it out of the version posted in there, but its perhaps worth noting that with the profile applied, pyudev.Context().list_devices() still works fine [that is, I believe: udenv_enumerate_new (3) and udev_list_entry_get_next(3)) in the supervisor process itself (with debug/debug_wait]

so the problem seems to be potentially less the udev part writ large then the (monitor_new_from_)netlink part in particular

(udevadm monitor in the supervisor container (as well as in the hassio_audio container under docker-default apparmor) still works - but that appears to be by virtue of the Ux v ix the policy designates it with)

donaldguy commented 9 months ago

speaking to OP's workaround:

For my part systemctl restart hassio-supervisor et al are not sufficient (at least not consistently[^1]) to get from "bad" to "good", only a full system reboot - but this may have changed at some point since I started encountering the issue

Conversely, while I was writing https://github.com/home-assistant/supervisor/issues/4839 I could have sworn at some point a "soft" restart was initally but not repeatably sufficient to get from "good" to "bad" (but that could, for example, be something like /run/supervisor/startup-marker causing a recreate instead of a docker stop / docker start)

[^1]: though one early time I was looking into this, the act of editing config.json or using ha supervisor options to set debug and debug_block and then bootstrapping my first supervisor vscode environment did appear to immediately sidestep the failure--launching it to full hesienbug status. I do not know if that suggests potentially an even more subtle timing issue (such as last supervisor TCP_WAIT-ish socket cleanup?) or if I just somewhere that time I e.g. rebooted the machine without really noting it as part of what I did.

donaldguy commented 9 months ago

but stopping my thread-spamming, I would be very curious whether other people experiencing this sort of issue can replicate (or countermand) my catching the processes unconstrained

(using docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid, the deeper pstree|sed version, or just aa-status/apparmor_status though there you are looking for presence/absence of the hassio-supervisor profile in the list of processes)

rlue commented 9 months ago

I just installed hassio supervised on a fresh debian (bookworm) system and happy to help debug, but much of this thread is over my head. What I can say with certainty is:

For my part systemctl restart hassio-supervisor et al are not sufficient (at least not consistently) to get from "bad" to "good", only a full system reboot

This was also my experience. Happy to run any commands and provide their output for corroboration.

github-actions[bot] commented 7 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

okin commented 7 months ago

https://github.com/home-assistant/version/pull/370 fixed https://github.com/home-assistant/supervisor/issues/4381. Is this issue fixed by the PR aswell?

donaldguy commented 6 months ago

I would fully expect so, but you know, it needs to roll out first

for what it’s worth I belatedly applied my suggested workaround of pulling the profile entirely and haven’t had an issue since, though technically that’s a different thing

ikifar2012 commented 6 months ago

home-assistant/version#370 fixed home-assistant/supervisor#4381. Is this issue fixed by the PR aswell?

Still awaiting this https://github.com/home-assistant/version/pull/372 to be merged as well... We will see after the next supervisor update

github-actions[bot] commented 4 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.