Closed Lockszmith-GH closed 4 months ago
The main issue this is being tracked under is https://github.com/home-assistant/supervisor/issues/4381 it appears to happen only under bookworm and no root cause has been identified yet, the only thing I can say for certain is it seems to happen whenever the container is recreated and resolves itself with a restart, until the next supervisor update when the container is recreated yet again.
thanks for confirming @ikifar2012.
So it seems, at least at this point, I would ask the maintainers for a system state that will report this unhealthy state - something that I can send an alert on, and track.
See home-assistant/supervisor#4839 for my investigation of what concerning-things seems to be happening here
The "short" version of that is
hassio_supervisor
container:$ uptime
18:44:13 up 1 min, 1 user, load average: 1.65, 0.75, 0.28
$ docker logs hassio_supervisor 2>&1 | tail -2
24-01-25 18:44:04 INFO (MainThread) [supervisor.resolution.fixup] System autofix complete
24-01-25 18:44:04 INFO (MainThread) [supervisor.host.manager] Host information reload completed
$ ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: true
supported: true
version: 2023.12.1
version_latest: 2023.12.1
$ docker inspect hassio_supervisor -f '{{.HostConfig.SecurityOpt}}'
[apparmor=hassio-supervisor label=disable]
docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
xargs pstree -p | grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
xargs sudo ps Z
(or if you prefer / don't want to install pstree, the single-depth: docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid
gives
LABEL PID TTY STAT TIME COMMAND
unconfined 2503 ? Ss 0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
unconfined 2638 ? S 0:00 s6-supervise s6-linux-init-shutdownd
unconfined 2641 ? Ss 0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
unconfined 2666 ? S 0:00 s6-supervise s6rc-fdholder
unconfined 2667 ? S 0:00 s6-supervise s6rc-oneshot-runner
unconfined 2680 ? Ss 0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
unconfined 2722 ? S 0:00 s6-supervise supervisor
unconfined 2724 ? S 0:00 s6-supervise watchdog
unconfined 2726 ? Ss 0:00 bash /usr/bin/bashio ./run watchdog
unconfined 2727 ? Ssl 0:06 python3 -m supervisor
unconfined 7424 ? S 0:00 sleep 300
that is the apparmor profile is loaded BUT NOT APPLIED to any supervisor process
(or any other event which causes recreation of hassio_supervisor
container on a running system)
odroidn2:~:# docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
xargs pstree -p | grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
xargs ps Z
LABEL PID TTY STAT TIME COMMAND
hassio-supervisor (enforce) 9035 ? Ss 0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
hassio-supervisor (enforce) 9077 ? S 0:00 s6-supervise s6-linux-init-shutdownd
hassio-supervisor (enforce) 9079 ? Ss 0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
hassio-supervisor (enforce) 9086 ? S 0:00 s6-supervise s6rc-fdholder
hassio-supervisor (enforce) 9087 ? S 0:00 s6-supervise s6rc-oneshot-runner
hassio-supervisor (enforce) 9095 ? Ss 0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
hassio-supervisor (enforce) 9135 ? S 0:00 s6-supervise supervisor
hassio-supervisor (enforce) 9137 ? S 0:00 s6-supervise watchdog
hassio-supervisor (enforce) 9138 ? Ssl 0:03 python3 -m supervisor
hassio-supervisor (enforce) 9142 ? Ss 0:00 bash /usr/bin/bashio ./run watchdog
hassio-supervisor (enforce) 9773 ? S 0:00 sleep 30
odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: false
supported: true
version: 2023.12.1
version_latest: 2023.12.1
odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3
24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed
24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged
odroidn2:~:# docker logs hassio_supervisor 2>&1 | grep 'CRITICAL\|ERROR\|WARNING'
24-01-25 20:07:42 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.rauc] Host has no rauc support. OTA updates have been disabled.
24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.agent] No OS-Agent support on the host. Some Host functions have been disabled.
24-01-25 20:07:52 CRITICAL (MainThread) [supervisor.core] System is running in an unhealthy state and needs manual intervention!
AppArmor profile now is in effect and not sufficient for (socket
(2)
as socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT)
inside of) udev_monitor_new_from_netlink
(3)
Thus there appear to be two important issues:
why is the apparmor profile as written not sufficient for this udev socket open on Debian 12
deny network raw
o nline 8? ; that or the policy needs to be being applied to the actual pyc in pyudev and/or libudev.so not just the python interpretor? or there is a new capability needed?)why is the apparmor profile not being applied to the container's proccesses on first-start after boot? (also no longer, for me, on siblings of ns pid1 spawned by docker exec
)
(is this a docker/containerd.bug? a kernel bug? a systemd bug?
(or is it the initialization chain of hassio-apparmor being race-y? As I demonstrated in this comment it does at least seem like containerd will not start the container when the profile is not loaded - but I'm thinking maybe there is a race between allocation of the name and restore of its contents? [and that apparmor modules short circuit application of a profile if its found empty at the time?])
I think I ended up leaving it out of the version posted in there, but its perhaps worth noting that with the profile applied, pyudev.Context().list_devices()
still works fine [that is, I believe: udenv_enumerate_new
(3)
and udev_list_entry_get_next
(3)
) in the supervisor process itself (with debug/debug_wait
]
so the problem seems to be potentially less the udev
part writ large then the (monitor_new_from_)netlink
part in particular
(udevadm monitor
in the supervisor container (as well as in the hassio_audio
container under docker-default
apparmor) still works - but that appears to be by virtue of the Ux
v ix
the policy designates it with)
speaking to OP's workaround:
For my part systemctl restart hassio-supervisor
et al are not sufficient (at least not consistently[^1]) to get from "bad" to "good", only a full system reboot - but this may have changed at some point since I started encountering the issue
Conversely, while I was writing https://github.com/home-assistant/supervisor/issues/4839 I could have sworn at some point a "soft" restart was initally but not repeatably sufficient to get from "good" to "bad" (but that could, for example, be something like /run/supervisor/startup-marker
causing a recreate instead of a docker stop
/ docker start
)
[^1]: though one early time I was looking into this, the act of editing config.json
or using ha supervisor options
to set debug
and debug_block
and then bootstrapping my first supervisor vscode environment did appear to immediately sidestep the failure--launching it to full hesienbug status. I do not know if that suggests potentially an even more subtle timing issue (such as last supervisor TCP_WAIT-ish socket cleanup?) or if I just somewhere that time I e.g. rebooted the machine without really noting it as part of what I did.
but stopping my thread-spamming, I would be very curious whether other people experiencing this sort of issue can replicate (or countermand) my catching the processes unconstrained
(using docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid
, the deeper pstree|sed
version, or just aa-status
/apparmor_status
though there you are looking for presence/absence of the hassio-supervisor
profile in the list of processes)
I just installed hassio supervised on a fresh debian (bookworm) system and happy to help debug, but much of this thread is over my head. What I can say with certainty is:
For my part
systemctl restart hassio-supervisor
et al are not sufficient (at least not consistently) to get from "bad" to "good", only a full system reboot
This was also my experience. Happy to run any commands and provide their output for corroboration.
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
https://github.com/home-assistant/version/pull/370 fixed https://github.com/home-assistant/supervisor/issues/4381. Is this issue fixed by the PR aswell?
I would fully expect so, but you know, it needs to roll out first
for what it’s worth I belatedly applied my suggested workaround of pulling the profile entirely and haven’t had an issue since, though technically that’s a different thing
home-assistant/version#370 fixed home-assistant/supervisor#4381. Is this issue fixed by the PR aswell?
Still awaiting this https://github.com/home-assistant/version/pull/372 to be merged as well... We will see after the next supervisor update
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
OS Version
Debian GNU/Linux 12 (bookworm)
System Information
Linux ha 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
What happened?
Whenever the supervisor is updated, the Unhealthy system - Not privileged. Resolving this is easy, all I need to do is to restart the supervisor service from the OS prompt - I actually have a cron job that does it one a day.
I've been reading the threads, and I was not able to pinpointed the root cause of this issue.
This bug report has 2 goals:
The only method I have right now is:
It works, but very crude. I put it in a cron job as this has become really annoying.
Machine Type
generic-x86-64
Installer output
Relevant log output
ADR
Code of Conduct