Open donaldguy opened 9 months ago
Extracting out this bit of ~speculation to its own comment:
I didn't know. It might be a kernel bug, or a docker bug, or a systemd bug. Or it might just be a good old fashioned race condition?
At the highest level these are activating in the right order
odroidn2:~:# systemctl status apparmor | grep active
Active: active (exited) since Thu 2024-01-25 18:42:39 MST; 19min ago
odroidn2:~:# systemctl status hassio-apparmor | grep active
Active: active (exited) since Thu 2024-01-25 18:42:40 MST; 19min ago
odroidn2:~:# systemctl status hassio-supervisor | grep active
Active: active (running) since Thu 2024-01-25 18:42:44 MST; 19min ag
this chain is there:
hassio-supervisor.service
And I can confirm that after a systemctl stop hassio-supervisor && docker rm -f $(docker ps -aq) && aa-remove-unknown && systemctl start hassio-supervisor
, you end up at a-different-than-this-issue supervisor restart loop involving such journal lines as
audit[8358]: AVC apparmor="DENIED" operation="change_onexec" info="label not found" error=-2 profile="unconfined" hassio-supervisor" pid=15881 comm="runc:[2:INIT]" audit: type=1400 audit(1706235970.127:70): apparmor="DENIED" operation="change_onexec" info="label not found" error=-2 profile="unconfined" name="hassio-supervisor" pid=15881 comm="runc:[2:INIT]" ... hassio-supervisor[15823]: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: unable to apply apparmor profile: apparmor failed to apply profile: write /proc/self/attr/apparmor/exec: no such file or directory: unknown hassio-supervisor[15823]: Error: failed to start containers: hassio_supervisor
followed by re-pulling the image and trying again fruitlessly in a loop which terminates only when you run systemctl start hassio-apparmor
(at which point you end up landing in the profile-applied-to-supervisor-so-it-doesn't work state, until system reboot)
But clearly there is something going (enabalingly) wrong in the application of profile from cold start of an existing hassio_supervisor container
So there may be value (once the profile is not breaking the udev function) in adding to hassio-apparmor.service
(and indeed potentially hassos-apparmor.service
something like a
ExecStartPre=/usr/bin/aa-enabled -q
and/or indeed fully move the sort of apparmor_parser
call which is the heart of these services to be instead a ExecStartPre
of the supervisor
Having the same issue
24-02-22 15:57:00 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
Removal of app-armor from supervisor compose file eliminates this error, HA updates work as expected too. Thus I suppose the issue is in app-armor supervisor's profile, as originally reported.
specifically, as I sort of said, believe the issue with the policy is,
for some reason now-and-not-before, the interpretation of deny network raw
https://github.com/home-assistant/version/blob/master/apparmor_stable.txt#L8
now applying also to netlink
domain, meaning the
I don't believe just adding an allow network netlink raw
will work ... (but maybe it will with the right ordering?)
I think giving it a cut out will require instead saying like...
deny network inet raw,
deny network ax25 raw,
deny network ipx raw,
deny network appletalk raw,
deny network netrom raw,
deny network bridge raw,
deny network atpmvc raw,
deny network x25 raw,
deny network inet6 raw,
deny network rose raw,
deny network netbui raw,
deny network security raw,
deny network key raw,
deny network packet raw,
deny network ash raw,
deny network econet raw,
deny network atmsvc raw,
deny network sna raw,
deny network irda raw,
deny network ppox raw,
deny network wanpipe raw,
deny network bluetooth raw,
deny network unix raw,
deny network rds raw,
deny network llc raw,
deny network can raw,
deny network tipc raw,
deny network iucy raw,
deny network rxrpc raw,
deny network isdn raw,
deny network phonet raw,
deny network ieee802154 raw,
deny network caif raw,
deny network alg raw,
deny network nfc raw,
deny network vsock raw,
deny network mpls raw,
deny network ib raw,
Of course ... the logical question there is ... does it actually need non-raw access to most of those domains? (probs no)
so probs the sweeping [allow] network
[] [] [*] of line 7 probs should go
maybe want like(?)
allow network inet tcp,
allow network inet6 tcp,
allow network bluetooth,
allow network netlink,
(maybe bridge? unix?)
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
Thanks for this in depth analyzes!
I was able to reproduce the original symptom (Supervisor getting unhealthy) on Supervised on a Debian 12 Bookworm.
I then was actually wondering if I can reproduce this with HAOS 12.2.rc1, specifically because it comes with new Buildroot which pushes it closer to what Debian Bookworm is using (specifically, Docker 25). However, I was not able to reproduce the problem on HAOS 12.2.rc1.
Which really made me question, what is different between Debian 12 and HAOS 12.2.rc1 :thinking:
Just like in Supervised, the AppArmor profile only gets applied on initial Supervisor container start on HAOS 12.2.rc1 as well. So this part seems to (mis)behave exactly the same way.
HAOS uses AppArmor 3.1.2 already in HAOS 11, no update has taken place there. Debian 12 uses 3.0.8. I looked through the changes between the two versions, but nothing caught my eyes.
So far I am a bit clueless :cry:
I feel it should always have not worked (as in, deny network raw
should always lead to deny network netlink raw
).
In any case, I think applying an explicit allow protocols as you suggested make sense. We don't need Bluetooth in Supervisor, so I think I'd go with:
network inet stream,
network inet6 stream,
network inet dgram,
network inet6 dgram,
network netlink raw,
I am still lost as to why Debian behaves different than HAOS here :cry:
A few things to note: Docker exec seems to not apply the AppArmor profile:
hassio-supervisor (enforce) root 17131 0.2 13.8 192868 136592 ? Ssl 15:32 0:00 python3 -m supervisor
hassio-supervisor (enforce) root 17132 0.0 0.3 4160 3516 ? Ss 15:32 0:00 bash /usr/bin/bashio ./run watchdog
hassio-supervisor (enforce) root 17151 0.0 0.0 1612 4 ? S 15:32 0:00 sleep 300
unconfined root 17325 0.0 0.4 10008 4488 pts/1 S+ 15:33 0:00 sudo docker exec -it hassio_supervisor /bin/bash
unconfined root 17326 0.0 0.0 10008 480 pts/6 Ss 15:33 0:00 sudo docker exec -it hassio_supervisor /bin/bash
unconfined root 17327 0.0 2.8 1770644 28048 pts/6 Sl+ 15:33 0:00 docker exec -it hassio_supervisor /bin/bash
unconfined root 17342 0.0 0.2 2616 2288 pts/0 Ss 15:33 0:00 /bin/bash
Hence my udevadm command always worked, even with the uncorrected profile.
The AppArmor rules processing seems not very well defined to me, especially for networking. The syntax is:
NETWORK RULE = [ QUALIFIERS ] 'network' [ DOMAIN ] [ TYPE | PROTOCOL ]
The two rules so far:
network,
deny network raw,
Again, technically allowed all networking, but then disallowed all raw
protocol for all domains, even for netlink
.
In any case, raw seems to be disallowed by default, so allowing only netlink explicitly seems to be the way to go. It seems that unix
domain is also allowed by default. I did not found where this defaults are specified. But it is probably better to enable it explicitly too, so I end up with:
network unix stream,
network inet stream,
network inet6 stream,
network inet dgram,
network inet6 dgram,
network netlink raw,
I'll create a PR to update the profile, for dev and beta channel first.
just to confirm I was bitten by the above deny rule. I changed to what @agners suggested and all is healthy.
TL;DR basically like the title says: I have discovered:
hassio-supervisor
apparmor policy from being actually applied/enforced on any process in an existing-on-boothassio_supervisor
container (with e.g.ps auZ
showing them "unrestricted", and `apparmor_status listing none of them)Describe the issue you are experiencing
Until pretty recently after many hours of investigating and writing I did not have a specific proposed pathology (had only
the "disease"
), but I have found it (the worrying "cure"
) and it raises potentially greater concerns, including both how it interacts with the udev issue and that it occurs at all.the "disease"
As captured in the log-snippets below (and at root in #4381 and potentially related to continuing issues with #4827 after release of docker 25.0.1), the eventual problem is that (reliably after container recreation, so e.g. supervisor update, as well as potentially after some supervisor restarts in my testing, and also apparently just in letting HA run sometimes?) this code goes down the unhappy path:
https://github.com/home-assistant/supervisor/blob/d24543e1039c84d9c8c15aa66204568d98cf6f03/supervisor/hardware/monitor.py#L29-L43
producing a false-positive (or at least transiently-valid) claim of running an unsupported/unhealthy configuration, failing by ~short-circuit under the "privileged" check, owing to what I would assert is an erroneous re-usage of that error code in the network code, as
remains the case at every point.
Against an already running core, this also blocks add-on (and core) upgrades even as the supervisor may be de facto functioning adequately (for most purposes, presumably it would fail at any responsibilities specifically related to hardware (change) monitoring)
the worrying "cure"
With limited exceptions (when running Docker 25.0.0 and hitting other problems) I have generally found that rebooting the entire machine will solve my problem… for a while.
I did not know why, even as I started to dig into this udev issue and hone in on the specifics within libudev
but as it turns out, after I reboot, the loaded
hassio-supervisor
profile is simply not applied to the containerthis being in contrast to the other
hassio_*
containers which are instead/at least underdocker-default
profile. (by virtue having no specified alternative profile)What type of installation are you running?
Home Assistant Supervised
affecting supported systems?
I am of the impression that supervised-installer run on Debian Bookworm is considered supported, so ... then yes
But even if it is not, whereas there seems to be a reproducible critical failure upon the application of the stable apparmor profile at all, the issue may very well be waiting in the wings for HAOS
apparmor
,systemd
/sytemd-udev
;I hope y'all take this seriously given the many hours [more than I meant to] I have put into it
but I will put my cards on the table, admitting I might be of spec somewhere (I believe only in an older kernel, but let me know if you see something else):
```shell $ uname -a Linux odroidn2 6.1.63-current-meson64 #1 SMP PREEMPT Mon Nov 20 10:52:19 UTC 2023 aarch64 GNU/Linux $ sudo dpkg-query -l | grep -E 'apparmor|cifs-utils|curl|dbus|jq|libglib2.0-bin|lsb-release|network-manager|nfs-common|systemd|udisks2|wget|^linux|docker|homeassistant' hii apparmor 3.0.8-3 arm64 user-space parser utility for AppArmor ii cifs-utils 2:7.0-2 arm64 Common Internet File System utilities ii curl 7.88.1-10+deb12u5 arm64 command line tool for transferring data with URL syntax ii dbus 1.14.10-1~deb12u1 arm64 simple interprocess messaging system (system message bus) ii dbus-bin 1.14.10-1~deb12u1 arm64 simple interprocess messaging system (command line utilities) ii dbus-broker 33-1 arm64 Linux D-Bus Message Broker ii dbus-daemon 1.14.10-1~deb12u1 arm64 simple interprocess messaging system (reference message bus) ii dbus-session-bus-common 1.14.10-1~deb12u1 all simple interprocess messaging system (session bus configuration) ii dbus-system-bus-common 1.14.10-1~deb12u1 all simple interprocess messaging system (system bus configuration) ii dbus-user-session 1.14.10-1~deb12u1 arm64 simple interprocess messaging system (systemd --user integration) ii docker-ce 5:25.0.1-1~debian.12~bookworm arm64 Docker: the open-source application container engine ii docker-ce-cli 5:25.0.1-1~debian.12~bookworm arm64 Docker CLI: the open-source application container engine ii homeassistant-supervised 1.5.0 all Home Assistant Supervised ii jq 1.6-2.1 arm64 lightweight and flexible command-line JSON processor ii libapparmor1:arm64 3.0.8-3 arm64 changehat AppArmor library ii libcurl3-gnutls:arm64 7.88.1-10+deb12u5 arm64 easy-to-use client-side URL transfer library (GnuTLS flavour) ii libcurl4:arm64 7.88.1-10+deb12u5 arm64 easy-to-use client-side URL transfer library (OpenSSL flavour) ii libdbus-1-3:arm64 1.14.10-1~deb12u1 arm64 simple interprocess messaging system (library) ii libglib2.0-bin 2.74.6-2 arm64 Programs for the GLib library ii libjq1:arm64 1.6-2.1 arm64 lightweight and flexible command-line JSON processor - shared library ii libpam-systemd:arm64 252.19-1~deb12u1 arm64 system and service manager - PAM module ii libsystemd-shared:arm64 252.19-1~deb12u1 arm64 systemd shared private library ii libsystemd0:arm64 252.19-1~deb12u1 arm64 systemd utility library ii libudisks2-0:arm64 2.9.4-4 arm64 GObject based library to access udisks2 ii lsb-release 12.0-1 all Linux Standard Base version reporting utility (minimal implementation) ii network-manager 1.42.4-1 arm64 network management framework (daemon and userspace tools) ii network-manager-openvpn 1.10.2-2 arm64 network management framework (OpenVPN plugin core) ii networkd-dispatcher 2.2.3-1 all Dispatcher service for systemd-networkd connection status changes ii nfs-common 1:2.6.2-4 arm64 NFS support files common to client and server ii python3-dbus 1.3.2-4+b1 arm64 simple interprocess messaging system (Python 3 interface) ii systemd 252.19-1~deb12u1 arm64 system and service manager ii systemd-journal-remote 252.19-1~deb12u1 arm64 tools for sending and receiving remote journal logs ii systemd-resolved 252.19-1~deb12u1 arm64 systemd DNS resolver ii systemd-sysv 252.19-1~deb12u1 arm64 system and service manager - SysV compatibility symlinks ii udisks2 2.9.4-4 arm64 D-Bus service to access and manipulate storage devices ii wget 1.21.3-1+b1 arm64 retrieves files from the web ```uname -a
and filtereddpkg-query -l
Which operating system are you running on?
Debian
Steps to reproduce the issue
Reboot host machine (with a
hassio_supervisor
container already created. If I stop and delete before reboot, the initial creation is "broken". An additional reboot achieves)Observer that supervisor comes up reporting as in healthy state;
optionally: proceed in blissful ignorance as to why/how
Alternatively ruin your sense of well-being by looking into the apparmor status of that-there container
3.5. Confirm that in fact there is such profile actually
3.75. Try various methods to simply restart the container [
ha supervisor restart
,systemctl restart hassio-supervisor
or evensystemctl stop hassio-supervisor
wait a whilesystemctl start hassio-supervisor
... and oops actually everything was "fine";I kinda swear at least once it broke on restart, but I might be misremembering, or hit one of the edges suggested below)
still "fine" is all the processes remain unconstrained from the apparmor profile intended for the,
Okay, but anything which actually recreates the container (as
/usr/sbin/hassio-supervisor
will do when the startup-marker (/run/supervisor/startup-marker
) is left in place (so that is the supervisor crashes early?), or indeed as outlined above when the profile is missing initially, or on supervisor update; or if simply someone/something runsdocker rm
on the stopped container)So yeah we can do
... for example
And now we will find our error state of interest
What has changed? I mean you know, but its
i.e.
We can go deeper actually. You can see most the same thing by starting with debug and debug_wait and using the and attaching from vscode; but to really get there - lets more awkwardly do
This should (still[^1]) give you into an interactive shell that is set up like the supervisor, including (which running our
docker inspect ...
pipeline can verify) the apparmor profile.[^1]: When I first investigated this on Tuesday (with docker 25.0.0 installed) it was sufficient to do
docker exec -it hassio_supervisor bash
to repro as below, but something has changed and now that generates a shell sibling to the pid1 as unrestricted instead of also getting the apparmor profile. Bafflingly at one point Tuesday this behavior seemed to change after I ransystemd-analyze set-log-level debug
)From here its instructive to do 3 things (restarting the shell between or not as you prefer):
python3 -m supervisor
) and see it fail same as when deamonized (includingCRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
)udevadm monitor
, and discover that it works perfectly (prints the headers and will respond if I go e.g. plug and unplug USB devices)Attempt my minimal repro case, expanded from the real line via the pyudev source by starting a
python
repl, thenand like the supervisor, you'll get
Why do these things? Well, it reveals that the apparmor policy is in effect AND is killing udev connect for the supervisor (and the python interpreter in general, and most things).
How do I know that? Because udevadm still works. It is calling the same function (
udev_monitor_new_from_netlink
) of the same library (from alpine'seudev-libs
3.2.11-r8): https://github.com/eudev-project/eudev/blob/611b6fbae2cca9ceeacb820befb3a0e8caec88b8/src/udev/udevadm-monitor.c#L214(Unless in fact pyudev's wheel is bringing along its own copy of e.g. systemd libudev; but I don't see evidence of that)
So it follows that the only reason that udevadm still works when the other two don't is because the apparmor policy gives it an unrestricted pass:
where python gets no such
Additional information
So what is actually failing?
the
libudev
callit's apparently
udev_monitor_new_from_netlink
(3) which is unexpectedly returning a NULL pointer and settingerrno
to 13 ("Access denied")eudev-libs
3.2.11-r8. meaning it is a simple wrapper around itsudev_monitor_new_from_netlink_fd
function, whose source begins at https://github.com/eudev-project/eudev/blob/v3.2.11/src/libudev/libudev-monitor.c#L166[^2]: Up to writing this section that didn't click and I was looking at systemd-udev's version ; and it seems possible that the mismatch (with the offered udevd) could be in play as the issue. But the code between it (eudev) and the systemd versions of it and the nested
device_monitor_new_full
, do not differ greatly, though perhaps systemd's version'sif (DEBUG_LOGGING) {
gated stanza's may end up instructive]That's pretty much, at core, 3 syscalls:
access
on/run/udev/control
- which,ls -l
shows itsrw------- root root
and a stat also passedsocket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT)
Indeed I can just now say with some confidence it is this syscall, because I belatedly discovered that pyudev does hook a few log priority functions and
That error appearing to be generated at https://github.com/eudev-project/eudev/blob/v3.2.11/src/libudev/libudev-monitor.c#L203-L206 :
getsockname(udev_monitor->sock, &snl.sa, &addrlen);
So... what's the problem? I don't know. Most of the internet agrees that a
net_admin
capability ought to clear the way for interfacing with netlinkIf I had to guess, I'd point at maybe the:
at https://github.com/home-assistant/version/blob/master/apparmor_stable.txt#L7-L8
hitting up against the
SOCK_RAW
flag on the call ... but again nothing here has changed recently? or has it?Anyway ... this turned into 8 hours of investigation & writing today on top of the 5 of investigation and 3 of probs useless PR (#4834) on Tuesday; ADHD be like that.
the rest of the standard bug form stuff
### Anything in the Supervisor logs that might be useful for us? ```txt ... CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor! ... System is running in an unhealthy state and needs manual intervention! #after restart ... CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor! ... (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3 24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed 24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete 24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged ``` ### System Health information ## System Information version | core-2024.1.5 -- | -- installation_type | Home Assistant Supervised dev | false hassio | true docker | true user | root virtualenv | false python_version | 3.11.6 os_name | Linux os_version | 6.1.63-current-meson64 arch | aarch64 timezone | America/Denver config_dir | /configHome Assistant Community Store
GitHub API | ok -- | -- GitHub Content | ok GitHub Web | ok GitHub API Calls Remaining | 5000 Installed Version | 1.34.0 Stage | running Available Repositories | 1380 Downloaded Repositories | 19Home Assistant Supervisor
host_os | Debian GNU/Linux 12 (bookworm) -- | -- update_channel | stable supervisor_version | supervisor-2023.12.1 agent_version | 1.6.0 docker_version | 25.0.1 disk_total | 915.8 GB disk_used | 10.8 GB healthy | failed to load: Unhealthy supported | true supervisor_api | failed to load: unreachable version_api | ok installed_addons | Studio Code Server (5.15.0), ESPHome (2023.12.8), Mosquitto broker (6.4.0), Tailscale (0.14.0), Music Assistant BETA (2.0.0b84), Zigbee2MQTT (1.35.1-1), Govee to MQTT Bridge (2024.01.21-33e8cd7c)Dashboards
dashboards | 1 -- | -- resources | 12 views | 5 mode | storageRecorder
oldest_recorder_run | January 16, 2024 at 9:45 PM -- | -- current_recorder_run | January 25, 2024 at 7:53 PM estimated_db_size | 290.71 MiB database_engine | sqlite database_version | 3.41.2