home-assistant / supervisor

:house_with_garden: Home Assistant Supervisor
https://home-assistant.io/hassio/
Apache License 2.0
1.66k stars 607 forks source link

supervisor running unconfined (w/o apparmor profile) after reboot THEN w/ insufficient perms to monitor udev after recreate #4839

Open donaldguy opened 5 months ago

donaldguy commented 5 months ago

TL;DR basically like the title says: I have discovered:

  1. that the supervisor has only ever(? recently-ever?) been working/starting for me because some bug on (re)boot is preventing the hassio-supervisor apparmor policy from being actually applied/enforced on any process in an existing-on-boot hassio_supervisor container (with e.g. ps auZ showing them "unrestricted", and `apparmor_status listing none of them)
  2. As soon as something causes the policy to actually be applied (most reproducibly a re-creation of the container, but possibly others) - the system cannot be healthy due to udev permission failures; as in #4381

Describe the issue you are experiencing

Until pretty recently after many hours of investigating and writing I did not have a specific proposed pathology (had only the "disease"), but I have found it (the worrying "cure") and it raises potentially greater concerns, including both how it interacts with the udev issue and that it occurs at all.

the "disease"

As captured in the log-snippets below (and at root in #4381 and potentially related to continuing issues with #4827 after release of docker 25.0.1), the eventual problem is that (reliably after container recreation, so e.g. supervisor update, as well as potentially after some supervisor restarts in my testing, and also apparently just in letting HA run sometimes?) this code goes down the unhappy path:

https://github.com/home-assistant/supervisor/blob/d24543e1039c84d9c8c15aa66204568d98cf6f03/supervisor/hardware/monitor.py#L29-L43

producing a false-positive (or at least transiently-valid) claim of running an unsupported/unhealthy configuration, failing by ~short-circuit under the "privileged" check, owing to what I would assert is an erroneous re-usage of that error code in the network code, as

$ docker inspect hassio_supervisor -f '{{.HostConfig.Privileged}}'
true

remains the case at every point.

Against an already running core, this also blocks add-on (and core) upgrades even as the supervisor may be de facto functioning adequately (for most purposes, presumably it would fail at any responsibilities specifically related to hardware (change) monitoring)

the worrying "cure"

With limited exceptions (when running Docker 25.0.0 and hitting other problems) I have generally found that rebooting the entire machine will solve my problem… for a while.

I did not know why, even as I started to dig into this udev issue and hone in on the specifics within libudev

but as it turns out, after I reboot, the loaded hassio-supervisor profile is simply not applied to the container

$ docker inspect hassio_supervisor -f '{{.HostConfig.SecurityOpt}}'
[apparmor=hassio-supervisor label=disable]

$ aa-status | grep hassio-supervisor
   hassio-supervisor
   hassio-supervisor///usr/bin/gdbus
   hassio-supervisor///usr/bin/git

# the shortness of which alone should hint the issue, but we can confirm

$ docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid
LABEL                               PID TTY      STAT   TIME COMMAND
LABEL                               PID TTY      STAT   TIME COMMAND
unconfined                         2638 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
unconfined                         2666 ?        S      0:00 s6-supervise s6rc-fdholder
unconfined                         2667 ?        S      0:00 s6-supervise s6rc-oneshot-runner
unconfined                         2722 ?        S      0:00 s6-supervise supervisor
unconfined                         2724 ?        S      0:00 s6-supervise watchdog
# with a more elaborate pipeline used below indicating that it does not in fact somehow come into effect by `python`

this being in contrast to the other hassio_* containers which are instead/at least under docker-default profile. (by virtue having no specified alternative profile)

What type of installation are you running?

Home Assistant Supervised

affecting supported systems?

I hope y'all take this seriously given the many hours [more than I meant to] I have put into it

but I will put my cards on the table, admitting I might be of spec somewhere (I believe only in an older kernel, but let me know if you see something else):

uname -a and filtered dpkg-query -l ```shell $ uname -a Linux odroidn2 6.1.63-current-meson64 #1 SMP PREEMPT Mon Nov 20 10:52:19 UTC 2023 aarch64 GNU/Linux $ sudo dpkg-query -l | grep -E 'apparmor|cifs-utils|curl|dbus|jq|libglib2.0-bin|lsb-release|network-manager|nfs-common|systemd|udisks2|wget|^linux|docker|homeassistant' hii apparmor 3.0.8-3 arm64 user-space parser utility for AppArmor ii cifs-utils 2:7.0-2 arm64 Common Internet File System utilities ii curl 7.88.1-10+deb12u5 arm64 command line tool for transferring data with URL syntax ii dbus 1.14.10-1~deb12u1 arm64 simple interprocess messaging system (system message bus) ii dbus-bin 1.14.10-1~deb12u1 arm64 simple interprocess messaging system (command line utilities) ii dbus-broker 33-1 arm64 Linux D-Bus Message Broker ii dbus-daemon 1.14.10-1~deb12u1 arm64 simple interprocess messaging system (reference message bus) ii dbus-session-bus-common 1.14.10-1~deb12u1 all simple interprocess messaging system (session bus configuration) ii dbus-system-bus-common 1.14.10-1~deb12u1 all simple interprocess messaging system (system bus configuration) ii dbus-user-session 1.14.10-1~deb12u1 arm64 simple interprocess messaging system (systemd --user integration) ii docker-ce 5:25.0.1-1~debian.12~bookworm arm64 Docker: the open-source application container engine ii docker-ce-cli 5:25.0.1-1~debian.12~bookworm arm64 Docker CLI: the open-source application container engine ii homeassistant-supervised 1.5.0 all Home Assistant Supervised ii jq 1.6-2.1 arm64 lightweight and flexible command-line JSON processor ii libapparmor1:arm64 3.0.8-3 arm64 changehat AppArmor library ii libcurl3-gnutls:arm64 7.88.1-10+deb12u5 arm64 easy-to-use client-side URL transfer library (GnuTLS flavour) ii libcurl4:arm64 7.88.1-10+deb12u5 arm64 easy-to-use client-side URL transfer library (OpenSSL flavour) ii libdbus-1-3:arm64 1.14.10-1~deb12u1 arm64 simple interprocess messaging system (library) ii libglib2.0-bin 2.74.6-2 arm64 Programs for the GLib library ii libjq1:arm64 1.6-2.1 arm64 lightweight and flexible command-line JSON processor - shared library ii libpam-systemd:arm64 252.19-1~deb12u1 arm64 system and service manager - PAM module ii libsystemd-shared:arm64 252.19-1~deb12u1 arm64 systemd shared private library ii libsystemd0:arm64 252.19-1~deb12u1 arm64 systemd utility library ii libudisks2-0:arm64 2.9.4-4 arm64 GObject based library to access udisks2 ii lsb-release 12.0-1 all Linux Standard Base version reporting utility (minimal implementation) ii network-manager 1.42.4-1 arm64 network management framework (daemon and userspace tools) ii network-manager-openvpn 1.10.2-2 arm64 network management framework (OpenVPN plugin core) ii networkd-dispatcher 2.2.3-1 all Dispatcher service for systemd-networkd connection status changes ii nfs-common 1:2.6.2-4 arm64 NFS support files common to client and server ii python3-dbus 1.3.2-4+b1 arm64 simple interprocess messaging system (Python 3 interface) ii systemd 252.19-1~deb12u1 arm64 system and service manager ii systemd-journal-remote 252.19-1~deb12u1 arm64 tools for sending and receiving remote journal logs ii systemd-resolved 252.19-1~deb12u1 arm64 systemd DNS resolver ii systemd-sysv 252.19-1~deb12u1 arm64 system and service manager - SysV compatibility symlinks ii udisks2 2.9.4-4 arm64 D-Bus service to access and manipulate storage devices ii wget 1.21.3-1+b1 arm64 retrieves files from the web ```

Which operating system are you running on?

Debian


Steps to reproduce the issue

  1. Reboot host machine (with a hassio_supervisor container already created. If I stop and delete before reboot, the initial creation is "broken". An additional reboot achieves)

  2. Observer that supervisor comes up reporting as in healthy state;

    odroidn2:~:# uptime
     18:44:13 up 1 min,  1 user,  load average: 1.65, 0.75, 0.28
    odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -2
    24-01-25 18:44:04 INFO (MainThread) [supervisor.resolution.fixup] System autofix complete
    24-01-25 18:44:04 INFO (MainThread) [supervisor.host.manager] Host information reload completed
    odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
    healthy: true
    supported: true
    version: 2023.12.1
    version_latest: 2023.12.1

    optionally: proceed in blissful ignorance as to why/how

  3. Alternatively ruin your sense of well-being by looking into the apparmor status of that-there container

    odroidn2:~:# docker inspect hassio_supervisor -f '{{.HostConfig.SecurityOpt}}'
    [apparmor=hassio-supervisor label=disable]
    docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
        xargs pstree -p |  grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
        xargs ps Z
    LABEL                               PID TTY      STAT   TIME COMMAND
    unconfined                         2503 ?        Ss     0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
    unconfined                         2638 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
    unconfined                         2641 ?        Ss     0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
    unconfined                         2666 ?        S      0:00 s6-supervise s6rc-fdholder
    unconfined                         2667 ?        S      0:00 s6-supervise s6rc-oneshot-runner
    unconfined                         2680 ?        Ss     0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
    unconfined                         2722 ?        S      0:00 s6-supervise supervisor
    unconfined                         2724 ?        S      0:00 s6-supervise watchdog
    unconfined                         2726 ?        Ss     0:00 bash /usr/bin/bashio ./run watchdog
    unconfined                         2727 ?        Ssl    0:06 python3 -m supervisor
    unconfined                         7424 ?        S      0:00 sleep 300
    • 3.5. Confirm that in fact there is such profile actually

      odroidn2:~:# aa-status | grep hassio
         hassio-supervisor
         hassio-supervisor///usr/bin/gdbus
         hassio-supervisor///usr/bin/git
      odroidn2:~:# docker logs hassio_supervisor 2>&1 | grep apparmor
      24-01-25 19:31:42 INFO (MainThread) [supervisor.host.apparmor] Loading AppArmor Profiles: {'hassio-supervisor'}
      odroidn2:~:# head /usr/share/hassio/apparmor/hassio-supervisor
      #include <tunables/global>
      
      profile hassio-supervisor flags=(attach_disconnected,mediate_deleted) {
        #include <abstractions/base>
        #include <abstractions/python>
      
        network,
        deny network raw,
      
        signal (send) set=(kill,term,int,hup,cont),
    • 3.75. Try various methods to simply restart the container [ha supervisor restart, systemctl restart hassio-supervisor or even systemctl stop hassio-supervisor wait a while systemctl start hassio-supervisor ... and oops actually everything was "fine";

      I kinda swear at least once it broke on restart, but I might be misremembering, or hit one of the edges suggested below)

      odroidn2:~:# ha supervisor restart
      Command completed successfully.
      odroidn2:~:# docker ps | grep  hassio_supervisor
      974b16d01e95   ghcr.io/home-assistant/aarch64-hassio-supervisor:latest     "/init"                  14 minutes ago   Up 13 minutes                                                                                                                                  hassio_supervisor
      odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
      healthy: true
      supported: true
      version: 2023.12.1
      version_latest: 2023.12.1

      still "fine" is all the processes remain unconstrained from the apparmor profile intended for the,

  4. Okay, but anything which actually recreates the container (as /usr/sbin/hassio-supervisor will do when the startup-marker (/run/supervisor/startup-marker) is left in place (so that is the supervisor crashes early?), or indeed as outlined above when the profile is missing initially, or on supervisor update; or if simply someone/something runs docker rm on the stopped container)

    So yeah we can do

    systemctl stop hassio-supervisor && touch /run/supervisor/startup-marker && systemctl start hassio-superviosr

    ... for example

    And now we will find our error state of interest

    odroidn2:~:# docker ps | grep hassio_supervisor
    325c2a447241   ghcr.io/home-assistant/aarch64-hassio-supervisor:latest     "/init"                  About a minute ago   Up About a minute                                                                                                                                  hassio_supervisor
    odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
    healthy: false
    supported: true
    version: 2023.12.1
    version_latest: 2023.12.1
    
    odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3
    24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed
    24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
    24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged
    
    odroidn2:~:# docker logs hassio_supervisor 2>&1 | grep 'CRITICAL\|ERROR\|WARNING'
    24-01-25 20:07:42 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
    24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.rauc] Host has no rauc support. OTA updates have been disabled.
    24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.agent] No OS-Agent support on the host. Some Host functions have been disabled.
    24-01-25 20:07:52 CRITICAL (MainThread) [supervisor.core] System is running in an unhealthy state and needs manual intervention!
  5. What has changed? I mean you know, but its

    odroidn2:~:# aa-status | grep hassio-supervisor
       hassio-supervisor
       hassio-supervisor///usr/bin/gdbus
       hassio-supervisor///usr/bin/git
       /package/admin/s6-2.12.0.2/command/s6-svscan (9035) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-supervise (9077) hassio-supervisor
       /package/admin/s6-linux-init-1.1.2.0/command/s6-linux-init-shutdownd (9079) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-supervise (9086) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-supervise (9087) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-ipcserverd (9095) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-supervise (9135) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-supervise (9137) hassio-supervisor
       /usr/local/bin/python3.11 (9138) hassio-supervisor
       /usr/bin/bash (9142) hassio-supervisor
       /bin/busybox (9773) hassio-supervisor

    i.e.

    odroidn2:~:# docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
      xargs pstree -p |  grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
      xargs ps Z
    LABEL                               PID TTY      STAT   TIME COMMAND
    hassio-supervisor (enforce)        9035 ?        Ss     0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
    hassio-supervisor (enforce)        9077 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
    hassio-supervisor (enforce)        9079 ?        Ss     0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
    hassio-supervisor (enforce)        9086 ?        S      0:00 s6-supervise s6rc-fdholder
    hassio-supervisor (enforce)        9087 ?        S      0:00 s6-supervise s6rc-oneshot-runner
    hassio-supervisor (enforce)        9095 ?        Ss     0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
    hassio-supervisor (enforce)        9135 ?        S      0:00 s6-supervise supervisor
    hassio-supervisor (enforce)        9137 ?        S      0:00 s6-supervise watchdog
    hassio-supervisor (enforce)        9138 ?        Ssl    0:03 python3 -m supervisor
    hassio-supervisor (enforce)        9142 ?        Ss     0:00 bash /usr/bin/bashio ./run watchdog
    hassio-supervisor (enforce)        9773 ?        S      0:00 sleep 30
  6. We can go deeper actually. You can see most the same thing by starting with debug and debug_wait and using the and attaching from vscode; but to really get there - lets more awkwardly do

    systemctl stop hassio-supervisor
    docker rm hassio_supervisor
    sh <(sed 's/docker container create/docker run --rm -it --entrypoint \/bin\/bash/' /usr/sbin/hassio-supervisor)

    This should (still[^1]) give you into an interactive shell that is set up like the supervisor, including (which running our docker inspect ... pipeline can verify) the apparmor profile.

    [^1]: When I first investigated this on Tuesday (with docker 25.0.0 installed) it was sufficient to do docker exec -it hassio_supervisor bash to repro as below, but something has changed and now that generates a shell sibling to the pid1 as unrestricted instead of also getting the apparmor profile. Bafflingly at one point Tuesday this behavior seemed to change after I ran systemd-analyze set-log-level debug )

    From here its instructive to do 3 things (restarting the shell between or not as you prefer):

    1. actually run the supervisor (e.g. python3 -m supervisor ) and see it fail same as when deamonized (including CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!)
    2. run udevadm monitor, and discover that it works perfectly (prints the headers and will respond if I go e.g. plug and unplug USB devices)
    3. Attempt my minimal repro case, expanded from the real line via the pyudev source by starting a python repl, then

      import pyudev
      import os
      
      from ctypes import get_errno
      
      c = pyudev.Context()
      m = c._libudev.udev_monitor_new_from_netlink(c, b'kernel')
      
      if bool(m):
        print("successfully got pointer to a `struct udev_monitor`")
      else:
        print("udev_monitor_new_from_netlink(3) returned NULL pointer")
        e = get_errno()
        print(f"Errno set was: {os.strerror(e)} ({e})")

      and like the supervisor, you'll get

      udev_monitor_new_from_netlink(3) returned NULL pointer Errno set was: Permission denied (13)

    Why do these things? Well, it reveals that the apparmor policy is in effect AND is killing udev connect for the supervisor (and the python interpreter in general, and most things).

    How do I know that? Because udevadm still works. It is calling the same function (udev_monitor_new_from_netlink) of the same library (from alpine's eudev-libs 3.2.11-r8): https://github.com/eudev-project/eudev/blob/611b6fbae2cca9ceeacb820befb3a0e8caec88b8/src/udev/udevadm-monitor.c#L214

    (Unless in fact pyudev's wheel is bringing along its own copy of e.g. systemd libudev; but I don't see evidence of that)

    So it follows that the only reason that udevadm still works when the other two don't is because the apparmor policy gives it an unrestricted pass:

    /bin/udevadm Ux,

    where python gets no such


Additional information

So what is actually failing?

the libudev call

it's apparently udev_monitor_new_from_netlink(3) which is unexpectedly returning a NULL pointer and setting errno to 13 ("Access denied")

[^2]: Up to writing this section that didn't click and I was looking at systemd-udev's version ; and it seems possible that the mismatch (with the offered udevd) could be in play as the issue. But the code between it (eudev) and the systemd versions of it and the nested device_monitor_new_full, do not differ greatly, though perhaps systemd's version's if (DEBUG_LOGGING) { gated stanza's may end up instructive]

That's pretty much, at core, 3 syscalls:

  1. access on /run/udev/control - which, ls -l shows it srw------- root root and a stat also passed

  2. socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT)

    Indeed I can just now say with some confidence it is this syscall, because I belatedly discovered that pyudev does hook a few log priority functions and

    >>> import syslog
    >>> c.log_priority = syslog.LOG_DEBUG
    >>> m = c._libudev.udev_monitor_new_from_netlink(c, b'kernel')
    error getting socket: Permission denied

    That error appearing to be generated at https://github.com/eudev-project/eudev/blob/v3.2.11/src/libudev/libudev-monitor.c#L203-L206 :

            if (fd < 0) {
                    udev_monitor->sock = socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT);
                    if (udev_monitor->sock < 0) {
                            log_debug_errno(errno, "error getting socket: %m");
  3. getsockname(udev_monitor->sock, &snl.sa, &addrlen);

So... what's the problem? I don't know. Most of the internet agrees that a net_admin capability ought to clear the way for interfacing with netlink

If I had to guess, I'd point at maybe the:

network, deny network raw,

at https://github.com/home-assistant/version/blob/master/apparmor_stable.txt#L7-L8

hitting up against the SOCK_RAW flag on the call ... but again nothing here has changed recently? or has it?

Anyway ... this turned into 8 hours of investigation & writing today on top of the 5 of investigation and 3 of probs useless PR (#4834) on Tuesday; ADHD be like that.

the rest of the standard bug form stuff ### Anything in the Supervisor logs that might be useful for us? ```txt ... CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor! ... System is running in an unhealthy state and needs manual intervention! #after restart ... CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor! ... (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3 24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed 24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete 24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged ``` ### System Health information ## System Information version | core-2024.1.5 -- | -- installation_type | Home Assistant Supervised dev | false hassio | true docker | true user | root virtualenv | false python_version | 3.11.6 os_name | Linux os_version | 6.1.63-current-meson64 arch | aarch64 timezone | America/Denver config_dir | /config
Home Assistant Community Store GitHub API | ok -- | -- GitHub Content | ok GitHub Web | ok GitHub API Calls Remaining | 5000 Installed Version | 1.34.0 Stage | running Available Repositories | 1380 Downloaded Repositories | 19
Home Assistant Supervisor host_os | Debian GNU/Linux 12 (bookworm) -- | -- update_channel | stable supervisor_version | supervisor-2023.12.1 agent_version | 1.6.0 docker_version | 25.0.1 disk_total | 915.8 GB disk_used | 10.8 GB healthy | failed to load: Unhealthy supported | true supervisor_api | failed to load: unreachable version_api | ok installed_addons | Studio Code Server (5.15.0), ESPHome (2023.12.8), Mosquitto broker (6.4.0), Tailscale (0.14.0), Music Assistant BETA (2.0.0b84), Zigbee2MQTT (1.35.1-1), Govee to MQTT Bridge (2024.01.21-33e8cd7c)
Dashboards dashboards | 1 -- | -- resources | 12 views | 5 mode | storage
Recorder oldest_recorder_run | January 16, 2024 at 9:45 PM -- | -- current_recorder_run | January 25, 2024 at 7:53 PM estimated_db_size | 290.71 MiB database_engine | sqlite database_version | 3.41.2
### Supervisor diagnostics [config_entry-hassio-230a4ddd53de32307921bbc57f86054b.json.txt](https://github.com/home-assistant/supervisor/files/14060283/config_entry-hassio-230a4ddd53de32307921bbc57f86054b.json.txt)
donaldguy commented 5 months ago

Extracting out this bit of ~speculation to its own comment:

why isn't the policy applied on boot?

I didn't know. It might be a kernel bug, or a docker bug, or a systemd bug. Or it might just be a good old fashioned race condition?

At the highest level these are activating in the right order

odroidn2:~:# systemctl status apparmor | grep active
     Active: active (exited) since Thu 2024-01-25 18:42:39 MST; 19min ago
odroidn2:~:# systemctl status hassio-apparmor | grep active
     Active: active (exited) since Thu 2024-01-25 18:42:40 MST; 19min ago
odroidn2:~:# systemctl status hassio-supervisor | grep active
     Active: active (running) since Thu 2024-01-25 18:42:44 MST; 19min ag

this chain is there:

And I can confirm that after a systemctl stop hassio-supervisor && docker rm -f $(docker ps -aq) && aa-remove-unknown && systemctl start hassio-supervisor, you end up at a-different-than-this-issue supervisor restart loop involving such journal lines as

audit[8358]: AVC apparmor="DENIED" operation="change_onexec" info="label not found" error=-2 profile="unconfined" hassio-supervisor" pid=15881 comm="runc:[2:INIT]" audit: type=1400 audit(1706235970.127:70): apparmor="DENIED" operation="change_onexec" info="label not found" error=-2 profile="unconfined" name="hassio-supervisor" pid=15881 comm="runc:[2:INIT]" ... hassio-supervisor[15823]: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: unable to apply apparmor profile: apparmor failed to apply profile: write /proc/self/attr/apparmor/exec: no such file or directory: unknown hassio-supervisor[15823]: Error: failed to start containers: hassio_supervisor

followed by re-pulling the image and trying again fruitlessly in a loop which terminates only when you run systemctl start hassio-apparmor (at which point you end up landing in the profile-applied-to-supervisor-so-it-doesn't work state, until system reboot)

But clearly there is something going (enabalingly) wrong in the application of profile from cold start of an existing hassio_supervisor container

So there may be value (once the profile is not breaking the udev function) in adding to hassio-apparmor.service (and indeed potentially hassos-apparmor.service something like a

ExecStartPre=/usr/bin/aa-enabled -q

and/or indeed fully move the sort of apparmor_parser call which is the heart of these services to be instead a ExecStartPre of the supervisor

voron commented 4 months ago

Having the same issue

24-02-22 15:57:00 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!

Removal of app-armor from supervisor compose file eliminates this error, HA updates work as expected too. Thus I suppose the issue is in app-armor supervisor's profile, as originally reported.

donaldguy commented 4 months ago

specifically, as I sort of said, believe the issue with the policy is,

for some reason now-and-not-before, the interpretation of deny network raw https://github.com/home-assistant/version/blob/master/apparmor_stable.txt#L8

now applying also to netlink domain, meaning the

I don't believe just adding an allow network netlink raw will work ... (but maybe it will with the right ordering?)

I think giving it a cut out will require instead saying like...

deny network inet raw,
deny network ax25 raw,
deny network ipx raw,
deny network appletalk raw,
deny network netrom raw,
deny network bridge raw,
deny network atpmvc raw,
deny network x25 raw,
deny network inet6 raw,
deny network rose raw,
deny network netbui raw,
deny network security raw,
deny network key raw,
deny network packet raw,
deny network ash raw,
deny network econet raw,
deny network atmsvc raw,
deny network sna raw,
deny network irda raw,
deny network ppox raw,
deny network wanpipe raw,
deny network bluetooth raw,
deny network unix raw,
deny network rds raw,
deny network llc raw,
deny network can raw,
deny network tipc raw,
deny network iucy raw,
deny network rxrpc raw,
deny network isdn raw,
deny network phonet raw,
deny network ieee802154 raw,
deny network caif raw,
deny network alg raw,
deny network nfc raw,
deny network vsock raw,
deny network mpls raw,
deny network ib raw,

Of course ... the logical question there is ... does it actually need non-raw access to most of those domains? (probs no)

so probs the sweeping [allow] network [] [] [*] of line 7 probs should go

maybe want like(?)

allow network inet tcp,
allow network inet6 tcp,
allow network bluetooth,
allow network netlink,

(maybe bridge? unix?)

github-actions[bot] commented 3 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

agners commented 2 months ago

Thanks for this in depth analyzes!

I was able to reproduce the original symptom (Supervisor getting unhealthy) on Supervised on a Debian 12 Bookworm.

I then was actually wondering if I can reproduce this with HAOS 12.2.rc1, specifically because it comes with new Buildroot which pushes it closer to what Debian Bookworm is using (specifically, Docker 25). However, I was not able to reproduce the problem on HAOS 12.2.rc1.

Which really made me question, what is different between Debian 12 and HAOS 12.2.rc1 :thinking:

Just like in Supervised, the AppArmor profile only gets applied on initial Supervisor container start on HAOS 12.2.rc1 as well. So this part seems to (mis)behave exactly the same way.

HAOS uses AppArmor 3.1.2 already in HAOS 11, no update has taken place there. Debian 12 uses 3.0.8. I looked through the changes between the two versions, but nothing caught my eyes.

So far I am a bit clueless :cry:

I feel it should always have not worked (as in, deny network raw should always lead to deny network netlink raw).

In any case, I think applying an explicit allow protocols as you suggested make sense. We don't need Bluetooth in Supervisor, so I think I'd go with:

network inet  stream,
network inet6 stream,
network inet  dgram,
network inet6 dgram,
network netlink raw,
agners commented 2 months ago

I am still lost as to why Debian behaves different than HAOS here :cry:

A few things to note: Docker exec seems to not apply the AppArmor profile:

hassio-supervisor (enforce)     root       17131  0.2 13.8 192868 136592 ?       Ssl  15:32   0:00 python3 -m supervisor
hassio-supervisor (enforce)     root       17132  0.0  0.3   4160  3516 ?        Ss   15:32   0:00 bash /usr/bin/bashio ./run watchdog
hassio-supervisor (enforce)     root       17151  0.0  0.0   1612     4 ?        S    15:32   0:00 sleep 300
unconfined                      root       17325  0.0  0.4  10008  4488 pts/1    S+   15:33   0:00 sudo docker exec -it hassio_supervisor /bin/bash
unconfined                      root       17326  0.0  0.0  10008   480 pts/6    Ss   15:33   0:00 sudo docker exec -it hassio_supervisor /bin/bash
unconfined                      root       17327  0.0  2.8 1770644 28048 pts/6   Sl+  15:33   0:00 docker exec -it hassio_supervisor /bin/bash
unconfined                      root       17342  0.0  0.2   2616  2288 pts/0    Ss   15:33   0:00 /bin/bash

Hence my udevadm command always worked, even with the uncorrected profile.

The AppArmor rules processing seems not very well defined to me, especially for networking. The syntax is:

NETWORK RULE = [ QUALIFIERS ] 'network' [ DOMAIN ] [ TYPE | PROTOCOL ]

The two rules so far:

network,
deny network raw,

Again, technically allowed all networking, but then disallowed all raw protocol for all domains, even for netlink.

In any case, raw seems to be disallowed by default, so allowing only netlink explicitly seems to be the way to go. It seems that unix domain is also allowed by default. I did not found where this defaults are specified. But it is probably better to enable it explicitly too, so I end up with:

network unix stream,
network inet stream,
network inet6 stream,
network inet dgram,
network inet6 dgram,
network netlink raw,

I'll create a PR to update the profile, for dev and beta channel first.

RichardPar commented 1 month ago

just to confirm I was bitten by the above deny rule. I changed to what @agners suggested and all is healthy.