kairos-io / kairos

:penguin: The immutable Linux meta-distribution for edge Kubernetes.
https://kairos.io
Apache License 2.0

Users missing on /etc/passwd #2488

Closed: nianyush closed this issue 4 months ago

nianyush commented 6 months ago

Kairos version 2.4.5

Apr 18 18:28:25 localhost systemd[1]: Stopped OpenBSD Secure Shell server.
Apr 18 18:28:25 localhost systemd[1]: Starting OpenBSD Secure Shell server...
Apr 18 18:28:25 localhost sshd[1618]: Privilege separation user sshd does not exist
Apr 18 18:28:25 localhost systemd[1]: ssh.service: Control process exited, code=exited, status=255/EXCEPTION
Apr 18 18:28:25 localhost systemd[1]: ssh.service: Failed with result 'exit-code'.
Apr 18 18:28:25 localhost systemd[1]: Failed to start OpenBSD Secure Shell server.

[screenshot]

This seems to be a rare corner case for us; I have only seen it once.

Config file that triggered this:

#cloud-config

cosign: false
install:
    auto: true
    device: auto
    grub-entry-name: Palette eXtended Kubernetes Edge
    grub_options:
        saved_entry: registration
    passive:
        size: 8192
    poweroff: true
    reboot: false
    recovery-system:
        size: 10000
    system:
        size: 8192
reset:
    grub-entry-name: Palette eXtended Kubernetes Edge
    system:
        size: 8192
stages:
    after-upgrade:
        - commands:
            - mkdir -p /usr/local/bin
            - '[ -L /usr/local/bin/agent-provider-stylus ] || ln -s /opt/spectrocloud/bin/agent-provider-stylus /usr/local/bin/agent-provider-stylus'
            - '[ -L /usr/local/bin/palette-tui ] || ln -s /opt/spectrocloud/bin/palette-tui /usr/local/bin/palette-tui'
            - bash /opt/spectrocloud/scripts/content.sh
          name: Execute after upgrade commands
        - commands:
            - grub2-editenv /oem/grubenv unset saved_entry
          if: '[ -f /oem/grubenv ]'
          name: Unset registration grubenv
    before-install:
        - commands:
            - echo -n > /etc/machine-id
          if: '[ -f "/run/cos/live_mode" ]'
          name: Remove machine-id
        - commands:
            - echo -n > /etc/machine-id
          if: '[ -f "/run/cos/live_mode" ]'
          name: Remove machine-id
    initramfs:
        - name: Create kairos user
          users:
            kairos:
                groups:
                    - sudo
                passwd: kairos
        - users:
            kairos:
                groups:
                    - sudo
                passwd: kairos
        - name: Create kairos user
          users:
            kairos:
                groups:
                    - sudo
                passwd: kairos
        - users:
            kairos:
                groups:
                    - sudo
                passwd: kairos
        - if: '[ ! -f /oem/80_stylus.yaml ]'
          name: set_inotify_max_values
          sysctl:
            fs.inotify.max_user_instances: "8192"
            fs.inotify.max_user_watches: "524288"
stylus:
    includeTui: false
    installationMode: airgap
    skipStylusUpgrade: true
upgrade:
    grub-entry-name: Palette eXtended Kubernetes Edge
    passive:
        size: 8192
    recovery-system:
        size: 8192
    system:
        size: 8192
verify: false

ci-robbot commented 6 months ago

Hello nianyush,

I'm a bot, an experiment of @mudler and @jimmykarily. Thank you for reporting this issue with Kairos version 2.4.5. I can see that the sshd service is failing to start after boot and that the sshd user does not exist in /etc/passwd. This information is helpful in diagnosing the problem.

Please consider providing the following additional details to help us better understand and reproduce the issue:

  1. Are there any relevant logs or configuration files that could help us understand the issue better?
  2. Have there been any recent updates or changes made to your system that could have caused this issue?

Once we have all the necessary information, we will be able to better assess the issue and take appropriate action. We appreciate your patience and cooperation.

Best regards, Kairos Bot

jimmykarily commented 6 months ago

This might be related to this: https://github.com/kairos-io/kairos/issues/2492

Itxaka commented 6 months ago

Missing here is the info that the sshd user did indeed disappear from /etc/passwd.

mauromorales commented 6 months ago

@nianyush can you confirm that this is the same issue you experienced with the fully deleted /etc/passwd after a kairos-agent upgrade?

nianyush commented 6 months ago

@mauromorales yes, it's exactly the same issue. Only the sshd user is lost from /etc/passwd; the rest looks fine.

nianyush commented 6 months ago

I encountered this issue again yesterday with Kairos v3.0.6 in UKI mode. After running kairos-agent upgrade with a new image and then rebooting, I cannot ssh into the VM anymore.
[three screenshots]

mauromorales commented 6 months ago

@nianyush thanks for the extra info. Which version did you upgrade to? The same one?

nianyush commented 6 months ago

@mauromorales yes, there is no difference in the Kairos or OS version.

mauromorales commented 6 months ago

@nianyush do you by any chance still have this system online? If so, could you share the mounts?

nianyush commented 6 months ago

Yes, I still have one of the systems, the one on 3.0.6 with the UKI image.

mauromorales commented 6 months ago

While this was detected because of the lack of ssh access, the issue is not related to sshd but to the fact that a bunch of users are missing from /etc/passwd. I've renamed the ticket to reflect this.

Itxaka commented 6 months ago

@nianyush if you still have access to the machines, would it be possible to extract the logs from them? Kairos logs, journalctl logs, and immucore+stages logs would all be very useful. Especially from the 2.4.5 machine, since we have access to the original qcow2 file and can try to reproduce.

Also, is there any metadata attached to the machine? cdrom/usb with a config drive?

mauromorales commented 6 months ago

@nianyush also, if possible, can you check for any other units breaking in systemd? And is it possible to compare the user list against a system that is working correctly? I want to validate my previous comment.
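
For anyone wanting to do the user-list comparison requested above, here is a minimal Go sketch. It assumes you have copied the contents of /etc/passwd from a working machine and from a broken one; the function names and the sample entries (including the sshd UID) are made up for illustration and are not taken from the affected systems.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// userNames extracts the login names (first colon-separated field) from
// passwd-formatted text, skipping blank lines.
func userNames(passwd string) map[string]bool {
	names := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(passwd))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		name, _, _ := strings.Cut(line, ":")
		names[name] = true
	}
	return names
}

// missingUsers returns, sorted, the names present in good but absent from broken.
func missingUsers(good, broken string) []string {
	have := userNames(broken)
	var missing []string
	for name := range userNames(good) {
		if !have[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical sample data; on real machines, read /etc/passwd from each host.
	good := "root:x:0:0::/root:/bin/bash\n" +
		"sshd:x:106:65534::/run/sshd:/usr/sbin/nologin\n" +
		"kairos:x:1000:1000::/home/kairos:/bin/sh\n"
	broken := "root:x:0:0::/root:/bin/bash\n" +
		"kairos:x:1000:1000::/home/kairos:/bin/sh\n"

	fmt.Println("missing on broken system:", missingUsers(good, broken))
}
```

Swapping the arguments would list users that exist only on the broken system, which could help tell a partially rewritten file from a fully regenerated one.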

mauromorales commented 6 months ago

Managed to reproduce after several runs

root@localhost:/home/kairos# cat /etc/os-release
PRETTY_NAME="Ubuntu 23.10"
NAME="Ubuntu"
VERSION_ID="23.10"
VERSION="23.10 (Mantic Minotaur)"
VERSION_CODENAME=mantic
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=mantic
LOGO=ubuntu-logo
KAIROS_IMAGE_LABEL="23.10-core-amd64-generic-v3.0.7"
KAIROS_ARTIFACT="kairos-ubuntu-23.10-core-amd64-generic-v3.0.7"
KAIROS_FLAVOR="ubuntu"
KAIROS_MODEL="generic"
KAIROS_BUG_REPORT_URL="https://github.com/kairos-io/kairos/issues"
KAIROS_VARIANT="core"
KAIROS_TARGETARCH="amd64"
KAIROS_ID="kairos"
KAIROS_NAME="kairos-core-ubuntu-23.10"
KAIROS_VERSION="v3.0.7"
KAIROS_PRETTY_NAME="kairos-core-ubuntu-23.10 v3.0.7"
KAIROS_IMAGE_REPO="quay.io/kairos/ubuntu:23.10-core-amd64-generic-v3.0.7"
KAIROS_FAMILY="ubuntu"
KAIROS_REGISTRY_AND_ORG="quay.io/kairos"
KAIROS_VERSION_ID="v3.0.7"
KAIROS_FLAVOR_RELEASE="23.10"
KAIROS_RELEASE="v3.0.7"
KAIROS_HOME_URL="https://github.com/kairos-io/kairos"
KAIROS_SOFTWARE_VERSION_PREFIX="k3s"
KAIROS_ID_LIKE="kairos-core-ubuntu-23.10"
KAIROS_GITHUB_REPO="kairos-io/kairos"
root@localhost:/home/kairos# cat /etc/passwd
kairos:x:1000:65538:Created by entities:/home/kairos:/bin/sh
root:x:0:0::/root:/bin/bash
daemon:x:1:1::/usr/sbin:/usr/sbin/nologin
bin:x:2:2::/bin:/usr/sbin/nologin
sys:x:3:3::/dev:/usr/sbin/nologin
sync:x:4:65534::/bin:/bin/sync
games:x:5:60::/usr/games:/usr/sbin/nologin
man:x:6:12::/var/cache/man:/usr/sbin/nologin
lp:x:7:7::/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8::/var/mail:/usr/sbin/nologin
news:x:9:9::/var/spool/news:/usr/sbin/nologin
uucp:x:10:10::/var/spool/uucp:/usr/sbin/nologin
proxy:x:13:13::/bin:/usr/sbin/nologin
www-data:x:33:33::/var/www:/usr/sbin/nologin
backup:x:34:34::/var/backups:/usr/sbin/nologin
list:x:38:38::/var/list:/usr/sbin/nologin
irc:x:39:39::/run/ircd:/usr/sbin/nologin
_apt:x:42:65534::/nonexistent:/usr/sbin/nologin
nobody:x:65534:65534::/nonexistent:/usr/sbin/nologin
messagebus:x:109:109:System Message Bus:/:/usr/sbin/nologin
polkitd:x:996:996:polkit:/nonexistent:/usr/sbin/nologin
systemd-network:x:998:998:systemd Network Management:/:/usr/sbin/nologin
systemd-resolve:x:995:995:systemd Resolver:/:/usr/sbin/nologin
systemd-timesync:x:997:997:systemd Time Synchronization:/:/usr/sbin/nologin
root@localhost:/home/kairos# ./yip -a -s initramfs /oem/90_custom.yaml
INFO[0000] yip version v1.6.1-g9484451dac23973ab3cd8a76df42edb2415f7f3e 2024-04-22 13:12:56 UTC
INFO[0000] 1.
INFO[0000]  <init> (background: false) (weak: false)
INFO[0000] 2.
INFO[0000]  </oem/90_custom.yaml.Create kairos user> (background: false) (weak: true)
INFO[0000] 3.
INFO[0000]  </oem/90_custom.yaml.1> (background: false) (weak: true)
INFO[0000]  </oem/90_custom.yaml.3> (background: false) (weak: true)
INFO[0000] 4.
INFO[0000]  </oem/90_custom.yaml.Create kairos user.1> (background: false) (weak: true)
INFO[0000]  </oem/90_custom.yaml.set_inotify_max_values> (background: false) (weak: true)

mauromorales commented 6 months ago

Looking at the previous yip analysis, it seems that two of the user creation steps get executed in parallel, when normally they should run serially (they all touch the /etc/passwd file and there's no mutex mechanism as far as I can tell).

I think this somehow comes from having those "name: Create kairos user" attributes. With a config that has 4 duplicated users but no names, each one gets evaluated as its own step, i.e. serially:

root@localhost:/home/kairos# ./yip -a -s initramfs no-name.yaml
INFO[0000] yip version v1.6.1-g9484451dac23973ab3cd8a76df42edb2415f7f3e 2024-04-22 13:12:56 UTC
INFO[0000] 1.
INFO[0000]  <init> (background: false) (weak: false)
INFO[0000] 2.
INFO[0000]  <no-name.yaml.0> (background: false) (weak: true)
INFO[0000] 3.
INFO[0000]  <no-name.yaml.1> (background: false) (weak: true)
INFO[0000] 4.
INFO[0000]  <no-name.yaml.2> (background: false) (weak: true)
INFO[0000] 5.
INFO[0000]  <no-name.yaml.3> (background: false) (weak: true)
INFO[0000] 6.
INFO[0000]  <no-name.yaml.set_inotify_max_values> (background: false) (weak: true)

mauromorales commented 6 months ago

I think the problem comes from the duplicated name of those user creation steps. If the names are different, the analysis is similar to the one in the previous comment.

However, because the names are the same, yip uses the name as the identifier of the dependency when it starts adding dependencies, and inverting the graph then groups those steps together.

See how the dump below has a ([]herd.GraphEntry) (len=2 cap=2) { at some point, grouping those two steps, which never happens when there are no names:

([][]herd.GraphEntry) (len=4 cap=4) {
 ([]herd.GraphEntry) (len=1 cap=1) {
  (herd.GraphEntry) {
   WithCallback: (bool) false,
   Background: (bool) false,
   Callback: ([]func(context.Context) error) <nil>,
   Error: (error) <nil>,
   Ignored: (bool) false,
   Fatal: (bool) false,
   WeakDeps: (bool) false,
   Executed: (bool) false,
   Name: (string) (len=4) "init",
   Dependencies: ([]string) <nil>,
   WeakDependencies: ([]string) <nil>
  }
 },
 ([]herd.GraphEntry) (len=1 cap=1) {
  (herd.GraphEntry) {
   WithCallback: (bool) true,
   Background: (bool) false,
   Callback: ([]func(context.Context) error) (len=1 cap=1) {
    (func(context.Context) error) 0xadb1c0
   },
   Error: (error) <nil>,
   Ignored: (bool) false,
   Fatal: (bool) false,
   WeakDeps: (bool) true,
   Executed: (bool) false,
   Name: (string) (len=42) "/some/yip/01_first.yaml.Create Kairos User",
   Dependencies: ([]string) <nil>,
   WeakDependencies: ([]string) <nil>
  }
 },
 ([]herd.GraphEntry) (len=2 cap=2) {
  (herd.GraphEntry) {
   WithCallback: (bool) true,
   Background: (bool) false,
   Callback: ([]func(context.Context) error) (len=1 cap=1) {
    (func(context.Context) error) 0xadb1c0
   },
   Error: (error) <nil>,
   Ignored: (bool) false,
   Fatal: (bool) false,
   WeakDeps: (bool) true,
   Executed: (bool) false,
   Name: (string) (len=25) "/some/yip/01_first.yaml.1",
   Dependencies: ([]string) (len=1 cap=1) {
    (string) (len=42) "/some/yip/01_first.yaml.Create Kairos User"
   },
   WeakDependencies: ([]string) <nil>
  },
  (herd.GraphEntry) {
   WithCallback: (bool) true,
   Background: (bool) false,
   Callback: ([]func(context.Context) error) (len=1 cap=1) {
    (func(context.Context) error) 0xadb1c0
   },
   Error: (error) <nil>,
   Ignored: (bool) false,
   Fatal: (bool) false,
   WeakDeps: (bool) true,
   Executed: (bool) false,
   Name: (string) (len=25) "/some/yip/01_first.yaml.3",
   Dependencies: ([]string) (len=1 cap=1) {
    (string) (len=42) "/some/yip/01_first.yaml.Create Kairos User"
   },
   WeakDependencies: ([]string) <nil>
  }
 },
 ([]herd.GraphEntry) (len=1 cap=1) {
  (herd.GraphEntry) {
   WithCallback: (bool) true,
   Background: (bool) false,
   Callback: ([]func(context.Context) error) (len=1 cap=1) {
    (func(context.Context) error) 0xadb1c0
   },
   Error: (error) <nil>,
   Ignored: (bool) false,
   Fatal: (bool) false,
   WeakDeps: (bool) true,
   Executed: (bool) false,
   Name: (string) (len=44) "/some/yip/01_first.yaml.Create Kairos User.1",
   Dependencies: ([]string) (len=1 cap=1) {
    (string) (len=25) "/some/yip/01_first.yaml.1"
   },
   WeakDependencies: ([]string) <nil>
  }
 }
}

So in the end both "/some/yip/01_first.yaml.1" and "/some/yip/01_first.yaml.3" have "/some/yip/01_first.yaml.Create Kairos User" as a dependency, which places them in the same group.
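
The grouping above can be reproduced with a small Go model. This is a simplified sketch under assumptions, not yip's or herd's actual code: it assumes each step depends on the previous one, that a colliding name gets a numeric suffix as its unique ID, and that the bug amounts to recording the dependency under the previous step's declared name instead of its unique ID. The `layers` function and its flag are hypothetical.

```go
package main

import "fmt"

// layers gives every step a unique ID and a dependency on the previous step,
// then groups IDs by dependency depth. IDs that share a group could be run
// in parallel by a scheduler.
//
// If depByDeclaredName is true, the dependency is recorded under the previous
// step's declared name (the assumed bug). A duplicated name then resolves to
// the first step that used it, not to the step that actually came before.
func layers(declared []string, depByDeclaredName bool) [][]string {
	depth := map[string]int{} // unique ID -> layer index
	var out [][]string
	prevRef := "" // how the next step refers to this one

	for i, name := range declared {
		label := name
		if label == "" {
			label = fmt.Sprint(i) // unnamed steps are identified by position
		}

		// Make the ID unique by appending .1, .2, ... on collision.
		id := label
		for n := 1; ; n++ {
			if _, taken := depth[id]; !taken {
				break
			}
			id = fmt.Sprintf("%s.%d", label, n)
		}

		d := 0
		if i > 0 {
			d = depth[prevRef] + 1
		}
		depth[id] = d

		for len(out) <= d {
			out = append(out, nil)
		}
		out[d] = append(out[d], id)

		if depByDeclaredName {
			prevRef = label
		} else {
			prevRef = id
		}
	}
	return out
}

func main() {
	steps := []string{"Create kairos user", "", "Create kairos user", "", "set_inotify_max_values"}

	fmt.Println("dependency by declared name:", layers(steps, true))
	fmt.Println("dependency by unique ID:    ", layers(steps, false))
}
```

With the flag set, the model puts steps 1 and 3 in the same layer, mirroring the len=2 group in the dump. With the flag cleared, every step lands in its own layer, as in the no-name analysis.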

nianyush commented 6 months ago

Glad to hear you can reproduce it! I don't have that 2.4.5 VM anymore :(

mauromorales commented 6 months ago

@nianyush thanks a lot for opening the issue and providing all the info, though. If it weren't for your screenshots, I wouldn't have had any idea where to look.

mauromorales commented 6 months ago

The fix for this issue is now in the latest Kairos release (artifacts are still building, give it a couple of hours).
Kairos: https://github.com/kairos-io/kairos/releases/tag/v3.0.8
Agent: https://github.com/kairos-io/kairos-agent/releases/tag/v2.8.13

nianyush commented 5 months ago

@mauromorales @Itxaka I'm facing this issue again with Kairos 3.0.10.

mauromorales commented 5 months ago

@nianyush could you paste the system's /etc/os-release and the payload from the provider?

mudler commented 5 months ago

This is still an issue.

mauromorales commented 5 months ago

The backported version of the agent depends on the right yip version, but unfortunately we didn't bump it in the immucore backports, only on the main branch.

mauromorales commented 4 months ago

This seems to be fixed in the latest release.