Closed nianyush closed 4 months ago
Hello nianyush,
I'm a bot, an experiment of @mudler and @jimmykarily. Thank you for reporting this issue with Kairos version 2.4.5. I can see that the sshd service is failing to start after boot and that the sshd user does not exist in /etc/passwd. This information is helpful in diagnosing the problem.
Please consider providing the following additional details to help us better understand and reproduce the issue:
Once we have all the necessary information, we will be able to better assess the issue and take appropriate action. We appreciate your patience and cooperation.
Best regards, Kairos Bot
This might be related to this: https://github.com/kairos-io/kairos/issues/2492
Missing here is the info that the sshd user did indeed disappear from /etc/passwd.
@nianyush can you confirm that this is the same issue you experienced with the fully deleted /etc/passwd after a kairos-agent upgrade?
@mauromorales yes, it's exactly the same issue. Only the sshd user is lost from /etc/passwd; the rest looks fine.
Encountered this issue again yesterday with Kairos v3.0.6 in UKI mode. After doing a kairos-agent upgrade with a new image and then rebooting, I cannot SSH into the VM anymore.
@nianyush thanks for the extra info. Which version did you upgrade to? The same one?
@mauromorales yes, there is no difference in the Kairos or OS version.
@nianyush do you by any chance still have this system online? If so, could you share the mounts?
Yes, I still have one of the systems, from 3.0.6 with UKI.
While this was detected because of the lack of SSH access, the issue is not related to sshd itself, but to the fact that a bunch of users are missing from /etc/passwd. I've renamed the ticket to reflect this.
@nianyush if you still have access to the machines, would it be possible to extract the logs from them? Kairos logs, journalctl logs, and immucore+stages logs would all be very useful. Especially from the 2.4.5 system, for which we have the original qcow2 file, so we can try to reproduce.
Also, is there any metadata attached to the machine? cdrom/usb with a config drive?
@nianyush also, if possible, can you check for any other units breaking in systemd? And is it possible to compare the user list vs. a system that is working correctly? I want to validate my previous comment.
Managed to reproduce after several runs:
root@localhost:/home/kairos# cat /etc/os-release
PRETTY_NAME="Ubuntu 23.10"
NAME="Ubuntu"
VERSION_ID="23.10"
VERSION="23.10 (Mantic Minotaur)"
VERSION_CODENAME=mantic
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=mantic
LOGO=ubuntu-logo
KAIROS_IMAGE_LABEL="23.10-core-amd64-generic-v3.0.7"
KAIROS_ARTIFACT="kairos-ubuntu-23.10-core-amd64-generic-v3.0.7"
KAIROS_FLAVOR="ubuntu"
KAIROS_MODEL="generic"
KAIROS_BUG_REPORT_URL="https://github.com/kairos-io/kairos/issues"
KAIROS_VARIANT="core"
KAIROS_TARGETARCH="amd64"
KAIROS_ID="kairos"
KAIROS_NAME="kairos-core-ubuntu-23.10"
KAIROS_VERSION="v3.0.7"
KAIROS_PRETTY_NAME="kairos-core-ubuntu-23.10 v3.0.7"
KAIROS_IMAGE_REPO="quay.io/kairos/ubuntu:23.10-core-amd64-generic-v3.0.7"
KAIROS_FAMILY="ubuntu"
KAIROS_REGISTRY_AND_ORG="quay.io/kairos"
KAIROS_VERSION_ID="v3.0.7"
KAIROS_FLAVOR_RELEASE="23.10"
KAIROS_RELEASE="v3.0.7"
KAIROS_HOME_URL="https://github.com/kairos-io/kairos"
KAIROS_SOFTWARE_VERSION_PREFIX="k3s"
KAIROS_ID_LIKE="kairos-core-ubuntu-23.10"
KAIROS_GITHUB_REPO="kairos-io/kairos"
root@localhost:/home/kairos# cat /etc/passwd
kairos:x:1000:65538:Created by entities:/home/kairos:/bin/sh
root:x:0:0::/root:/bin/bash
daemon:x:1:1::/usr/sbin:/usr/sbin/nologin
bin:x:2:2::/bin:/usr/sbin/nologin
sys:x:3:3::/dev:/usr/sbin/nologin
sync:x:4:65534::/bin:/bin/sync
games:x:5:60::/usr/games:/usr/sbin/nologin
man:x:6:12::/var/cache/man:/usr/sbin/nologin
lp:x:7:7::/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8::/var/mail:/usr/sbin/nologin
news:x:9:9::/var/spool/news:/usr/sbin/nologin
uucp:x:10:10::/var/spool/uucp:/usr/sbin/nologin
proxy:x:13:13::/bin:/usr/sbin/nologin
www-data:x:33:33::/var/www:/usr/sbin/nologin
backup:x:34:34::/var/backups:/usr/sbin/nologin
list:x:38:38::/var/list:/usr/sbin/nologin
irc:x:39:39::/run/ircd:/usr/sbin/nologin
_apt:x:42:65534::/nonexistent:/usr/sbin/nologin
nobody:x:65534:65534::/nonexistent:/usr/sbin/nologin
messagebus:x:109:109:System Message Bus:/:/usr/sbin/nologin
polkitd:x:996:996:polkit:/nonexistent:/usr/sbin/nologin
systemd-network:x:998:998:systemd Network Management:/:/usr/sbin/nologin
systemd-resolve:x:995:995:systemd Resolver:/:/usr/sbin/nologin
systemd-timesync:x:997:997:systemd Time Synchronization:/:/usr/sbin/nologin
root@localhost:/home/kairos# ./yip -a -s initramfs /oem/90_custom.yaml
INFO[0000] yip version v1.6.1-g9484451dac23973ab3cd8a76df42edb2415f7f3e 2024-04-22 13:12:56 UTC
INFO[0000] 1.
INFO[0000] <init> (background: false) (weak: false)
INFO[0000] 2.
INFO[0000] </oem/90_custom.yaml.Create kairos user> (background: false) (weak: true)
INFO[0000] 3.
INFO[0000] </oem/90_custom.yaml.1> (background: false) (weak: true)
INFO[0000] </oem/90_custom.yaml.3> (background: false) (weak: true)
INFO[0000] 4.
INFO[0000] </oem/90_custom.yaml.Create kairos user.1> (background: false) (weak: true)
INFO[0000] </oem/90_custom.yaml.set_inotify_max_values> (background: false) (weak: true)
Looking at the previous yip analysis, it seems like two of the user-creation steps get executed in parallel, when normally they should be done serially (they are all touching the /etc/passwd file and there's no mutex mechanism as far as I can tell).
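For context, a hypothetical cloud-config that would produce stage names like the ones in the yip analysis above (the user stanzas and values are illustrative, not the reporter's actual config) could look like this:

```yaml
#cloud-config
stages:
  initramfs:
    # Two stanzas share the same name and two have no name at all;
    # from a layout like this, yip derives step names such as
    # "Create kairos user", "1", "3" and "Create kairos user.1".
    - name: "Create kairos user"
      users:
        kairos:
          passwd: "kairos"
    - users:
        kairos:
          passwd: "kairos"
    - name: "Create kairos user"
      users:
        kairos:
          passwd: "kairos"
    - users:
        kairos:
          passwd: "kairos"
```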
I think this somehow comes from having those name: Create kairos user attributes. On a config with 4 duplicated users but without a name attribute, they each get evaluated as a separate step, i.e. in serial:
root@localhost:/home/kairos# ./yip -a -s initramfs no-name.yaml
INFO[0000] yip version v1.6.1-g9484451dac23973ab3cd8a76df42edb2415f7f3e 2024-04-22 13:12:56 UTC
INFO[0000] 1.
INFO[0000] <init> (background: false) (weak: false)
INFO[0000] 2.
INFO[0000] <no-name.yaml.0> (background: false) (weak: true)
INFO[0000] 3.
INFO[0000] <no-name.yaml.1> (background: false) (weak: true)
INFO[0000] 4.
INFO[0000] <no-name.yaml.2> (background: false) (weak: true)
INFO[0000] 5.
INFO[0000] <no-name.yaml.3> (background: false) (weak: true)
INFO[0000] 6.
INFO[0000] <no-name.yaml.set_inotify_max_values> (background: false) (weak: true)
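A sketch of what such a no-name.yaml could contain (hypothetical; four duplicated user stanzas without a name field, which yip then names by index 0-3):

```yaml
#cloud-config
stages:
  initramfs:
    # No "name:" on any stanza, so yip falls back to indices.
    - users:
        kairos:
          passwd: "kairos"
    - users:
        kairos:
          passwd: "kairos"
    - users:
        kairos:
          passwd: "kairos"
    - users:
        kairos:
          passwd: "kairos"
```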
I think the problem comes from the duplicated name of those user-creation stages: if the names are different, the analysis is similar to the one in the previous comment. However, because the names are the same, when yip starts adding dependencies it uses the name as the identifier of the dependency, which, when inverting the graph, groups them together. See how below we have a ([]herd.GraphEntry) (len=2 cap=2) { at some point, grouping those two, which never happens when there are no names:
([][]herd.GraphEntry) (len=4 cap=4) {
([]herd.GraphEntry) (len=1 cap=1) {
(herd.GraphEntry) {
WithCallback: (bool) false,
Background: (bool) false,
Callback: ([]func(context.Context) error) <nil>,
Error: (error) <nil>,
Ignored: (bool) false,
Fatal: (bool) false,
WeakDeps: (bool) false,
Executed: (bool) false,
Name: (string) (len=4) "init",
Dependencies: ([]string) <nil>,
WeakDependencies: ([]string) <nil>
}
},
([]herd.GraphEntry) (len=1 cap=1) {
(herd.GraphEntry) {
WithCallback: (bool) true,
Background: (bool) false,
Callback: ([]func(context.Context) error) (len=1 cap=1) {
(func(context.Context) error) 0xadb1c0
},
Error: (error) <nil>,
Ignored: (bool) false,
Fatal: (bool) false,
WeakDeps: (bool) true,
Executed: (bool) false,
Name: (string) (len=42) "/some/yip/01_first.yaml.Create Kairos User",
Dependencies: ([]string) <nil>,
WeakDependencies: ([]string) <nil>
}
},
([]herd.GraphEntry) (len=2 cap=2) {
(herd.GraphEntry) {
WithCallback: (bool) true,
Background: (bool) false,
Callback: ([]func(context.Context) error) (len=1 cap=1) {
(func(context.Context) error) 0xadb1c0
},
Error: (error) <nil>,
Ignored: (bool) false,
Fatal: (bool) false,
WeakDeps: (bool) true,
Executed: (bool) false,
Name: (string) (len=25) "/some/yip/01_first.yaml.1",
Dependencies: ([]string) (len=1 cap=1) {
(string) (len=42) "/some/yip/01_first.yaml.Create Kairos User"
},
WeakDependencies: ([]string) <nil>
},
(herd.GraphEntry) {
WithCallback: (bool) true,
Background: (bool) false,
Callback: ([]func(context.Context) error) (len=1 cap=1) {
(func(context.Context) error) 0xadb1c0
},
Error: (error) <nil>,
Ignored: (bool) false,
Fatal: (bool) false,
WeakDeps: (bool) true,
Executed: (bool) false,
Name: (string) (len=25) "/some/yip/01_first.yaml.3",
Dependencies: ([]string) (len=1 cap=1) {
(string) (len=42) "/some/yip/01_first.yaml.Create Kairos User"
},
WeakDependencies: ([]string) <nil>
}
},
([]herd.GraphEntry) (len=1 cap=1) {
(herd.GraphEntry) {
WithCallback: (bool) true,
Background: (bool) false,
Callback: ([]func(context.Context) error) (len=1 cap=1) {
(func(context.Context) error) 0xadb1c0
},
Error: (error) <nil>,
Ignored: (bool) false,
Fatal: (bool) false,
WeakDeps: (bool) true,
Executed: (bool) false,
Name: (string) (len=44) "/some/yip/01_first.yaml.Create Kairos User.1",
Dependencies: ([]string) (len=1 cap=1) {
(string) (len=25) "/some/yip/01_first.yaml.1"
},
WeakDependencies: ([]string) <nil>
}
}
}
So in the end, both Name: (string) (len=25) "/some/yip/01_first.yaml.1" and Name: (string) (len=25) "/some/yip/01_first.yaml.3" have (string) (len=42) "/some/yip/01_first.yaml.Create Kairos User" as a dependency.
Glad to hear you can reproduce it! I don't have that 2.4.5 VM anymore :(
@nianyush thanks a lot for opening the issue and providing all the info, though; if it weren't for your screenshots I wouldn't have known where to look.
The fix for this issue is now in the latest Kairos release (artifacts are building, give it a couple of hours).
Kairos: https://github.com/kairos-io/kairos/releases/tag/v3.0.8
Agent: https://github.com/kairos-io/kairos-agent/releases/tag/v2.8.13
@mauromorales @Itxaka facing this issue again with Kairos 3.0.10
@nianyush could you paste the system's /etc/os-release and the payload from the provider?
this is still an issue
The backported version of the agent depends on the right yip version, but unfortunately we didn't bump it in the immucore backports, only on the main branch.
seems to be fixed in the latest
Kairos version 2.4.5
Seems to be a very rare corner case for us; I have only seen this once.
Config file that triggered this: