kairos-io / kairos

:penguin: The immutable Linux meta-distribution for edge Kubernetes.
https://kairos.io
Apache License 2.0

Reset of system including oem fails #2836

Closed mauromorales closed 2 days ago

mauromorales commented 2 weeks ago

alpine v3.1.2-rc1 on rpi4

The command sudo kairos-agent reset --reboot --reset-oem results in the following error:

[Screenshot of the error: 2024-08-29 at 21 33 02]

mauromorales commented 2 weeks ago

Also happens on Ubuntu 24.04 amd64

jimmykarily commented 1 week ago

@mudler what is the use case for --reset-oem?

Itxaka commented 4 days ago

Seems bad. It looks like we are not unmounting the partitions before formatting. The problem is that the mountpoints are not correct and we are setting some default ones over them, so the unmounting won't work.

Itxaka commented 3 days ago

Indeed, something weird that we haven't hit, I guess because nobody uses this option?

It can't get its mountpoint for some reason:

2024-09-11T09:30:05Z INF &v1.Partition{
  Name: "vda2",
  FilesystemLabel: "COS_OEM",
  Size: 64,
  FS: "ext4",
  Flags: nil,
  MountPoint: "",
  Path: "/dev/vda2",
  Disk: "/dev/vda",
}
[root@cos-recovery ~]# mount |grep vda2
/dev/vda2 on /oem type ext4 (rw,relatime)

Seems like it's because we mount it by label:

/dev/disk/by-label/COS_OEM /oem ext4 rw,relatime 0 0

ghw can't find it because it looks at /sys/block/DISK/PARTITION to get the partition name and compares it against /proc/mounts, so it finds /dev/vda2, but that name doesn't appear in /proc/mounts.
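
To make the mismatch concrete, here is a minimal Go sketch of an exact-name lookup over text in /proc/mounts format. It is not ghw's actual code: `mountPointByName` is a hypothetical helper written only for illustration, and the sample input is the by-label mount entry quoted above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// mountPointByName (hypothetical helper) scans text in /proc/mounts format
// and returns the mount point of the entry whose source field equals dev
// exactly. This mirrors a plain name comparison, nothing more.
func mountPointByName(mounts, dev string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(mounts))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == dev {
			return fields[1], true
		}
	}
	return "", false
}

func main() {
	// The mount entry quoted in the comment above.
	const mounts = "/dev/disk/by-label/COS_OEM /oem ext4 rw,relatime 0 0\n"

	// Looking up the kernel name finds nothing, so MountPoint stays empty.
	_, ok := mountPointByName(mounts, "/dev/vda2")
	fmt.Println("found by kernel name:", ok)

	// Only the exact by-label path matches.
	mp, ok := mountPointByName(mounts, "/dev/disk/by-label/COS_OEM")
	fmt.Println("found by label path:", ok, mp)
}
```

Both paths refer to the same partition, but a string comparison has no way of knowing that.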

This one is a difficult one.....

Itxaka commented 3 days ago

Yeah, big bug here: if immucore mounts partitions by label, ghw won't find them.

At this point, with all the fixes around partitions and the transformations that we do, it might be sane to just redo getallpartitions on our side, borrowing code from ghw directly and putting it into the sdk, so we have the fixes for everything that kairos uses.

We have the transform to our partition, we have the check for lvm to find partitions, and now a fix to get them by label....

richardelling commented 2 days ago

Doing name comparisons for cross-referencing /proc/mounts is not correct. You'll need to get the device from /proc/mounts, look up the major:minor number and then look for matching /dev names. It is not unusual to see 5-10 links in /dev that point to the same major:minor numbered device. Remember, most of the things in /dev are actually symlinks.

More detail:

  1. follow the link
  2. syscall.Stat() returns the Rdev which is the concatenated major:minor number for devices

There are probably a few other ways to do this, but at the end of the day you need to be able to compare against anything that gets to the kernel device interface.
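
The two steps above can be sketched in Go roughly as follows. This is a minimal, Linux-only illustration using the standard library; `deviceID` and `sameDevice` are hypothetical names, not part of ghw or the Kairos SDK. `syscall.Stat` follows symlinks, so a by-label link and the kernel device node should report the same `Rdev`. The sketch compares the raw `Rdev` values rather than decoding major and minor separately.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// deviceID (hypothetical helper) returns the kernel device number, i.e. the
// encoded major:minor pair, of the block device that path refers to.
// syscall.Stat follows symlinks, so /dev/disk/by-label/... links are resolved.
func deviceID(path string) (uint64, error) {
	var st syscall.Stat_t
	if err := syscall.Stat(path, &st); err != nil {
		return 0, fmt.Errorf("stat %s: %w", path, err)
	}
	// Rdev is only meaningful for device nodes; reject everything else.
	if st.Mode&syscall.S_IFMT != syscall.S_IFBLK {
		return 0, fmt.Errorf("%s is not a block device", path)
	}
	return uint64(st.Rdev), nil
}

// sameDevice (hypothetical helper) reports whether two paths name the same
// block device, regardless of which /dev alias each one uses.
func sameDevice(a, b string) (bool, error) {
	ida, err := deviceID(a)
	if err != nil {
		return false, err
	}
	idb, err := deviceID(b)
	if err != nil {
		return false, err
	}
	return ida == idb, nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: samedev PATH_A PATH_B")
		return
	}
	same, err := sameDevice(os.Args[1], os.Args[2])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("same device:", same)
}
```

On the system from the log above, one would pass /dev/vda2 and /dev/disk/by-label/COS_OEM as the two arguments. The same comparison could be applied to the source field of each /proc/mounts entry instead of comparing names.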

Itxaka commented 2 days ago

Yeah, the problem seems to be that during recovery we find the partition using the device name coming from the ghw library, but we mount it by label.

We have now extracted the parts of the library that affect us, to make this code and the partition lookup simpler. That opens the door to returning a device with a list of aliases extracted from the udev database, together with the major:minor.

Basically we worked with what the lib gave us, which restricted what we could do, but in the near future we might be able to get all the info we need instead of having to try, fail, and retry :D

richardelling commented 1 day ago

yeah, the problem with libraries/modules like ghw is that they become least common denominators or get massively OS-specific, thus losing the point of their existence.

The problem with udev is that it isn't a good source of truth... it can and does get out of sync with reality. And udev misses a lot of useful information. And udev can get wedged by misbehaving hardware or firmware. SysFS and DevFS are much better sources of truth.
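
As one possible way to read this information from sysfs rather than udev, the sketch below parses the `dev` attribute of a block device, which on Linux holds its numbers as `major:minor` text. This is an assumption-laden illustration: `parseDevNumbers` and `sysfsDevNumbers` are hypothetical helper names, and the default name `vda2` is simply the partition from the log earlier in the thread.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseDevNumbers (hypothetical helper) parses the "major:minor" text found
// in a sysfs "dev" attribute, tolerating the trailing newline.
func parseDevNumbers(s string) (major, minor uint32, err error) {
	majStr, minStr, ok := strings.Cut(strings.TrimSpace(s), ":")
	if !ok {
		return 0, 0, fmt.Errorf("malformed dev entry %q", s)
	}
	maj, err := strconv.ParseUint(majStr, 10, 32)
	if err != nil {
		return 0, 0, fmt.Errorf("bad major in %q: %w", s, err)
	}
	mnr, err := strconv.ParseUint(minStr, 10, 32)
	if err != nil {
		return 0, 0, fmt.Errorf("bad minor in %q: %w", s, err)
	}
	return uint32(maj), uint32(mnr), nil
}

// sysfsDevNumbers (hypothetical helper) reads the numbers for a kernel block
// device name such as "vda2" from /sys/class/block/<name>/dev.
func sysfsDevNumbers(name string) (uint32, uint32, error) {
	data, err := os.ReadFile(filepath.Join("/sys/class/block", name, "dev"))
	if err != nil {
		return 0, 0, err
	}
	return parseDevNumbers(string(data))
}

func main() {
	name := "vda2" // partition name from the log above; override via argv
	if len(os.Args) > 1 {
		name = os.Args[1]
	}
	major, minor, err := sysfsDevNumbers(name)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s is %d:%d\n", name, major, minor)
}
```

The resulting pair identifies the device independently of any /dev alias, so it could serve as the key when matching partitions against mount entries.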

So should we try to solve this problem here or at ghw?