oe-alliance / oe-alliance-core

The openembedded alliance core.
GNU General Public License v2.0

VU+ kexec and kernel updates #877

Open WanWizard opened 2 months ago

WanWizard commented 2 months ago

There have been long-standing complaints that, under certain conditions, the VU+ kexec multiboot isn't stable, which may lead to a broken kexec system (so the slot 0 image boots) or a non-booting box.

After suffering from this problem last night, I've decided to look into it.

The root cause seems to be that the postinst of the kernel-image package does a hardcoded dd into the kernel partition, which overwrites the kernel of the slot 0 image, not the one of the running image.

This can be addressed in the BSP by using something like findkerneldevice.sh, as other brands do, but that only fixes it for newly built images, not for all the images already out there. You could show a warning in Enigma when kernel-image-* is among the packages being updated, but again that only addresses it for newly built images.

Since this is an issue for all image makers, I'm interested in your thoughts on things, so we can come up with a common solution (if any).

TwolDE2 commented 2 months ago

So this is a failure after a software update?

Huevos commented 2 months ago

So, when is kernel-image-* being updated? The source hasn't changed in years so why is the package being updated on interim builds?

dpeddi commented 2 months ago

I considered this possible issue at the beginning. Since the kexec multiboot should be distribution agnostic, changing the kernel post-install script isn't a nice idea from my point of view, so I considered a mount -o bind real-kexec-img /dev/mmcblk0pxx instead. I was sure I had implemented that, but I can't find it in the GitHub repository of the kexec-multiboot scripts. I'm busy at the moment, so you can try to implement it and open a pull request. The scripts are at https://github.com/BlackHole/kexec-multiboot/tree/main/recipes-core/initrdscripts/files

WanWizard commented 2 months ago

So, when is kernel-image-* being updated? The source hasn't changed in years so why is the package being updated on interim builds?

The OpenPLi kirkstone build wrecked a lot of boxes because kernel-image was amongst the updates, and this happened after the BSP was changed because code.vuplus.com was taken offline.

I agree that under normal circumstances it shouldn't be rebuilt, but bitbake moves in mysterious ways...

jbleyel commented 2 months ago

I'm a little bit confused because OpenPLi is using its own build system. So why do you create an issue here? Or do I miss something?

WanWizard commented 2 months ago

Or do I miss something?

The offending code is present in the BSP of all images, including OE-A. So the same will happen if you would for example update OpenATV in a slot and it has a kernel update.

I posted this here to get consensus on a solution, we're all in the same boat.

WanWizard commented 2 months ago

The postinst probably needs to be something like

pkg_postinst_kernel-image () {
    if [ -d /proc/stb ] ; then
        # Default: the hardcoded kernel partition from the BSP
        DEST="/dev/${MTD_KERNEL}"
        # A kexec multiboot slot passes its own kernel location as kernel=<value>
        if [ -s /proc/cmdline ]; then
            for arg in $(cat /proc/cmdline); do
                key="${arg%%=*}"
                value="${arg#*=}"
                if [ "$key" = "kernel" ]; then
                    DEST="$value"
                    break
                fi
            done
        fi
        echo "Kernel is located at $DEST"

        if [ -b "$DEST" ]; then
            # DEST already holds the full device path, so no extra /dev/ prefix
            dd if=/${KERNEL_IMAGEDEST}/${KERNEL_IMAGETYPE} of="$DEST"
        elif [ -f "/boot/$DEST" ]; then
            cp -f /${KERNEL_IMAGEDEST}/${KERNEL_IMAGETYPE} "/boot/$DEST"
        else
            echo "Can't determine type of $DEST!"
        fi
    fi
    rm -f /${KERNEL_IMAGEDEST}/${KERNEL_IMAGETYPE}
    true
}

The test for /proc/cmdline is probably not needed (given the earlier test for /proc/stb), but I've seen that after the issue occurred and the box has rebooted into slot 0, /proc/cmdline doesn't exist, or exists but is 0 bytes.
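
The parsing step of the postinst can be pulled out into a small function that can be tried off the box. This is only a sketch: kernel_from_cmdline is a hypothetical helper name, and the sample command line and device names below are made up for illustration, not taken from a real VU+ box.

```shell
#!/bin/sh
# Hypothetical helper: print the value of kernel=<value> from a kernel
# command line, or the given fallback when no kernel= argument is present.
# $1 = command line string, $2 = fallback destination
kernel_from_cmdline() {
    kfc_cmdline="$1"
    kfc_fallback="$2"
    # Unquoted on purpose: split the command line into its arguments
    for kfc_arg in $kfc_cmdline; do
        case "$kfc_arg" in
            kernel=*)
                # Strip the "kernel=" prefix and stop at the first match
                printf '%s\n' "${kfc_arg#kernel=}"
                return 0
                ;;
        esac
    done
    printf '%s\n' "$kfc_fallback"
}

# Illustrative values only (assumptions, not real box output)
kernel_from_cmdline "console=ttyS0 kernel=/linuxrootfs1/zImage root=/dev/sdX" "/dev/fallback"
kernel_from_cmdline "" "/dev/fallback"
```

An empty command line falls through to the fallback, which is the behaviour wanted for the zero-byte /proc/cmdline case described above.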

WanWizard commented 2 months ago

@dpeddi The problem is that the kexec kernel that was installed in the kernel partition is overwritten by the postinst of the kernel-image package being installed in one of the slots.

This can be fixed by making the postinst kexec aware (see above), but that doesn't fix it for older images.

As the kexec kernel has been wiped, I'm not sure if this could be addressed from within the kexec scripts, as they won't run anymore.

Since /usr/bin/kernel_auto.bin is still present on the box, it might be possible to revive the broken box simply by doing

dd if=/usr/bin/kernel_auto.bin of=/dev/${MTD_KERNEL}

(after determining what the MTD KERNEL device should be).
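
A minimal sketch of that manual recovery, written so that it only prints the dd command unless explicitly told to write. restore_kexec_kernel and the DO_WRITE switch are hypothetical names; the kernel device still has to be determined per box and is passed in by hand.

```shell
#!/bin/sh
# Hypothetical helper: write the saved kexec kernel back to the kernel device.
# $1 = path to the saved kexec kernel (e.g. /usr/bin/kernel_auto.bin)
# $2 = kernel device, determined by the user for their box model
# Prints the dd command by default; only writes when DO_WRITE=1 is set.
restore_kexec_kernel() {
    rkk_src="$1"
    rkk_dest="$2"
    # Never flash from a missing or zero-byte file
    if [ ! -s "$rkk_src" ]; then
        echo "kexec kernel $rkk_src is missing or empty" >&2
        return 1
    fi
    if [ "${DO_WRITE:-0}" = "1" ]; then
        # Refuse to write to anything that is not a block device
        if [ ! -b "$rkk_dest" ]; then
            echo "$rkk_dest is not a block device" >&2
            return 1
        fi
        dd if="$rkk_src" of="$rkk_dest" && sync
    else
        echo "dd if=$rkk_src of=$rkk_dest"
    fi
}
```

Running it once without DO_WRITE shows what would be written where; a second run with DO_WRITE=1 does the actual write, after which the box needs a reboot.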

dpeddi commented 2 months ago

Or much easier, flash the kexec version of kernel_auto.bin with a USB stick to revive the box.

WanWizard commented 2 months ago

Or much easier, flash the kexec version of kernel_auto.bin with a USB stick to revive the box.

That does the same as my manual dd command, no?

Is that simply copying kernel_auto.bin from /usr/bin to /vuplus/ on the USB stick?

Given the fact that we will never be able to address this retroactively (for images already built), we need to have some procedure ready for users having this problem, so they don't start with a standard USB flash again and wipe out all multiboot slots...

WanWizard commented 2 months ago

Apparently nobody here sees this as an issue that needs addressing?

TwolDE2 commented 2 months ago

dpeddi (the originator of the kexec kernel) is on holiday, so it's better to wait for his return. The issue is there, but has not caused any reported problems that I am aware of on the OE-A images.

dpeddi commented 2 months ago

I'll take a look within a week or two.

WanWizard commented 2 months ago

The issue is there, but has not caused any reported problems that I am aware of on the OE-A images.

Within days we've had several people with this issue.

It only occurs if the kernel-image package is in the updates. This happened for all when the VU+ recipes were altered due to code.vuplus.com going down, it also happens sometimes when changes are made to the BSP during development, and it happens in OpenPLi when people have installed a release candidate, and do a software update after the version is released (which upgrades the RC to the release version).

Huevos commented 2 months ago

Ok, so I will ask again: why is the package being updated when there is no change? I just checked our previous image version and the package name is identical. So why is the package name changing on PLi?

I know this is not the answer to the problem, but it is the reason we don't see it.

WanWizard commented 2 months ago

The name doesn't change, the PR does after a new build, so opkg sees it as an update.

I agree there should not be anything to update, afaik none of the VU+ BSP changes has an influence on the kernel build. But bitbake decided to rebuild the kernel. Which gave the kernel-image package a new PR.

Come to think of it, chances are you won't really encounter this problem in OE-A images (under normal circumstances) as you don't use a PR server, so even if the package is built again, it will have the same PR, so opkg will not see it as an update.

dpeddi commented 2 months ago

The post-install fix would cover your latest image, so that could be OK. But if someone installs an old OpenPLi image and then updates it, they will still overwrite the kexec kernel.

Probably both should be implemented: the post-install fix, and some additional protection with mount -o bind /slot/kernel.img /dev/mmcblk0x in the kexec initrd.

WanWizard commented 2 months ago

The kernel was rebuilt because the kernel source was also downloaded from code.vuplus.com, so that was a SRC_URI that needed changing in the BSP. Again, OE-A didn't have that problem either, because everything is forked and copied to some OE-A specific location.

WanWizard commented 2 months ago

Probably should be implemented both the post-install and some additional protection with mount -o bind /slot/kernel.img /dev/mmcblk0x in the kexec initrd

That is beyond me I'm afraid.

WanWizard commented 2 months ago

@Huevos to be specific: OE-A images don't have an issue in this specific case (because the dependency on code.vuplus.com wasn't there), but will have the same problem as soon as there is a reason to manually bump the PR of the linux kernel recipe.

Huevos commented 2 months ago

@Huevos to be specific: OE-A images don't have an issue in this specific case (because the dependency on code.vuplus.com wasn't there), but will have the same problem as soon as there is a reason to manually bump the PR of the linux kernel recipe.

Yes it was, we only changed that SRC_URI a few weeks ago, but not the PR because the code is identical.

WanWizard commented 2 months ago

Like I said, we use a PR server, so we don't control the PR, bitbake does. And it bumped the PR when the SRC_URI changed.

And images that use a hardcoded PR will also have the issue when they need to bump the PR. Meaning that although it wasn't an issue this time, it may become one in the future, so imho it is worth thinking about it, and not ignoring it.

Huevos commented 2 months ago

We will update it when @dpeddi is back from holiday.

dpeddi commented 2 months ago

@WanWizard Hello, I've attempted to implement the override of the kernel device from the initrd, but it can't work.

The guest image remounts /dev, so the overridden device becomes invisible.

So the solution you propose is the best available. I think we will include it in OE-A with credits to you.

WanWizard commented 2 months ago

And something in the initrd of slot 0 that can detect the wrong kernel has booted?

Because when the issue happens, half the filesystem is missing (like /sys, completely empty).

Also, my postinst suggestion was written off the top of my head and not tested, so please double-check it.

dpeddi commented 2 months ago

Multiboot kernel consists of:

If the guest flashes the kernel, it writes a non-kexec kernel.

Without the kexec kernel no initrd is called, so we can't implement what you are asking.

So on the next reboot it would start a kernel that could be mismatched with the kernel modules and the filesystem of the "recovery" image.

However, what you describe is a bit strange. Usually no kernel modules are needed during a normal boot to get all the file systems mounted and Ethernet connectivity up. On which box did this mount issue happen? Which recovery image was used? Which guest image?

WanWizard commented 2 months ago

My idea was that after the box is flashed, the user opts for installation on multiboot from Enigma.

This installs the kexec kernel, and my suggestion was to also include something in the kexec package, to be installed in /etc/init.d/ in slot 0, that can detect that slot 0 was booted because the kexec kernel in flash was overwritten.

I had OpenPLi develop in slot 0, but people with other images in slot 0 have reported the same: if the kexec kernel is overwritten due to a kernel update in a multiboot slot, the box reboots into slot 0 with the kernel that was written to flash in the update, but /sys is empty (which means that, for example, bootargs can't be read).

I agree with you that apart from some differences in kernel defconfig, all kernel images should be the same, as VU+ has never updated one, so I can't explain what is going on. I can only report what I've seen myself and what others reported: after the kexec kernel is overwritten, the box boots slot 0 and Enigma starts, but it crashes when you start using it.

dpeddi commented 2 months ago

My idea was that after the box is flashed, the user opts for installation on multiboot from Enigma.

More or less all OE-A images should have an option to install kexec-multiboot after flashing.

This installs the kexec kernel, and my suggestion was to also include something in the kexec package, to be installed in /etc/init.d/ in slot 0, that can detect that slot 0 was booted because the kexec kernel in flash was overwritten.

Nice idea, but if something goes wrong during boot we don't have access to the framebuffer and front panel, so we can't display anything except in some logs.

@Huevos, what do you think?

WanWizard commented 2 months ago

True. I was thinking that, as the image in slot 0 is unusable anyway, when the issue is detected we could simply dd the kexec kernel into flash again (it should still be in /usr/bin) and reboot the box, which should fix it and boot the original slot again?

TwolDE2 commented 2 months ago

@WanWizard - so do you know where the box crashes? If it's in E2 it's probably manageable, as all the components are available to fix it, e.g. you know the last flashed slot and both the kernel and kexec kernel are available.

TwolDE2 commented 2 months ago

To get debug, I guess I can add code in slot 0 to write the original kernel to disk and then reboot to say slot 1 and see what happens.

@WanWizard - would that test setup match the issue as you see it?

WanWizard commented 2 months ago

@WanWizard - so do you know where the box crashes? If it's in E2 it's probably manageable, as all the components are available to fix it, e.g. you know the last flashed slot and both the kernel and kexec kernel are available.

Several locations, but the only one I have available from a user report is

Traceback (most recent call last):
  File "/usr/lib/enigma2/python/Components/ActionMap.py", line 77, in action
  File "/usr/lib/enigma2/python/Components/ActionMap.py", line 57, in action
  File "/usr/lib/enigma2/python/Screens/Menu.py", line 56, in okbuttonClick
  File "/usr/lib/enigma2/python/Tools/BoundFunction.py", line 10, in __call__
  File "/usr/lib/enigma2/python/Screens/Menu.py", line 69, in runScreen
  File "/usr/lib/enigma2/python/Screens/Menu.py", line 75, in openDialog
  File "/usr/lib/enigma2/python/StartEnigma.py", line 295, in openWithCallback
    dlg = self.open(screen, *arguments, **kwargs)
  File "/usr/lib/enigma2/python/StartEnigma.py", line 305, in open
    dlg = self.current_dialog = self.instantiateDialog(screen, *arguments, **kwargs)
  File "/usr/lib/enigma2/python/StartEnigma.py", line 248, in instantiateDialog
    return self.doInstantiateDialog(screen, arguments, kwargs, self.desktop)
  File "/usr/lib/enigma2/python/StartEnigma.py", line 265, in doInstantiateDialog
    dlg = screen(self, *arguments, **kwargs)
  File "/usr/lib/enigma2/python/Screens/FlashImage.py", line 513, in __init__
  File "/usr/lib/enigma2/python/Tools/Multiboot.py", line 83, in getCurrentImage
FileNotFoundError: [Errno 2] No such file or directory: '/sys/firmware/devicetree/base/chosen/bootargs'
[ePyObject] (CallObject(<bound method NumberActionMap.action of <Components.ActionMap.NumberActionMap object at 0xac5ee700>>,('OkCancelActions', 'ok')) failed)

Which happens on an OpenPLi image when you go into the multiboot selection screen. It is triggered because /sys is completely empty.

WanWizard commented 2 months ago

To get debug, I guess I can add code in slot 0 to write the original kernel to disk and then reboot to say slot 1 and see what happens.

I don't think this complexity is needed. When the issue occurs, slot 0 is always booted (as writing the guest kernel to flash effectively wipes out multiboot), and the kexec kernel file is still available:

bash-5.1# pwd
/boot/usr/bin
bash-5.1# ls -l | grep kern
-rwxr-xr-x    1 root     root       6668928 Jan 26 17:56 kernel_auto.bin

so it could be dd'd back into flash, reboot, and the box will start its original slot again.

( see the user instructions I wrote: https://wiki.openpli.org/Vu_Multiboot#Multiboot_images_missing_after_an_update.3F )

dpeddi commented 2 months ago

The enigma2 code is really generic. If it finds a STARTUP file in the scanned location, it switches to multiboot mode. That code should probably be improved.

So if /sys/firmware/devicetree/base/chosen/bootargs is not available, enigma2 should switch back to single-boot mode and alert the user that multiboot is not available and may need to be reinstalled. I don't know whether it is possible to create a popup, but we could certainly force non-multiboot mode and let the user fix it by reinstalling multiboot.

WanWizard commented 2 months ago

The problem isn't Enigma, the problem is the wrong kernel is written to flash, which can be detected and fixed, so I don't really see why we need changes to Enigma.

Also, you only get this when you go into that specific screen, but if you don't, other issues will appear, as you're running only half an operating system. This can lead to issues for the user (loss of functionality or even data), and for us (increased support requests).

So can we keep an eye on the ball please, and fix this issue instead of working around it?

dpeddi commented 2 months ago

Enigma should handle a missing /sys/firmware/devicetree/base/chosen/bootargs.

If we add a try/except or a check for the presence of the file, we can alert the user that multiboot may have been broken and should be fixed by reinstalling it. I think that's the proper way to proceed, since it would work with other guest images too. We certainly need the post_install workaround as well.

WanWizard commented 2 months ago

It is not only that, all of /sys is missing. /boot isn't mounted properly, and there are more issues.

The last thing we need is to allow Enigma to start, which gives the impression that all is fine. Users won't even realize that there is an issue until something serious happens, or until they want to boot another slot and realize the slots are gone.

It should be addressed as soon as slot 0 boots, not worked around in Enigma by an end-user that doesn't have our skillset.

dpeddi commented 1 month ago

Hello WanWizard,

The solution that we are going to implement in OE-A takes some ideas from what you suggested plus some improvements, but it still needs testing.

The rest of the solution will need an updated recovery image.

The recovery script could be put in the default startup, where it will try to fix the recovery image. It checks for the presence of /STARTUP and /STARTUP.cpio.gz. If they are present, it assumes the running image is kexec multiboot enabled. It then checks whether /sys/firmware/devicetree/base/chosen/bootargs is missing, checks whether the last selected startup is in flash or not, and locates the path to the guest kernel. Finally it dumps the kernel in flash to the located path, flashes the kexec kernel to flash again, and reboots.
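
The detection part of that description could look roughly like the sketch below. It covers only the check, not locating the guest kernel or reflashing. kexec_kernel_lost is a hypothetical function name, requiring both marker files is an assumption, and the root directory is passed as an argument purely so the logic can be tried in a scratch directory.

```shell
#!/bin/sh
# Hypothetical check: returns 0 when the image under $1 looks kexec multiboot
# enabled but was booted without the kexec kernel (bootargs node missing),
# and 1 otherwise.
# $1 = filesystem root to inspect (defaults to /)
kexec_kernel_lost() {
    kkl_root="${1:-/}"
    # Assumption: both marker files must exist for the image to count
    # as kexec multiboot enabled
    [ -e "$kkl_root/STARTUP" ] || return 1
    [ -e "$kkl_root/STARTUP.cpio.gz" ] || return 1
    # With the kexec kernel in place this node exists; if it is gone,
    # another kernel was written to flash
    [ ! -e "$kkl_root/sys/firmware/devicetree/base/chosen/bootargs" ]
}
```

An init script in the recovery image could call kexec_kernel_lost with no argument and only continue with the repair steps when it returns 0.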

If that isn't possible, or a power user doesn't want to reflash the recovery image, they could run something like the following to be prepared for the situation:

wget url/to/kexec-recovery-script.sh -O /etc/init.d/kexec-recovery-script.sh
chmod 755 /etc/init.d/kexec-recovery-script.sh 
update-rc.d kexec-recovery-script.sh enable

I'm not so keen on auto-recovery at startup by default, because if something goes wrong the user has no opportunity to back up their files.

There may still be some bugs to fix (I still have to complete all the tests on a spare box), but feel free to check it and report whether it looks good to you.

Thank you for raising these issues with us.

WanWizard commented 1 month ago

Thanks :+1: . I'll have a look and try to keep OpenPLi in sync.