WanWizard opened this issue 4 months ago
So this is a failure after a software update?
So, when is kernel-image-* being updated? The source hasn't changed in years, so why is the package being updated on interim builds?
At the beginning I considered this possible issue. Since kexec multiboot should be distribution agnostic, changing the kernel post-install script isn't a nice idea from my point of view, so I considered a `mount -o bind real-kexec-img /dev/mmcblk0pxx`. I was sure I had implemented that, but I can't find it in the GitHub repository of the kexec-multiboot scripts. At the moment I'm busy; you can try to implement it and open a pull request. The scripts are at https://github.com/BlackHole/kexec-multiboot/tree/main/recipes-core/initrdscripts/files
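A sketch of that bind-mount idea (the paths are placeholders, and this is not code present in the repository):

```
# sketch of the bind-mount idea (placeholder paths): make the selected slot's
# kernel file shadow the flash kernel device, so a guest postinst that dd's
# to the device actually writes into the slot's kernel file instead
mount -o bind /linuxrootfs1/zImage /dev/mmcblk0p1
```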
> So, when is kernel-image-* being updated? The source hasn't changed in years, so why is the package being updated on interim builds?
The OpenPLi kirkstone build wrecked a lot of boxes because kernel-image was amongst the updates, and this happened after the BSP was changed because code.vuplus.com was taken offline.
I agree that under normal circumstances it shouldn't be rebuilt, but bitbake moves in mysterious ways...
I'm a little bit confused, because OpenPLi is using its own build system. So why do you create an issue here? Or do I miss something?
> Or do I miss something?
The offending code is present in the BSP of all images, including OE-A. So the same will happen if you, for example, update OpenATV in a slot and it includes a kernel update.
I posted this here to get consensus on a solution, we're all in the same boat.
The postinst probably needs to be something like
```
pkg_postinst_kernel-image () {
    if [ -d /proc/stb ]; then
        # default: write the new kernel to the flash kernel partition
        DEST="/dev/${MTD_KERNEL}"
        # under kexec multiboot the cmdline carries the real kernel location
        if [ -f /proc/cmdline -a -s /proc/cmdline ]; then
            args=$(cat /proc/cmdline)
            for line in ${args}; do
                key=${line%%=*}
                value=${line#*=}
                if [ "$key" = "kernel" ]; then
                    DEST="$value"
                    break
                fi
            done
        fi
        echo "Kernel is located at ${DEST}"
        if [ -b "${DEST}" ]; then
            # a block device: dd the kernel image directly into it
            dd if=/${KERNEL_IMAGEDEST}/${KERNEL_IMAGETYPE} of="${DEST}"
        elif [ -f "/boot/${DEST}" ]; then
            # a kernel file in /boot (multiboot slot): replace the file
            cp -f /${KERNEL_IMAGEDEST}/${KERNEL_IMAGETYPE} "/boot/${DEST}"
        else
            echo "Can't determine the type of ${DEST}!"
        fi
    fi
    rm -f /${KERNEL_IMAGEDEST}/${KERNEL_IMAGETYPE}
    true
}
```
The test for /proc/cmdline is probably not needed (given the earlier test for /proc/stb), but I've seen that after the issue occurred and the box has rebooted into slot 0, /proc/cmdline doesn't exist, or exists but is 0 bytes.
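For reference, on a kexec-booted slot /proc/cmdline contains a kernel= entry, which is what the loop above extracts. A hypothetical example (the slot path and the other arguments are illustrative only):

```
# hypothetical contents of /proc/cmdline on a kexec-booted multiboot slot;
# the postinst above would pick DEST=/linuxrootfs1/zImage out of it:
root=/dev/mmcblk0p4 rootsubdir=linuxrootfs1 kernel=/linuxrootfs1/zImage rw
```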
@dpeddi The problem is that the kexec kernel that was installed in the kernel partition is overwritten by the postinst of the kernel-image package being installed in one of the slots.
This can be fixed by making the postinst kexec aware (see above), but that doesn't fix it for older images.
As the kexec kernel has been wiped, I'm not sure if this could be addressed from within the kexec scripts, as they won't run anymore.
Since /usr/bin/kernel_auto.bin is still present on the box, it might be possible to revive the broken box simply by doing

```
dd if=/usr/bin/kernel_auto.bin of=/dev/${MTD_KERNEL}
```

(after determining what the MTD_KERNEL device should be).
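A minimal sketch of how that device could be determined from a shell, assuming the kernel partition is exposed under the name "kernel" (both lookups are assumptions; verify against your box before writing anything to flash):

```
#!/bin/sh
# try the classic MTD name table first, then a GPT partition name (assumptions)
KERNELDEV=$(sed -n 's/^\(mtd[0-9]*\):.*"kernel".*/\1/p' /proc/mtd)
if [ -z "${KERNELDEV}" ]; then
    for part in /sys/class/block/mmcblk0p*; do
        [ "$(cat ${part}/partname 2>/dev/null)" = "kernel" ] && KERNELDEV=$(basename ${part})
    done
fi
echo "kernel device: /dev/${KERNELDEV:-unknown}"
```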
Or, much easier, flash the kexec version of kernel_auto.bin with a USB stick to revive the box.
> Or, much easier, flash the kexec version of kernel_auto.bin with a USB stick to revive the box.
That does the same as my manual dd command, no?
Is that simply copying kernel_auto.bin from /usr/bin to /vuplus/?
Given the fact that we will never be able to address this retroactively (for images already built), we need to have some procedure ready for users having this problem, so they don't start with a standard USB flash again and wipe out all multiboot slots...
Apparently nobody here sees this as an issue that needs addressing?
Dpeddi (the originator of the kexec kernel) is on holiday, so better to wait for his return. The issue is there, but has not caused any reported problems that I am aware of on the OE-A images.
I will give it a look within a week or two.
> The issue is there, but has not caused any reported problems that I am aware of on the OE-A images.
Within days we've had several people with this issue.
It only occurs if the kernel-image package is in the updates. It happened for everyone when the VU+ recipes were altered due to code.vuplus.com going down; it also happens sometimes when changes are made to the BSP during development, and it happens in OpenPLi when people have installed a release candidate and do a software update after the version is released (which upgrades the RC to the release version).
Ok, so I will ask again: why is the package being updated when there is no change? I just checked our previous image version and the package name is identical. So why is the package name changing on PLi?
I know this is not the answer to the problem, but it is the reason we don't see it.
The name doesn't change, the PR does after a new build, so opkg sees it as an update.
I agree there should not be anything to update; afaik none of the VU+ BSP changes have an influence on the kernel build. But bitbake decided to rebuild the kernel, which gave the kernel-image package a new PR.
Come to think of it, chances are you won't really encounter this problem in OE-A images (under normal circumstances) as you don't use a PR server, so even if the package is built again, it will have the same PR, so opkg will not see it as an update.
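To illustrate (the version strings below are hypothetical; only the PR suffix differs):

```
# same recipe, same sources, rebuilt with a PR server in the mix:
#   kernel-image-4.1.45_4.1.45-r0_vuuno4kse.ipk   <- previous feed
#   kernel-image-4.1.45_4.1.45-r1_vuuno4kse.ipk   <- after the rebuild
# opkg compares 4.1.45-r0 < 4.1.45-r1, schedules an "upgrade",
# and the postinst runs again, overwriting the kexec kernel
```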
The post-install could fix your latest image, so it could be ok... but if someone installs an old OpenPLi and then updates, it will overwrite the kexec kernel.
Probably both the post-install and some additional protection with a mount -o bind /slot/kernel.img /dev/mmcblk0x in the kexec initrd should be implemented.
The kernel was rebuilt because the kernel source was also downloaded from code.vuplus.com, so that was a SRC_URI that needed changing in the BSP. Again, OE-A didn't have that problem either, because everything is forked and copied to some OE-A specific location.
> Probably both the post-install and some additional protection with a mount -o bind /slot/kernel.img /dev/mmcblk0x in the kexec initrd should be implemented.
That is beyond me I'm afraid.
@Huevos to be specific: OE-A images don't have an issue in this specific case (because the dependency on code.vuplus.com wasn't there), but will have the same problem as soon as there is a reason to manually bump the PR of the linux kernel recipe.
> @Huevos to be specific: OE-A images don't have an issue in this specific case (because the dependency on code.vuplus.com wasn't there), but will have the same problem as soon as there is a reason to manually bump the PR of the linux kernel recipe.
Yes it was, we only changed that SRC_URI a few weeks ago, but not the PR because the code is identical.
Like I said, we use a PR server, so we don't control the PR, bitbake does. And it bumped the PR when the SRC_URI changed.
And images that use a hardcoded PR will also have the issue when they need to bump the PR. Meaning that although it wasn't an issue this time, it may become one in the future, so imho it is worth thinking about it, and not ignoring it.
We will update it when @dpeddi is back from holiday.
@WanWizard Hello, I've attempted to implement the override of the kernel device using the initrd, but it can't work.
The guest image remounts /dev, so the overridden device becomes invisible.
So the solution you propose is the best available. I think we will include it in OE-A, with credits to you.
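For the record, the reason the bind mount disappears is presumably the guest's own early init. A typical early mount (illustrative, not taken from any specific image) replaces /dev wholesale:

```
# typical early-init mount in the guest: a fresh devtmpfs is mounted over
# /dev, so a bind mount placed on the old device nodes is no longer visible
mount -t devtmpfs devtmpfs /dev
```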
And something in the initrd of slot 0 that can detect that the wrong kernel has booted?
Because when the issue happens, half the filesystem is missing (/sys, for example, is completely empty).
Also, my postinst suggestion was written from the top of my head, not tested, so please double-check it.
The multiboot kernel consists of the kexec-enabled kernel plus an initrd that runs the multiboot scripts.
If the guest flashes its kernel, it writes a non-kexec kernel.
Without the kexec kernel no initrd is called, so we can't implement what you are asking.
So on the next reboot it would start a kernel that could be misaligned with the kernel modules and the filesystem of the "recovery" image.
However, what you describe is a bit strange. Usually no kernel modules are needed during normal boot to get all the file systems mounted and Ethernet connectivity. On which box did this mount issue happen? Which recovery image was used? Which guest image?
My idea was that after the box is flashed, the user opts for installation on multiboot from Enigma.
This installs the kexec kernel, and my suggestion was to also include something in the kexec package, to be installed in /etc/init.d/ in slot 0, that can detect that slot 0 was booted because the kexec kernel in flash was overwritten.
I had OpenPLi develop in slot 0, but people with other images in slot 0 have reported the same: if the kexec kernel is overwritten due to a kernel update in a multiboot slot, the box reboots into slot 0 with the kernel that was written to flash in the update, but /sys is empty (which means that, for example, the bootargs can't be read).
I agree with you that, apart from some differences in the kernel defconfig, all kernel images should be the same, as VU+ has never updated one, so I can't explain what is going on. I can only report what I've seen myself and what others reported: after the kexec kernel is overwritten, the box boots slot 0, Enigma starts, but crashes when you start using it.
> My idea was that after the box is flashed, the user opts for installation on multiboot from Enigma.
More or less all OE-A images should have an option to install kexec multiboot after flashing.
> This installs the kexec kernel, and my suggestion was to also include something in the kexec package, to be installed in /etc/init.d/ in slot 0, that can detect that slot 0 was booted because the kexec kernel in flash was overwritten.

Nice idea, but if something goes wrong with the boot we don't have access to the framebuffer and front panel, so we can't display anything except in some logs.
@Huevos, what do you think?
True. I was thinking that, as the image in slot 0 is unusable anyway, when the issue is detected, we simply dd the kexec kernel into flash again (it should still be in /usr/bin), and reboot the box, which should fix it and boot the original slot again?
@WanWizard - so do you know where the box crashes? If it's in E2 it's probably manageable, as all the components are available to fix it, e.g. you know the last flashed slot, and both the kernel and the kexec kernel are available.
To get debug info, I guess I can add code in slot 0 to write the original kernel to disk and then reboot into, say, slot 1, and see what happens.
@WanWizard - would that test setup match the issue as you see it?
> @WanWizard - so do you know where the box crashes? If it's in E2 it's probably manageable, as all the components are available to fix it, e.g. you know the last flashed slot, and both the kernel and the kexec kernel are available.
Several locations, but the only one I have available from a user report is:

```
Traceback (most recent call last):
  File "/usr/lib/enigma2/python/Components/ActionMap.py", line 77, in action
  File "/usr/lib/enigma2/python/Components/ActionMap.py", line 57, in action
  File "/usr/lib/enigma2/python/Screens/Menu.py", line 56, in okbuttonClick
  File "/usr/lib/enigma2/python/Tools/BoundFunction.py", line 10, in __call__
  File "/usr/lib/enigma2/python/Screens/Menu.py", line 69, in runScreen
  File "/usr/lib/enigma2/python/Screens/Menu.py", line 75, in openDialog
  File "/usr/lib/enigma2/python/StartEnigma.py", line 295, in openWithCallback
    dlg = self.open(screen, *arguments, **kwargs)
  File "/usr/lib/enigma2/python/StartEnigma.py", line 305, in open
    dlg = self.current_dialog = self.instantiateDialog(screen, *arguments, **kwargs)
  File "/usr/lib/enigma2/python/StartEnigma.py", line 248, in instantiateDialog
    return self.doInstantiateDialog(screen, arguments, kwargs, self.desktop)
  File "/usr/lib/enigma2/python/StartEnigma.py", line 265, in doInstantiateDialog
    dlg = screen(self, *arguments, **kwargs)
  File "/usr/lib/enigma2/python/Screens/FlashImage.py", line 513, in __init__
  File "/usr/lib/enigma2/python/Tools/Multiboot.py", line 83, in getCurrentImage
FileNotFoundError: [Errno 2] No such file or directory: '/sys/firmware/devicetree/base/chosen/bootargs'
[ePyObject] (CallObject(<bound method NumberActionMap.action of <Components.ActionMap.NumberActionMap object at 0xac5ee700>>,('OkCancelActions', 'ok')) failed)
```
This happens on an OpenPLi image when you go into the multiboot selection screen, and it is triggered because /sys is completely empty.
> To get debug info, I guess I can add code in slot 0 to write the original kernel to disk and then reboot into, say, slot 1, and see what happens.
I don't think this complexity is needed. When the issue occurs, slot 0 is always booted (as writing the guest kernel to flash effectively wipes out multiboot), and the kexec kernel file is still available:
```
bash-5.1# pwd
/boot/usr/bin
bash-5.1# ls -l | grep kern
-rwxr-xr-x 1 root root 6668928 Jan 26 17:56 kernel_auto.bin
```
so it could be dd'd back into flash; after a reboot the box will start its original slot again.
(see the user instructions I wrote: https://wiki.openpli.org/Vu_Multiboot#Multiboot_images_missing_after_an_update.3F)
The enigma2 code is really generic: if it finds a STARTUP file in the scanned location, it switches to multiboot mode. Probably that code should be improved.
So if /sys/firmware/devicetree/base/chosen/bootargs is not available, enigma2 should switch back to single-boot mode and alert the user that multiboot is not available and may need to be reinstalled. I don't know if it is possible to create a popup, but for sure we could force non-multiboot mode and let the user fix it by reinstalling multiboot.
The problem isn't Enigma; the problem is that the wrong kernel is written to flash, which can be detected and fixed, so I don't really see why we need changes to Enigma.
Also, you only get this crash when you go into that specific screen; if you don't, other issues will appear, as you're running only half an operating system. That can lead to problems for the user (loss of functionality or even data), and for us (increased support requests).
So can we keep an eye on the ball please, and fix this issue instead of working around it?
Enigma should handle the absence of /sys/firmware/devicetree/base/chosen/bootargs.
If we add a try/except or a check for the presence of the file, we can alert the user that they may have wiped the multiboot and should fix it by reinstalling it. I think that's the proper way to proceed, since it would work with other guest images too. For sure we also need the post-install workaround.
It is not only that, all of /sys is missing. /boot isn't mounted properly, and there are more issues.
The last thing we need is to allow Enigma to start and give the impression that all is fine; users won't even realize there is an issue until something serious happens, or until they want to boot another slot and realize their slots are gone.
It should be addressed as soon as slot 0 boots, not worked around in Enigma by an end user that doesn't have our skillset.
Hello WanWizard,
The solution that we are going to implement in OE-A takes some ideas from what you suggested, plus some improvements; however, it still needs testing.
The rest of the solution will need an updated recovery image.
The recovery script could be put in the default startup, and it will try to fix the recovery image: it checks for the presence of /STARTUP and /STARTUP.cpio.gz; if present, it assumes the running image is kexec multiboot enabled, checks whether /sys/firmware/devicetree/base/chosen/bootargs is missing, checks whether the last selected startup is in flash or not and locates the path to the guest kernel, then dumps the kernel currently in flash to that path, flashes the kexec kernel into flash again, and reboots.
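A minimal sketch of that logic, with hypothetical paths and a ${MTD_KERNEL}-style device variable (this is not the actual OE-A script):

```
#!/bin/sh
# sketch only: /STARTUP and /STARTUP.cpio.gz are the markers kexec multiboot
# leaves behind, per the description above
if [ -f /STARTUP ] && [ -f /STARTUP.cpio.gz ]; then
    # a kexec-booted guest exposes its bootargs via the devicetree;
    # if they are missing, the kexec kernel in flash was overwritten
    if [ ! -e /sys/firmware/devicetree/base/chosen/bootargs ]; then
        # hypothetical: locate the guest kernel path from the last STARTUP
        GUESTKERNEL=$(sed -n 's/.*kernel=\([^ ]*\).*/\1/p' /STARTUP)
        if [ -n "${GUESTKERNEL}" ]; then
            # save the guest kernel that was wrongly written to flash,
            # restore the kexec kernel, and reboot into multiboot again
            dd if=/dev/${MTD_KERNEL} of="${GUESTKERNEL}"
            dd if=/usr/bin/kernel_auto.bin of=/dev/${MTD_KERNEL}
            reboot
        fi
    fi
fi
```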
If that isn't possible, or a power user doesn't want to reflash the recovery image, they could run something like the following to be prepared for the situation:

```
wget url/to/kexec-recovery-script.sh -O /etc/init.d/kexec-recovery-script.sh
chmod 755 /etc/init.d/kexec-recovery-script.sh
update-rc.d kexec-recovery-script.sh enable
```
I don't like the auto-recovery in the default startup so much, because if something goes wrong the user doesn't have the possibility to back up their files.
Maybe there are still some bugs to fix (I'm still completing the tests on a spare box), but feel free to give it a check and report back if it seems good to you.
Thank you for raising these issues with us.
Thanks :+1: . I'll have a look and try to keep OpenPLi in sync.
There've been long-standing complaints that under certain conditions the VU+ kexec multiboot isn't stable, which may lead to a broken kexec system (so the slot 0 image boots) or a non-booting box.
After suffering from this problem last night, I've decided to look into it.
The root cause seems to be that the postinst of the kernel-image package does a hardcoded dd into the kernel partition, which overwrites the kernel of the slot 0 image, not the one of the running image.

This can be addressed in the BSP by using something like findkerneldevice.sh, like other brands do, but that only fixes it for newly built images, not for all those images already out there. You could show a warning in Enigma when kernel-image-* is amongst the packages being updated, but again that only addresses it for newly built images.

Since this is an issue for all image makers, I'm interested in your thoughts, so we can come up with a common solution (if any).
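For context, the problematic stock behaviour boils down to an unconditional write like the following (a simplified sketch based on the description above, not the literal recipe code):

```
# simplified sketch of the offending postinst (assumption, not the recipe):
# it always writes into the flash kernel partition, so under kexec multiboot
# it clobbers the kexec kernel instead of the running slot's kernel file
pkg_postinst_kernel-image () {
    dd if=/${KERNEL_IMAGEDEST}/${KERNEL_IMAGETYPE} of=/dev/${MTD_KERNEL}
    rm -f /${KERNEL_IMAGEDEST}/${KERNEL_IMAGETYPE}
}
```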