linuxboot / heads

A minimal Linux that runs as a coreboot or LinuxBoot ROM payload to provide a secure, flexible boot environment for laptops, workstations and servers.
https://osresearch.net/
GNU General Public License v2.0

t440p: Stability issues: intermittent OS Kernel panics / segfaults on boot, shutdown #1413

Closed akunterkontrolle closed 1 year ago

akunterkontrolle commented 1 year ago

Please identify some basic details to help process the report

A. Provide Hardware Details

1. What board are you using (see list of boards here)?

2. Does your computer have a dGPU or is it iGPU-only?

3. Who installed Heads on this computer?

4. What PGP key is being used?

5. Are you using the PGP key to provide HOTP verification?

B. Identify how the board was flashed

1. Is this problem related to updating heads or flashing it for the first time?

2. If the problem is related to an update, how did you attempt to apply the update?

3. How was Heads initially flashed

4. Was the board flashed with a maximized or non-maximized/legacy rom?

5. If Heads was externally flashed, was IFD unlocked?

C. Identify the rom related to this bug report

1. Did you download or build the rom at issue in this bug report?

2. If you downloaded your rom, where did you get it from?

Please provide the release number or otherwise identify the rom downloaded

3. If you built your rom, which repository:branch did you use?

4. What version of coreboot did you use in building?

5. In building the rom where did you get the blobs?

Please describe the problem

Describe the bug

On booting, after kexec-ing into the OS kernel, there is a ca. 50% chance of the kernel panicking, either directly after kexec-ing before even producing any output, or shortly afterwards, e.g. directly after entering the disk unlock passphrase. If the init system starts, there is a good chance of the system booting up normally and then working fully without any problems. On very few occasions, even after successfully starting init, there is still a chance of a kernel panic or of some programs randomly segfaulting. On some occasions this also happens on shutdown, directly before the system should power off: instead it hangs with a kernel panic.

Hardware: Thinkpad t440p without d-gpu. Memory: 16GB, CPU: i7-4810MQ, upgraded Touchpad from t450, upgraded to SATA-SSD.

Current OS: Gentoo with Kernel 6.1.28

I am rather confident that the problem is not faulty hardware or depending on the OS (kernel version). I did not encounter any kernel panics or other weird behavior when running libreboot or skulls. However when using heads all GNU/Linux distros with various kernel versions produced the same behavior - I tried Rocky Linux, Debian 11.6, Devuan Chimaera and Gentoo.

Sadly I can't really provide much more information than that: Especially on startup there is often an os kernel panic, when well, there shouldn't be … The heads kernel doesn't panic at all. Has any other owner of a t440p with heads experienced this?

To Reproduce

  1. Start laptop, do normal boot process including hotp verification and selecting default boot options.
  2. Expect a ca. 50% chance of the OS kernel panicking on boot, either directly after kexec or sometimes later, e.g. shortly after entering the disk encryption passphrase. If nothing is visible on screen, the obvious sign is the fan speeding up and staying at constant speed.
  3. If boot succeeded, there is still a chance of a kernel panic on shutdown, immediately before the system is supposed to power down.

Expected behavior

No kernel panics, normal running OS.

Screenshots

I could try to take a picture with my phone of the parts of the panic message that fit on the screen.

Additional context

Is there any method to get logs after the crash? Obviously the kernel logs are gone since I need to hard power-off the laptop after a kernel panic. On the very few occasions where it proceeded to boot and "only" a few programs segfaulted, I sadly forgot to save any logs. I think I read somewhere that (coreboot-)logs from previous boots could be extracted from heads, but I couldn't find that information anymore. Without any kind of logs I have a feeling it is impossible to determine what is going wrong.

tlaurion commented 1 year ago

Is there any method to get logs after the crash? Obviously the kernel logs are gone since I need to hard power-off the laptop after a kernel panic. On the very few occasions where it proceeded to boot and "only" a few programs segfaulted, I sadly forgot to save any logs.

Yes, at minimum a picture of the segfault should be provided. It will show what the kernel was doing at the moment of the fault (driver involved, memory management, etc.).

I think I read somewhere that (coreboot-)logs from previous boots could be extracted from heads, but I couldn't find that information anymore. Without any kind of logs I have a feeling it is impossible to determine what is going wrong.

Yes, those logs are exposed within Heads through cbmem -c (all console logs, which get truncated over multiple boots) and cbmem -1 (that is the digit one: the console log of the last boot only). Not sure they would be helpful, though, but you could export them from Heads to a USB thumb drive with:

mount-usb rw
cbmem -1 > /media/cbmem_last_boot.log

My only intuition concerns the RAM init blob (MRC.bin, borrowed from a Haswell Chromebook) and some weird corruption happening in the same regions from coreboot (CBFS_SIZE has been lowered under 8 MB until native RAM init is on par with blob-initialized RAM upstream).

The differences from libreboot here are the size of CBFS_SIZE (ROM size, kernel) and the payload being Linux kexec'ing into another kernel (in your case, you are booting into Gentoo). The board config (boards/t440p-maximized.config) does not state export CONFIG_BOOT_KERNEL_ADD="intel_iommu=on intel_iommu=igfx_off", as is done on x230-maximized. That might be needed to make sure the kexec'ed kernel applies the proper IOMMU settings. Randomness across boots is weird, to say the least. The coreboot config (config/coreboot-t440p.config) specifies the proper initial kernel boot options: CONFIG_LINUX_COMMAND_LINE="intel_iommu=igfx_off drm_kms_helper.drm_leak_fbdev_smem=1 i915.enable_fbc=0".
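
To check whether a booted OS actually received those IOMMU parameters, its kernel command line can be inspected after kexec. The following is a minimal sketch, assuming a POSIX shell; missing_params is a hypothetical helper name and not part of Heads.

```shell
# Hypothetical helper (not part of Heads): print every expected parameter
# that does NOT appear as a whole word in the given kernel command line.
missing_params() {
    cmdline="$1"
    shift
    for p in "$@"; do
        # Pad with spaces so only whole, space-separated words match.
        case " $cmdline " in
            *" $p "*) ;;          # parameter present, nothing to report
            *) echo "$p" ;;       # parameter missing
        esac
    done
}
```

From the booted OS it could be called as missing_params "$(cat /proc/cmdline)" intel_iommu=on intel_iommu=igfx_off; empty output would mean both parameters reached the kexec'ed kernel.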

From #692, t440p board owners/testers: @ThePlexus @srgrint @akunterkontrolle, same behavior observed? Insights? @rbreslow (added under issue) and here as well.

akunterkontrolle commented 1 year ago

I think I read somewhere that (coreboot-)logs from previous boots could be extracted from heads, but I couldn't find that information anymore. Without any kind of logs I have a feeling it is impossible to determine what is going wrong.

Yes, those logs are exposed within Heads through cbmem -c (all console logs, which get truncated over multiple boots) and cbmem -1 (that is the digit one: the console log of the last boot only). Not sure they would be helpful, though, but you could export them from Heads to a USB thumb drive with:

mount-usb rw
cbmem -1 > /media/cbmem_last_boot.log

Thank you for your reply and your advice! Obviously the expected thing happened: when you want a bug to appear, it no longer does so on demand… There is also one snag to exporting and saving the coreboot log that is obvious in hindsight but didn't occur to me: after writing it to a USB drive, a sync and/or umount /media is necessary, otherwise the data is lost.
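
The export with the flush included can be sketched as a small helper. This is a minimal sketch, assuming cbmem is on the PATH (as in the Heads recovery shell) and that the target directory is already mounted read-write, e.g. via mount-usb rw; save_cbmem_log is a hypothetical name.

```shell
# Hypothetical helper: write the last-boot coreboot console log into an
# already mounted directory and flush it to disk before returning.
save_cbmem_log() {
    dest_dir="$1"
    name="${2:-cbmem_last_boot.log}"
    if [ ! -d "$dest_dir" ]; then
        echo "no such directory: $dest_dir" >&2
        return 1
    fi
    # cbmem -1 prints the console log of the last boot only.
    cbmem -1 > "$dest_dir/$name" || return 1
    # Flush buffered writes; without this the file can be lost on power-off.
    sync
    echo "$dest_dir/$name"
}
```

A typical sequence would be mount-usb rw, then save_cbmem_log /media, then umount /media before removing the drive.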

However I did manage to get a bit of data that is hopefully somewhat helpful - sadly the logs and kernel dumps are completely beyond my understanding.

  1. Kernel dump after a reboot: kernel_panic_after_reboot. This happened immediately after seeing kernel output, before being asked for the disk encryption passphrase.

  2. Something segfaulting while the system was still running (sadly this is also only a picture, since I was too incompetent to mount my USB drive successfully from inside the rescue shell): segfault_on_decrypting_disk. This happened immediately after I entered the disk encryption passphrase. I may or may not have "helped" trigger it by removing my Nitrokey after entering the encryption passphrase. This time I managed to save a coreboot log properly after rebooting into heads: cbmem_crash1.log

  3. Kernel panic immediately after kexec-ing where the OS kernel didn't even produce any visible output yet. (Therefore no picture, the only visible thing was the heads output and a white garbled stripe on the top of the screen.) After a forceful shutdown, I saved the following coreboot log: cbmem_crash2.log

I hope that this information is of some help. I will try to get more kernel dumps and corresponding coreboot logs in the next few days, in the questionable hope of seeing the behavior more often again … (And sorry for the quality of the pictures, I should probably have used a proper camera on a tripod …)

srgrint commented 1 year ago

From #692, t440p board owners/testers: @ThePlexus @srgrint @akunterkontrolle, same behavior observed? Insights? @rbreslow (added under issue) and here as well.

I only use my t440p for testing, not as a daily driver. I don't even have any networking, sound, etc. plugged in, to keep it as easy as possible to externally reflash if needed. Hence I have not particularly stress tested my machine. From my own limited testing, though, I have not experienced any random kernel panics.

I appreciate that @akunterkontrolle doesn't feel this is likely a hardware issue. With difficult-to-pin-down bugs like this, though, I would suggest, if possible, swapping out the RAM to see if you get the same problem using different sticks. Over the years (not Heads specific), most of my kernel panics (apart from when using very bleeding-edge software) have been either RAM or thermal issues. I have sometimes had certain RAM sticks only work with certain coreboot versions, hence my suggestion.

I presume you have repasted your CPU in the last couple of years and the thermal metrics are within reasonable limits?
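
One way to read the thermal metrics from a running Linux system is through the sysfs thermal zones. This is a minimal sketch, assuming the kernel exposes /sys/class/thermal/thermal_zone*/temp in millidegrees Celsius; print_thermal_zones is a hypothetical helper name, and the optional argument only exists so that a different base directory can be used.

```shell
# Hypothetical helper: print each thermal zone's type and its temperature
# in whole degrees Celsius (sysfs reports millidegrees Celsius).
print_thermal_zones() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        # Skip unmatched globs and zones without a readable temperature.
        [ -r "$zone/temp" ] || continue
        temp=$(cat "$zone/temp" 2>/dev/null) || continue
        kind=$(cat "$zone/type" 2>/dev/null || echo unknown)
        echo "$kind: $((temp / 1000)) C"
    done
}
```

Calling print_thermal_zones while the machine is under load gives a rough picture; which values count as reasonable depends on the CPU and is not specified in this thread.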

If your problem persists despite this, I'll try and stress test my T440p and check what happens.

(I have used various builds of Heads on my T440p. I currently have the version built by CircleCI for the testing branch in #1398. From a RAM configuration point of view, I currently have a 2 GB stick in both slots.)

tlaurion commented 1 year ago

@akunterkontrolle have you tried the last comment in #535? Please reopen this issue when done.