pop-os / system76-acpi-dkms

System76 ACPI Driver (DKMS)
GNU General Public License v2.0

Kerneloops #10

Closed: Rebreda closed this issue 3 years ago

Rebreda commented 3 years ago

Distribution (run cat /etc/os-release): Fedora 34 Gnome 40.3.0 Wayland

Issue/Bug Description: I sometimes get crashes like the one below. The error message is as follows:

A kernel problem occurred, but your kernel has been tainted (flags:GSDOEL). Explanation:
S - SMP with CPUs not designed for SMP.
D - Kernel has oopsed before
O - Out-of-tree module has been loaded.
E - Unsigned module has been loaded.
L - A soft lockup has previously occurred.
Kernel maintainers are unable to diagnose tainted reports. Tainted modules: system76_acpi.
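For reference, each letter in the flags string corresponds to one bit of the kernel's taint mask. The following is a minimal Python sketch, not part of the original report, that decodes either a letter string or an integer mask. The table follows the kernel's tainted-kernels documentation (whose wording for some flags differs slightly from the tool output above); the function names `describe_letters` and `letters_from_mask` are hypothetical.

```python
# Taint bits, letters, and meanings, following the kernel's
# Documentation/admin-guide/tainted-kernels.rst.
TAINT_TABLE = [
    (0, "P", "proprietary module was loaded"),
    (1, "F", "module was force loaded"),
    (2, "S", "kernel running on an out of specification system"),
    (3, "R", "module was force unloaded"),
    (4, "M", "processor reported a Machine Check Exception"),
    (5, "B", "bad page referenced or some unexpected page flags"),
    (6, "U", "taint requested by userspace application"),
    (7, "D", "kernel died recently (there was an OOPS or BUG)"),
    (8, "A", "ACPI table overridden by user"),
    (9, "W", "kernel issued warning"),
    (10, "C", "staging driver was loaded"),
    (11, "I", "workaround for bug in platform firmware applied"),
    (12, "O", "externally-built (out-of-tree) module was loaded"),
    (13, "E", "unsigned module was loaded"),
    (14, "L", "soft lockup occurred"),
    (15, "K", "kernel has been live patched"),
    (16, "X", "auxiliary taint, defined for and used by distros"),
    (17, "T", "kernel was built with the struct randomization plugin"),
]


def describe_letters(flags):
    """Map a taint string such as 'GSDOEL' to {letter: meaning}."""
    meanings = {letter: text for _, letter, text in TAINT_TABLE}
    # 'G' is printed in place of 'P' when bit 0 is clear.
    meanings["G"] = "no proprietary module was loaded"
    return {ch: meanings[ch] for ch in flags if ch in meanings}


def letters_from_mask(mask):
    """Decode an integer mask, e.g. the value in /proc/sys/kernel/tainted."""
    return "".join(letter for bit, letter, _ in TAINT_TABLE if mask & (1 << bit))


for letter, meaning in describe_letters("GSDOEL").items():
    print(f"{letter}: {meaning}")
```

On a running system, the integer form can be read from `/proc/sys/kernel/tainted` and passed to `letters_from_mask`.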

This seems to happen at random, sometimes in a burst (5 in 10 minutes) and sometimes as a one-off. I can't find much else about it beyond system76_acpi being called out. I would love some more info on it.

jacobgkau commented 3 years ago

What hardware are you using?

Can you please post a complete sudo dmesg output after the issue has occurred? The portion you quoted doesn't show an error message; it only shows that system76_acpi is loaded, not necessarily where the problem occurred.
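As a rough illustration of why the complete log matters, here is a small Python sketch that scans log lines for strings that commonly introduce a kernel oops. The marker list is an assumption for illustration and is not exhaustive; `find_oops_lines` is a hypothetical helper, and the sample lines are copied from messages quoted elsewhere in this thread.

```python
import re

# Strings that commonly introduce or accompany a kernel oops/BUG report.
# This list is an illustrative assumption, not a complete set.
OOPS_MARKERS = re.compile(
    r"cut here|kernel BUG at|BUG: |Oops: |general protection fault|Call Trace:"
)


def find_oops_lines(lines):
    """Return (line_number, text) pairs for lines that match a marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OOPS_MARKERS.search(line)
    ]


sample = [
    "system76_acpi: loading out-of-tree module taints kernel.",
    "------------[ cut here ]------------",
    "kernel BUG at include/linux/pagemap.h:247!",
    "Call Trace:",
]

for number, text in find_oops_lines(sample):
    print(number, text)
```

Note that the module-load line is not matched: a taint message on its own is not a crash report, which is why filtering the log by module name can hide the actual oops.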

Rebreda commented 3 years ago
BIOS Information
        Vendor: coreboot
        Version: 2021-03-11_50eedc2
        Release Date: 03/11/2021
        ROM Size: 16 MB
        Characteristics:
                PCI is supported
                PC Card (PCMCIA) is supported
                BIOS is upgradeable
                Selectable boot is supported
                ACPI is supported
                Targeted content distribution is supported
        BIOS Revision: 4.13
        Firmware Revision: 0.0
System Information
        Manufacturer: System76
        Product Name: Lemur Pro
        Version: lemp9
        Serial Number: 123456789
        UUID: Not Settable
        Wake-up Type: Reserved
        SKU Number: Not Specified
        Family: Not Specified

Filtering dmesg for system76_acpi:

[  +0.000004] BTRFS info (device dm-0): disk space caching is enabled
[  +0.200732] system76_acpi: loading out-of-tree module taints kernel.
[  +0.004751] input: Intel HID events as /devices/platform/INT33D5:00/input/input14
[  +0.001200] system76_acpi: module verification failed: signature and/or required key missing - tainting kernel
[  +0.018221] input: System76 ACPI Hotkeys as /devices/LNXSYSTM:00/LNXSYBUS:00/17761776:00/input/input15
[  +0.000048] ACPI: battery: new extension: System76 Battery Extension
[  +0.063071] mc: Linux media interface: v0.10

I've included some of the logging before and after the messages for context. Thanks for the help!

jacobgkau commented 3 years ago

@gabehab Please provide an unfiltered dmesg so we can see the kernel oops.


System76 customers can reach out to support for technical assistance. For non-System76 hardware, you can seek community support on Reddit or Mattermost.

jacobgkau commented 3 years ago

@gabehab I'm not seeing a kernel oops in that log. Where is the kernel crash that you're seeing?

Rebreda commented 3 years ago

dmesg-aug-7.txt

Sorry, wrong file; lots of reboots these days.

jacobgkau commented 3 years ago

Thank you. Here is the crash:

Aug 07 14:08:27.442818 fedora kernel: page dumped because: VM_BUG_ON_PAGE(PageTail(page))
Aug 07 14:08:27.442836 fedora kernel: ------------[ cut here ]------------
Aug 07 14:08:27.442854 fedora kernel: kernel BUG at include/linux/pagemap.h:247!
Aug 07 14:08:27.442937 fedora kernel: invalid opcode: 0000 [#1] SMP NOPTI
Aug 07 14:08:27.443044 fedora kernel: CPU: 4 PID: 50278 Comm: systemd-userwor Tainted: G S         OE     5.13.6-200.fc34.x86_64 #1
Aug 07 14:08:27.443075 fedora kernel: Hardware name: System76 Lemur Pro/Lemur Pro, BIOS 2021-03-11_50eedc2 03/11/2021
Aug 07 14:08:27.443095 fedora kernel: RIP: 0010:next_uptodate_page+0x23e/0x2a0
Aug 07 14:08:27.443114 fedora kernel: Code: 01 83 f8 01 0f 87 0e ff ff ff e8 fd f8 00 00 e9 19 ff ff ff e8 73 f8 00 00 e9 0f ff ff ff 48 c7 c6 00 69 5d 9d e8 a2 72 03 00 <0f> 0b 48 8b 03 48 8b 40 08 e9 6d fe ff ff 48 c7 c6 90 ab 59 9d e8
...
Aug 07 14:08:27.443292 fedora kernel: Call Trace:
Aug 07 14:08:27.443309 fedora kernel:  filemap_map_pages+0x435/0x700
Aug 07 14:08:27.443324 fedora kernel:  __handle_mm_fault+0x126c/0x1570
Aug 07 14:08:27.443339 fedora kernel:  handle_mm_fault+0xd5/0x2b0
Aug 07 14:08:27.443358 fedora kernel:  do_user_addr_fault+0x1b7/0x670
Aug 07 14:08:27.443376 fedora kernel:  exc_page_fault+0x78/0x160
Aug 07 14:08:27.443401 fedora kernel:  ? asm_exc_page_fault+0x8/0x30
Aug 07 14:08:27.443421 fedora kernel:  asm_exc_page_fault+0x1e/0x30
Aug 07 14:08:27.443436 fedora kernel: RIP: 0033:0x7fcd698d7000
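The "CPU: ... Tainted: ..." line in the trace above records the taint state and kernel version at the moment of the crash. A minimal Python sketch for pulling those fields out follows; the regular expression is an assumption based on this one line (it assumes the command name contains no spaces), and `parse_oops_header` is a hypothetical helper.

```python
import re

# Pattern for the "CPU: ... PID: ... Comm: ... Tainted: ..." oops header line.
# Assumes the command name has no spaces, which holds for the line in this log.
HEADER = re.compile(
    r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+) "
    r"Tainted: (?P<taint>.+?)\s+(?P<kernel>\d+\.\S+) #\d+"
)


def parse_oops_header(line):
    """Extract CPU, PID, command, taint letters, and kernel version."""
    match = HEADER.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match["cpu"]),
        "pid": int(match["pid"]),
        "comm": match["comm"],
        # The taint field is column-aligned with spaces; keep only the letters.
        "taint": "".join(match["taint"].split()),
        "kernel": match["kernel"],
    }


line = (
    "Aug 07 14:08:27.443044 fedora kernel: CPU: 4 PID: 50278 Comm: systemd-userwor "
    "Tainted: G S         OE     5.13.6-200.fc34.x86_64 #1"
)
info = parse_oops_header(line)
print(info)
```

For this crash the letters present are G, S, O, and E; the D and L flags listed in the original report's summary do not appear in this header.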

I'm no kernel engineer, but the references to pages make this sound like a potential RAM issue. I see you're running kernel 5.13.6. Do these crashes also occur if you run Pop!_OS (with kernel 5.11) from a live disk? If so, this could be defective hardware. If not, then it could simply be a kernel bug.

Just to be clear, system76_acpi tainting the kernel is expected and not an issue in itself. If the crash was related to system76_acpi, then I would expect to see it referenced in the error messages/traces. Do the crashes still occur if you remove system76_acpi from your Fedora installation?
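To make that check concrete: in kernel stack traces, frames that belong to a loadable module are printed with the module name in square brackets after the symbol. The Python sketch below looks for such annotations in the frames quoted above; `modules_in_trace` is a hypothetical helper and the pattern is a simplification.

```python
import re

# Frames from loadable modules are printed as "symbol+0xOFF/0xLEN [module]".
MODULE_FRAME = re.compile(r"\S+\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<module>\w+)\]")


def modules_in_trace(lines):
    """Return the set of module names that appear in stack frames."""
    return {
        match["module"]
        for line in lines
        for match in MODULE_FRAME.finditer(line)
    }


# Frames copied from the crash quoted in this comment.
trace = [
    "RIP: 0010:next_uptodate_page+0x23e/0x2a0",
    " filemap_map_pages+0x435/0x700",
    " __handle_mm_fault+0x126c/0x1570",
    " handle_mm_fault+0xd5/0x2b0",
    " do_user_addr_fault+0x1b7/0x670",
    " exc_page_fault+0x78/0x160",
    " ? asm_exc_page_fault+0x8/0x30",
    " asm_exc_page_fault+0x1e/0x30",
]

print("system76_acpi" in modules_in_trace(trace))
```

None of the quoted frames carries a module annotation. This only shows that the module is not on the crashing call path; it does not by itself rule the module out, which is why testing with it removed is still useful.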

Rebreda commented 3 years ago

So, I don't think it is defective RAM as I've never had this problem before upgrading from kernel 5.8 (as mentioned here https://github.com/pop-os/linux/issues/45). It looks like there's a more serious issue with kernel 5.11.

The reason I switched off Pop!_OS (and 5.11) was specifically to see if I could get my machine to be stable and stop crashing every couple of hours. Fedora with 5.13 seemed pretty solid, or at least was not crashing regularly (KDE was freezing a lot, so I switched to GNOME 40 and all's well on that side of things). I then installed the corresponding System76 modules to get better battery performance, etc. However, it then started crashing more regularly. As a result, I uninstalled system76_acpi yesterday and it seemingly hasn't crashed since. Only a short period of time has passed, but so far so good without system76_acpi.

jacobgkau commented 3 years ago

@gabehab Do you have a support ticket open? If your system is not stable running Pop!_OS, then there is most likely a hardware issue; otherwise all lemp9 owners would be having the problem, which is not the case. On the only machines in our lab where we were able to reproduce the 5.11 issues, the issues went away after replacing the RAM (even when the old, defective RAM appeared to pass a RAM test). Even if it's not the RAM itself, it could be the RAM slot or the motherboard. Just because some versions of the kernel don't interact with the hardware in the way that triggers the problem doesn't mean there's not a problem.

(Just emphasizing this because I wouldn't want you to get stuck with bad hardware thinking that a workaround has solved the problem, only for it to come back later.)

Rebreda commented 3 years ago

Got it - I'll open a support ticket and see how it plays out.

Thanks for your help @jacobgkau!