linux-surface / linux-surface

Linux Kernel for Surface Devices
4.95k stars 211 forks source link

Go 3: Doesn't shutdown properly after using it #1111

Open PhilDevProg opened 1 year ago

PhilDevProg commented 1 year ago

[Description of the bug or feature] Sometimes after I used it for some hours and then shut it down, it doesn't shutdown properly. It seems to still be on because if I start it ~2 hours later, it has lost ~30% of batttery power. I also have to hold the power button ~30 seconds to see the Microsoft logo.

Environment

dmesg output: dmesg.txt

LubosD commented 1 year ago

Same for me. It really seems that when I intentionally try to reproduce the issue, it shuts down correctly.

But after some usage it appears to shut down, but it's still depleting the battery and getting warm.

qzed commented 1 year ago

Seeing as this seems to happen somewhat randomly, it's likely a different issue than on the SL5/SP9. Your log shows some suspend-test warning (did you try suspend-testing?) and some usbhid/hid-multitouch error that seems to happen on suspend(-testing). I'm not sure if that's related to shutdown though, but it's possible it is.

StollD commented 1 year ago

If there is a kernel error during shutdown, you can potentially use pstore to get the log after a reboot. I wrote this up during debugging of the SP9 / SL5 shutdown issue, but since that wasn't caused by a kernel panic, I don't know if it works:

grafik

I do know that it definitly won't work on Fedora though, because they dont enable the required modules.

PhilDevProg commented 1 year ago

Seeing as this seems to happen somewhat randomly, it's likely a different issue than on the SL5/SP9. Your log shows some suspend-test warning (did you try suspend-testing?) and some usbhid/hid-multitouch error that seems to happen on suspend(-testing). I'm not sure if that's related to shutdown though, but it's possible it is.

No, I didn't try anything suspend related...

qzed commented 1 year ago

No, I didn't try anything suspend related...

Ah, my mistake. I thought based on the name that this function is only called when suspend-testing is enabled, but it seems to be called any time the device is suspended.

In essence, there's a bunch of stuff like the excerpt below in the log. It looks like this doesn't happen on every suspend though.

[30332.001097] PM: suspend entry (s2idle)
[30332.506679] Filesystems sync: 0.505 seconds
[30332.752926] Freezing user space processes
[30332.758411] Freezing user space processes completed (elapsed 0.005 seconds)
[30332.758423] OOM killer disabled.
[30332.758426] Freezing remaining freezable tasks
[30332.759901] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[30332.759912] printk: Suspending console(s) (use no_console_suspend to debug)
[30332.795751] pcieport 0000:00:1c.3: Intel SPT PCH root port ACS workaround enabled
[30343.169232] usbhid 1-7:1.3: suspend error -5
[30343.193427] PM: suspend devices took 10.433 seconds
[30343.193442] ------------[ cut here ]------------
[30343.193445] Component: suspend devices, time: 10433
[30343.193460] WARNING: CPU: 0 PID: 21061 at kernel/power/suspend_test.c:53 suspend_test_finish+0x70/0x80
[30343.193485] Modules linked in: tls uinput rfcomm snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep snd_sof_pci_intel_skl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils soundwire_bus snd_soc_avs snd_soc_hda_codec snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_soc_sst_ipc intel_rapl_msr dw9719 snd_soc_sst_dsp snd_soc_acpi_intel_match snd_hda_codec_hdmi snd_soc_acpi snd_soc_core snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_rapl_common snd_compress ac97_bus intel_tcc_cooling snd_pcm_dmaengine x86_pkg_temp_thermal snd_hda_intel intel_powerclamp snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec gpio_tps68470 coretemp snd_hda_core
[30343.193670]  iwlmvm snd_hwdep kvm_intel mei_hdcp mei_pxp snd_seq mac80211 kvm snd_seq_device libarc4 irqbypass btusb snd_pcm rapl btrtl btbcm btintel btmtk binfmt_misc iwlwifi intel_cstate wmi_bmof vfat intel_uncore pcspkr fat bluetooth snd_timer snd hid_sensor_gyro_3d hid_sensor_accel_3d hid_sensor_magn_3d hid_sensor_incl_3d hid_sensor_als soundcore hid_sensor_rotation hid_sensor_trigger cfg80211 hid_sensor_iio_common industrialio_triggered_buffer joydev intel_skl_int3472_tps68470 kfifo_buf industrialio rfkill ipu3_imgu(C) mei_me tps68470_regulator mei ipu3_cio2 videobuf2_dma_sg videobuf2_memops videobuf2_v4l2 ov7251 ov8865 idma64 videobuf2_common intel_xhci_usb_role_switch ov5693 videodev clk_tps68470 intel_pch_thermal intel_skl_int3472_discrete mc soc_button_array surface_hotplug intel_hid acpi_pad sparse_keymap loop zram dm_crypt hid_sensor_hub intel_ishtp_hid crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic i915 nvme rtsx_pci_sdmmc drm_buddy mmc_core
[30343.193947]  drm_display_helper nvme_core intel_ish_ipc cec ghash_clmulni_intel ucsi_acpi hid_multitouch rtsx_pci nvme_common typec_ucsi sha512_ssse3 ttm intel_ishtp typec i2c_hid_acpi i2c_hid video wmi pinctrl_sunrisepoint ip6_tables ip_tables fuse
[30343.194009] CPU: 0 PID: 21061 Comm: systemd-sleep Tainted: G        WC         6.2.8-1.surface.fc37.x86_64 #1
[30343.194021] Hardware name: Microsoft Corporation Surface Go 3/Surface Go 3, BIOS 9.103.143 02/09/2022
[30343.194026] RIP: 0010:suspend_test_finish+0x70/0x80
[30343.194041] Code: 03 00 00 29 c1 e8 10 9e 00 00 81 fb 10 27 00 00 77 07 5b 5d c3 cc cc cc cc 89 da 48 89 ee 48 c7 c7 3b b2 85 a0 e8 30 cc f7 ff <0f> 0b 5b 5d c3 cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90
[30343.194049] RSP: 0018:ffffae52e6187d58 EFLAGS: 00010286
[30343.194059] RAX: 0000000000000000 RBX: 00000000000028c1 RCX: 0000000000000000
[30343.194065] RDX: 0000000000000002 RSI: ffffffffa08bd1b6 RDI: 00000000ffffffff
[30343.194072] RBP: ffffffffa085b11d R08: ffffffffa0c64780 R09: 0000000000ffff0a
[30343.194078] R10: 0000000000000004 R11: 0000000000000029 R12: 0000000000000001
[30343.194083] R13: ffff91a534a9b910 R14: 0000000000000004 R15: ffff91a4ed07b920
[30343.194090] FS:  00007fe92742bb40(0000) GS:ffff91a5a7400000(0000) knlGS:0000000000000000
[30343.194098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30343.194104] CR2: 000055a7a3613726 CR3: 00000001243dc005 CR4: 00000000003706f0
[30343.194111] Call Trace:
[30343.194117]  <TASK>
[30343.194127]  suspend_devices_and_enter+0x17c/0x960
[30343.194147]  pm_suspend+0x25a/0x590
[30343.194160]  state_store+0x68/0xd0
[30343.194174]  kernfs_fop_write_iter+0x11b/0x200
[30343.194192]  vfs_write+0x241/0x400
[30343.194212]  ksys_write+0x6b/0xf0
[30343.194224]  do_syscall_64+0x58/0x80
[30343.194243]  ? exc_page_fault+0x70/0x170
[30343.194259]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[30343.194277] RIP: 0033:0x7fe92771e284
[30343.194287] Code: 15 b1 7b 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 80 3d 7d 03 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
[30343.194294] RSP: 002b:00007ffeff87af58 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[30343.194303] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe92771e284
[30343.194308] RDX: 0000000000000004 RSI: 00007ffeff87b040 RDI: 0000000000000004
[30343.194313] RBP: 00007ffeff87b040 R08: 0000560a0acc5600 R09: 0000000000000000
[30343.194318] R10: 0000560a0a491158 R11: 0000000000000202 R12: 0000000000000004
[30343.194323] R13: 0000560a0acc13b0 R14: 0000000000000004 R15: 00007fe9277f2a00
[30343.194340]  </TASK>
[30343.194343] ---[ end trace 0000000000000000 ]---
[30343.215108] ACPI: EC: interrupt blocked
[30343.219829] intel_pch_thermal 0000:00:14.2: CPU-PCH is cool [33C]
[31075.563067] ACPI: EC: interrupt unblocked
[31075.575445] pcieport 0000:00:1c.3: Intel SPT PCH root port ACS workaround enabled
[31075.575445] pcieport 0000:00:1c.0: Intel SPT PCH root port ACS workaround enabled
[31075.592927] pcieport 0000:00:1c.6: Intel SPT PCH root port ACS workaround enabled
[31075.827141] PM: resume devices took 0.238 seconds
[31075.827447] OOM killer enabled.
[31075.827450] Restarting tasks ... 
[31075.837438] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[31075.873138] done.
[31075.873185] random: crng reseeded on system resumption
[31075.883996] PM: suspend exit

So, sometimes the suspend entry seems to take a long time, likely because something USB related failed before (so this might run into some timeout, I guess?).

PhilDevProg commented 1 year ago

@LubosD Are you using Fedora? If not, could you test the method by @StollD the next time it happens?

LubosD commented 1 year ago

I use KDE Neon. I'll check it out.

FWIW I recently had a black screen hang also when doing a reboot.

fvasco commented 1 year ago

Same issue on Surface Go 1

PhilDevProg commented 1 year ago

@fvasco Are you using a distro which isn't Fedora/Fedora based? If yes, could you try the debugging guide StollID suggested: https://github.com/linux-surface/linux-surface/issues/1111#issuecomment-1493108327 the next time it happens?

fvasco commented 1 year ago

Sorry, we are using Kubuntu

PhilDevProg commented 1 year ago

Sorry, we are using Kubuntu

Which isn't a Fedora-based distro. @StollD Does your guide work with Ubuntu/Debian?

qzed commented 1 year ago

You could give it a try, worst case it won't work / there won't be a log.

StollD commented 1 year ago

The ubuntu config we are using enables the efi_pstore module but it is not built into vmlinux and instead built as a loadable module. I don't know what other kernel params you might need to force the kernel to load it.

fvasco commented 1 year ago

I found this after an hang on shutdown.

After the boot:

# dmesg |grep -i pstore
[    7.217006] systemd[1]: Starting modprobe@efi_pstore.service - Load Kernel Module efi_pstore...
[    7.250387] pstore: Using crash dump compression: deflate
[    7.339502] pstore: Registered efi_pstore as persistent store backend
[    7.341034] systemd[1]: modprobe@efi_pstore.service: Deactivated successfully.
[    7.341380] systemd[1]: Finished modprobe@efi_pstore.service - Load Kernel Module efi_pstore.
[    7.341826] systemd[1]: systemd-pstore.service - Platform Persistent Storage Archival was skipped because of an unmet condition check (ConditionDirectoryNotEmpty=/sys/fs/pstore).

Did I miss some boot option?

[    0.000000] Command line: BOOT_IMAGE=/@/boot/vmlinuz-6.2.13-surface root=... ro rootflags=subvol=@ quiet splash resume=... vt.handoff=7
PhilDevProg commented 1 year ago

I found this after an hang on shutdown.

After the boot:

# dmesg |grep -i pstore
[    7.217006] systemd[1]: Starting modprobe@efi_pstore.service - Load Kernel Module efi_pstore...
[    7.250387] pstore: Using crash dump compression: deflate
[    7.339502] pstore: Registered efi_pstore as persistent store backend
[    7.341034] systemd[1]: modprobe@efi_pstore.service: Deactivated successfully.
[    7.341380] systemd[1]: Finished modprobe@efi_pstore.service - Load Kernel Module efi_pstore.
[    7.341826] systemd[1]: systemd-pstore.service - Platform Persistent Storage Archival was skipped because of an unmet condition check (ConditionDirectoryNotEmpty=/sys/fs/pstore).

Did I miss some boot option?

[    0.000000] Command line: BOOT_IMAGE=/@/boot/vmlinuz-6.2.13-surface root=... ro rootflags=subvol=@ quiet splash resume=... vt.handoff=7

@fvasco Did you update grub after adding the pstore boot parameters in /etc/default/grub? You can do that by running sudo update-grub

fvasco commented 1 year ago

Yes @PhilDevProg, I tried also pstore.backend=efi efi_pstore.pstore_disable=0 (link), same error: ConditionDirectoryNotEmpty=/sys/fs/pstore.

Yes, I checked the flag activation in dmesg.

PhilDevProg commented 1 year ago

Yes @PhilDevProg, I tried also pstore.backend=efi efi_pstore.pstore_disable=0 (link), same error: ConditionDirectoryNotEmpty=/sys/fs/pstore.

The error seems to suggest that the file/directory /sys/fs/pstore exists/isn't empty. You probably have to clean the directory/remove the file after every boot. So remove it now and see, if it works the next time the bug occurs. (If you reboot before the bug occurs, remove the file/directory after booting)

fvasco commented 1 year ago

@PhilDevProg, "/sys/fs/pstore" was empty, so the error. I don't know what should be in that folder. Unfortunately, I don't know what should be in that folder, or if the message is just an info and not an error. Every suggestion is welcome.

PhilDevProg commented 1 year ago

@PhilDevProg, "/sys/fs/pstore" was empty, so the error. I don't know what should be in that folder. Unfortunately, I don't know what should be in that folder, or if the message is just an info and not an error. Every suggestion is welcome.

Ah, okay. Maybe @StollD knows something?

StollD commented 1 year ago

pstore will only have data if there actually was a kernel crash during shutdown (and not some other quirk like some hardware timing out and not shutting down properly).

There is a way to trigger a crash to test if pstore works, but I don't really remember how to do that. Probably something with /proc/sysrq-trigger.

PhilDevProg commented 1 year ago

(and not some other quirk like some hardware timing out and not shutting down properly).

Is there a way to check if that's the case here?