linrunner / TLP

TLP - Optimize Linux Laptop Battery Life
https://linrunner.de/tlp
GNU General Public License v2.0

Kernel page fault on fresh tlp install #734

Closed: clinche closed this issue 6 months ago

clinche commented 6 months ago

[x] I've read and accepted the Bug Reporting Howto
[x] I've provided all required tlp-stat outputs via Gist (see below)

Describe the bug

On a fresh install of TLP on Arch Linux with the latest kernel and drivers, tlp start gets killed because of a kernel page fault.

Expected behavior

tlp start should not crash

To Reproduce

Steps to reproduce the unexpected behavior:

  1. Does the problem occur on battery or AC or both? Both
  2. Actions to reproduce the behaviour: install tlp and run tlp start
  3. Shell commands entered and their output:
    
    $~> sudo pacman -S tlp
    resolving dependencies...
    looking for conflicting packages...

    Packages (1) tlp-1.6.1-1

    Total Installed Size: 0.51 MiB

    :: Proceed with installation? [Y/n]
    (1/1) checking keys in keyring [####################################################################################################] 100%
    (1/1) checking package integrity [####################################################################################################] 100%
    (1/1) loading package files [####################################################################################################] 100%
    (1/1) checking for file conflicts [####################################################################################################] 100%
    (1/1) checking available disk space [####################################################################################################] 100%
    :: Processing package changes...
    (1/1) installing tlp [####################################################################################################] 100%
    Optional dependencies for tlp
        bash-completion: Bash completion
        ethtool: Disable Wake On Lan [installed]
        smartmontools: Display S.M.A.R.T. data in tlp-stat [installed]
        tp_smapi: Older ThinkPad battery functions (before Sandy Bridge)
    :: Running post-transaction hooks...
    (1/3) Reloading system manager configuration...
    (2/3) Reloading device manager configuration...
    (3/3) Arming ConditionNeedsUpdate...
    $~> sudo tlp start
    [1] 225571 killed sudo tlp start
    $~> sudo dmesg | tail -n 70
    [13139.815603] BUG: unable to handle page fault for address: 000000000000417b
    [13139.815609] #PF: supervisor read access in kernel mode
    [13139.815612] #PF: error_code(0x0000) - not-present page
    [13139.815613] PGD 0 P4D 0
    [13139.815616] Oops: 0000 [#11] PREEMPT SMP NOPTI
    [13139.815619] CPU: 5 PID: 225574 Comm: tlp Tainted: P D OE 6.8.1-zen1-1-zen #1 b323528be95a9fcd5f079f3701e4b81b6249e552
    [13139.815622] Hardware name: ASUSTeK COMPUTER INC. ASUS TUF Gaming F15 FX506HM_FX506HM/FX506HM, BIOS FX506HM.313 08/12/2022
    [13139.815624] RIP: 0010:simple_xattr_get+0x31/0xa0
    [13139.815630] Code: 00 00 41 56 49 89 ce 41 55 4c 8d 6f 08 41 54 49 89 d4 55 48 89 f5 53 48 89 fb 4c 89 ef e8 c7 87 af 00 48 8b 1b 48 85 db 74 1b <48> 8b 7b 18 48 89 ee e8 43 02 ac 00 85 c0 78 27 74 2b 48 8b 5b 08
    [13139.815633] RSP: 0018:ffffb56607997ba8 EFLAGS: 00010206
    [13139.815635] RAX: 0000000000001400 RBX: 0000000000004163 RCX: 0000000000000000
    [13139.815637] RDX: 0000000000000000 RSI: ffffffffab5c070f RDI: ffff8ebf43fa7310
    [13139.815638] RBP: ffffffffab5c070f R08: 0000000000000000 R09: 0000000000000000
    [13139.815640] R10: ffffffffab5c070f R11: ffff8ec1f6709440 R12: 0000000000000000
    [13139.815641] R13: ffff8ebf43fa7310 R14: 0000000000000000 R15: ffff8ebfcf5bae00
    [13139.815642] FS: 00007c8a5d3ebb80(0000) GS:ffff8ec6ab540000(0000) knlGS:0000000000000000
    [13139.815644] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [13139.815646] CR2: 000000000000417b CR3: 00000004c7d5a004 CR4: 0000000000f70ef0
    [13139.815648] PKRU: 55555554
    [13139.815649] Call Trace:
    [13139.815651]
    [13139.815654] ? __die+0x10f/0x120
    [13139.815657] ? page_fault_oops+0x171/0x4e0
    [13139.815660] ? sched_clock+0x10/0x30
    [13139.815663] ? ptep_set_access_flags+0x32/0x40
    [13139.815666] ? exc_page_fault+0x7f/0x180
    [13139.815669] ? asm_exc_page_fault+0x26/0x30
    [13139.815674] ? simple_xattr_get+0x31/0xa0
    [13139.815676] ? simple_xattr_get+0x29/0xa0
    [13139.815678] vfs_getxattr+0x7f/0xb0
    [13139.815681] cap_inode_need_killpriv+0x1e/0x30
    [13139.815684] security_inode_need_killpriv+0x2d/0x50
    [13139.815687] dentry_needs_remove_privs+0x32/0x60
    [13139.815690] do_truncate+0x70/0xf0
    [13139.815694] path_openat+0x1004/0x14c0
    [13139.815699] do_filp_open+0xb3/0x160
    [13139.815702] ? pfx_kfree_link+0x10/0x10
    [13139.815705] __x64_sys_openat+0x1d5/0x240
    [13139.815708] do_syscall_64+0x86/0x170
    [13139.815712] ? do_syscall_64+0x96/0x170
    [13139.815714] ? exc_page_fault+0x7f/0x180
    [13139.815717] entry_SYSCALL_64_after_hwframe+0x6e/0x76
    [13139.815720] RIP: 0033:0x7c8a5d55ed42
    [13139.815768] Code: 83 e2 40 75 53 89 f0 f7 d0 a9 00 00 41 00 74 48 80 3d 11 53 0e 00 00 74 6c 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 92 00 00 00 48 8b 54 24 28 64 48 2b 14 25
    [13139.815770] RSP: 002b:00007ffecb2f9650 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
    [13139.815772] RAX: ffffffffffffffda RBX: 0000000000000241 RCX: 00007c8a5d55ed42
    [13139.815773] RDX: 0000000000000241 RSI: 00005944ac5796f0 RDI: 00000000ffffff9c
    [13139.815775] RBP: 00005944ac5796f0 R08: 0000000000000000 R09: 0000000000000020
    [13139.815776] R10: 00000000000001b6 R11: 0000000000000202 R12: 0000000000000000
    [13139.815777] R13: 0000000000000003 R14: 00005944ac5796f0 R15: 00005944ab09d320
    [13139.815780]
    [13139.815780] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter br_netfilter bridge stp llc ccm uhid cmac algif_hash algif_skcipher af_alg bnep overlay btusb btrtl btintel btbcm btmtk bluetooth ecdh_generic crc16 qrtr snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_codec_hdmi intel_uncore_frequency intel_uncore_frequency_common joydev mousedev x86_pkg_temp_thermal intel_powerclamp mt7921e snd_hda_codec_realtek mt7921_common snd_hda_codec_generic coretemp mt792x_lib mt76_connac_lib kvm_intel hid_multitouch mt76
    [13139.815832] snd_hda_intel xe snd_intel_dspcfg kvm vfat snd_intel_sdw_acpi mac80211 processor_thermal_device_pci_legacy fat snd_hda_codec processor_thermal_device processor_thermal_wt_hint irqbypass processor_thermal_rfim snd_hda_core intel_rapl_msr r8169 iTCO_wdt processor_thermal_rapl rapl libarc4 snd_hwdep intel_pmc_bxt asus_nb_wmi intel_lpss_pci intel_rapl_common drm_gpuvm mei_hdcp mei_pxp iTCO_vendor_support ee1004 realtek snd_pcm asus_wmi intel_lpss drm_exec intel_cstate ucsi_acpi processor_thermal_wt_req nvidia_drm(POE) cfg80211 spi_nor ledtrig_audio gpu_sched snd_timer typec_ucsi mdio_devres processor_thermal_power_floor platform_profile mei_me intel_uncore snd i2c_i801 typec processor_thermal_mbox libphy mtd nvidia_modeset(POE) thunderbolt mei soundcore i2c_smbus rfkill intel_soc_dts_iosf idma64 wmi_bmof roles drm_suballoc_helper pcspkr intel_pmc_core i2c_hid_acpi i2c_hid int3403_thermal intel_vsec int340x_thermal_zone pmt_telemetry int3400_thermal pinctrl_tigerlake acpi_thermal_rel acpi_pad pmt_class
    [13139.815884] intel_hid asus_wireless sparse_keymap mac_hid vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) uinput pkcs8_key_parser nvidia_uvm(POE) crypto_user fuse loop nfnetlink ip_tables x_tables hid_generic usbhid dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod btrfs blake2b_generic crct10dif_pclmul libcrc32c crc32_pclmul crc32c_generic crc32c_intel xor polyval_clmulni raid6_pq polyval_generic gf128mul ghash_clmulni_intel serio_raw sha512_ssse3 atkbd sha256_ssse3 libps2 vivaldi_fmap sha1_ssse3 nvme aesni_intel crypto_simd nvme_core spi_intel_pci xhci_pci cryptd spi_intel nvme_auth xhci_pci_renesas i8042 serio vmwgfx vboxvideo drm_vram_helper drm_ttm_helper nvidia(POE) i915 i2c_algo_bit drm_buddy video wmi ttm intel_gtt drm_display_helper cec
    [13139.815937] CR2: 000000000000417b
    [13139.815939] ---[ end trace 0000000000000000 ]---
    [13139.815940] RIP: 0010:simple_xattr_get+0x31/0xa0
    [13139.815943] Code: 00 00 41 56 49 89 ce 41 55 4c 8d 6f 08 41 54 49 89 d4 55 48 89 f5 53 48 89 fb 4c 89 ef e8 c7 87 af 00 48 8b 1b 48 85 db 74 1b <48> 8b 7b 18 48 89 ee e8 43 02 ac 00 85 c0 78 27 74 2b 48 8b 5b 08
    [13139.815944] RSP: 0018:ffffb566207abaf8 EFLAGS: 00010206
    [13139.815946] RAX: 0000000000000000 RBX: 0000000000004163 RCX: 0000000000000000
    [13139.815947] RDX: 0000000000000000 RSI: ffffffffab5c070f RDI: ffff8ebf43fa7310
    [13139.815949] RBP: ffffffffab5c070f R08: 0000000000000000 R09: 0000000000000000
    [13139.815950] R10: ffffffffab5c070f R11: ffff8ebf53440de0 R12: 0000000000000000
    [13139.815951] R13: ffff8ebf43fa7310 R14: 0000000000000000 R15: ffff8ec1641f4e00
    [13139.815952] FS: 00007c8a5d3ebb80(0000) GS:ffff8ec6ab540000(0000) knlGS:0000000000000000
    [13139.815954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [13139.815955] CR2: 000000000000417b CR3: 00000004c7d5a004 CR4: 0000000000f70ef0
    [13139.815957] PKRU: 55555554
    [13139.815958] note: tlp[225574] exited with irqs disabled
    [13139.815959] note: tlp[225574] exited with preempt_count 1


  4. Full output of tlp-stat via https://gist.github.com/ for all matching cases of 1 (not as file attachment, no screenshots):

https://gist.github.com/clinche/8260fa6233cb878202c1114a8278fd8f#file-tlp-stat-battery-log
https://gist.github.com/clinche/d8f01e7722e0d37c1ae962f4e46aa3fb#file-tlp-stat-ac-log

linrunner commented 6 months ago

Clearly a kernel issue. Let's see if we can find out where TLP is pinching the kernel. Please activate trace mode via configuration

TLP_DEBUG="arg bat disk lock nm path pm ps rf run sysfs udev usb"

and show the output of

sudo tlp-stat -T

immediately after the fault occurs i.e. after tlp start.

P.S. Did you also try with linux-lts or the regular Arch Linux kernel?
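
The steps requested above can be sketched roughly as follows. This assumes a TLP version that reads drop-in files from /etc/tlp.d; the helper name enable_tlp_trace and the file name 99-trace.conf are hypothetical and chosen only for illustration. The TLP_DEBUG value is the one quoted in this comment.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that switches on TLP trace mode.
# $1 = configuration directory (assumed default: /etc/tlp.d).
enable_tlp_trace() {
    conf_dir="${1:-/etc/tlp.d}"
    printf '%s\n' \
        'TLP_DEBUG="arg bat disk lock nm path pm ps rf run sysfs udev usb"' \
        > "$conf_dir/99-trace.conf"
}

# Usage (as root): enable tracing, reproduce the fault, then show the trace.
#   enable_tlp_trace
#   tlp start
#   tlp-stat -T
```

Keeping the setting in a separate drop-in makes it easy to delete again once the trace has been collected.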

clinche commented 6 months ago

Thanks for reminding me, I'm daily driving linux-zen. After switching to the upstream linux kernel, TLP installed and ran just fine, and of course when I went back to linux-zen it also worked just fine.

I tinkered with the settings and it seems to be stable, with no errors so far, so maybe it was an environment issue. I guess I'll never know.

Anyway thanks for your time and your project, it looks great!

xoores commented 5 months ago

Ran into the same issue, here is the log if it helps:

12.04 00:43:02  tlp[29708]: parse_args4config: tlp start --
12.04 00:43:02  tlp[29708]: +++ start (1.6.1) ++++++++++++++++++++++++++++++++++++++++
12.04 00:43:02  tlp[29708]: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:/usr/lib/llvm/17/bin:/usr/lib/llvm/16/bin:/usr/lib/llvm/15/bin:/opt/nvidia-cg-toolkit/bin:/etc/eselect/wine/bin:/opt/gcc-arm-none-eabi/bin:/usr/lib64/opencascade/bin:/root/.local/bin:/root/bin
12.04 00:43:02  tlp[29708]: SHELL=/bin/bash; umask=0022
12.04 00:43:02  tlp[29708]: get_sys_power_supply(AC).ac_online: syspwr=0
12.04 00:43:02  tlp[29708]: clear_manual_mode
12.04 00:43:02  tlp[29708]: power_source=ac
12.04 00:43:02  tlp[29708]: manual_mode=none
12.04 00:43:02  tlp[29708]: power_mode=ac
12.04 00:43:02  tlp[29708]: lock_tlp().success
12.04 00:43:02  tlp[29708]: compare_and_save_power_state(0).equal
12.04 00:43:02  tlp[29708]: set_laptopmode(0): 0; rc=0
12.04 00:43:02  tlp[29708]: set_dirty_parms(0): 1500; ec=0
12.04 00:43:02  tlp[29708]: set_platform_profile(0).not_available
12.04 00:43:02  tlp[29708]: set_cpu_driver_opmode(0).not_configured
12.04 00:43:02  tlp[29708]: set_cpu_scaling_governor(0).not_configured
12.04 00:43:02  tlp[29708]: set_cpu_scaling_min_max_freq(0).not_configured
12.04 00:43:02  tlp[29708]: set_intel_cpu_perf_pct(0).min.not_configured
12.04 00:43:02  tlp[29708]: set_intel_cpu_perf_pct(0).max.not_configured
12.04 00:43:02  tlp[29708]: set_cpu_boost_all(0).intel_pstate: 1
12.04 00:43:02  tlp[29708]: set_cpu_dyn_boost(0).intel_pstate: 1; rc=0
12.04 00:43:02  tlp[29708]: set_cpu_perf_policy(0).epp.write_error: balance_performance /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference; rc=0
12.04 00:43:02  tlp[29708]: set_cpu_perf_policy(0).epp.write_error: balance_performance /sys/devices/system/cpu/cpu1/cpufreq/energy_performance_preference; rc=0
12.04 00:43:02  tlp[29708]: set_cpu_perf_policy(0).epp.write_error: balance_performance /sys/devices/system/cpu/cpu10/cpufreq/energy_performance_preference; rc=0
12.04 00:43:02  tlp[29708]: set_cpu_perf_policy(0).epp.write_error: balance_performance /sys/devices/system/cpu/cpu11/cpufreq/energy_performance_preference; rc=0
12.04 00:43:02  kernel: BUG: unable to handle page fault for address: 00000000000059fe
12.04 00:43:02  kernel: #PF: supervisor read access in kernel mode
12.04 00:43:02  kernel: #PF: error_code(0x0000) - not-present page
12.04 00:43:02  kernel: PGD 0 P4D 0
12.04 00:43:02  kernel: Oops: 0000 [#40] PREEMPT SMP NOPTI
12.04 00:43:02  kernel: CPU: 0 PID: 29708 Comm: tlp Tainted: P      D    O       6.7.3-gentoo #5
12.04 00:43:02  kernel: Hardware name: Dell Inc. Precision 7760/0KCD5R, BIOS 1.13.0 06/07/2022
12.04 00:43:02  kernel: RIP: 0010:simple_xattr_get+0x28/0xa0
12.04 00:43:02  kernel: Code: 90 90 41 56 49 89 ce 41 55 41 54 49 89 d4 55 48 89 f5 53 48 89 fb 4c 8d 6f 08 4c 89 ef e8 50 4a 1b 01 48 8b 1b 48 85 db 74 1b <48> 8b 7b 18 48 89 ee e8 bc 31 19 01 85 c0 78 2f 74 33 48 8b 5b 08
12.04 00:43:02  kernel: RSP: 0018:ffffa70747607be8 EFLAGS: 00010202
12.04 00:43:02  kernel: RAX: 0000000000000000 RBX: 00000000000059e6 RCX: 0000000000000000
12.04 00:43:02  kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
12.04 00:43:02  kernel: RBP: ffffffff8aa21e13 R08: 0000000000000000 R09: 0000000000000000
12.04 00:43:02  kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
12.04 00:43:02  kernel: R13: ffff94e440b59400 R14: 0000000000000000 R15: 0000000000000000
12.04 00:43:02  kernel: FS:  00007f5dc60a4b80(0000) GS:ffff95034fc00000(0000) knlGS:0000000000000000
12.04 00:43:02  kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
12.04 00:43:02  kernel: CR2: 00000000000059fe CR3: 0000000754618002 CR4: 0000000000772ef0
12.04 00:43:02  kernel: PKRU: 55555554
12.04 00:43:02  kernel: Call Trace:
12.04 00:43:02  kernel:  <TASK>
12.04 00:43:02  kernel:  ? __die+0x1a/0x70
12.04 00:43:02  kernel:  ? page_fault_oops+0x17c/0x4b0
12.04 00:43:02  kernel:  ? exc_page_fault+0x63/0x130
12.04 00:43:02  kernel:  ? asm_exc_page_fault+0x22/0x30
12.04 00:43:02  kernel:  ? simple_xattr_get+0x28/0xa0
12.04 00:43:02  kernel:  ? simple_xattr_get+0x20/0xa0
12.04 00:43:02  kernel:  __vfs_getxattr+0x76/0xc0
12.04 00:43:02  kernel:  cap_inode_need_killpriv+0x15/0x30
12.04 00:43:02  kernel:  security_inode_need_killpriv+0x24/0x40
12.04 00:43:02  kernel:  dentry_needs_remove_privs+0x2f/0x60
12.04 00:43:02  kernel:  do_truncate+0x67/0xf0
12.04 00:43:02  kernel:  path_openat+0xfd5/0x1260
12.04 00:43:02  kernel:  do_filp_open+0xaf/0x170
12.04 00:43:02  kernel:  ? __pfx_kfree_link+0x10/0x10
12.04 00:43:02  kernel:  do_sys_openat2+0xac/0xe0
12.04 00:43:02  kernel:  __x64_sys_openat+0x50/0xa0
12.04 00:43:02  kernel:  do_syscall_64+0x3f/0x100
12.04 00:43:02  kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0x76
12.04 00:43:02  kernel: RIP: 0033:0x7f5dc61e0cbe
12.04 00:43:02  kernel: Code: 83 e2 40 75 4f 89 f0 f7 d0 a9 00 00 41 00 74 44 80 3d b5 b6 0d 00 00 74 68 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 8e 00 00 00 48 8b 54 24 28 64 48 2b 14 25
12.04 00:43:02  kernel: RSP: 002b:00007fff6b030040 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
12.04 00:43:02  kernel: RAX: ffffffffffffffda RBX: 0000000000000241 RCX: 00007f5dc61e0cbe
12.04 00:43:02  kernel: RDX: 0000000000000241 RSI: 000055987193c9f0 RDI: 00000000ffffff9c
12.04 00:43:02  kernel: RBP: 000055987193c9f0 R08: 0000000000000000 R09: 0000000000000020
12.04 00:43:02  kernel: R10: 00000000000001b6 R11: 0000000000000202 R12: 0000000000000003
12.04 00:43:02  kernel: R13: 0000000000000000 R14: 000055987193c9f0 R15: 0000000000000001
12.04 00:43:02  kernel:  </TASK>
12.04 00:43:02  kernel: Modules linked in: uinput nvidia_uvm(PO) tun ch341 r8153_ecm r8152 bpfilter nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) i915 iwlmvm rtsx_pci_sdmmc kvm_intel serio_raw i2c_algo_bit drm_buddy iwlwifi vfio_pci rtsx_pci ttm e1000e vfio_pci_core drm_display_helper mei_pxp mei_hdcp
12.04 00:43:02  kernel: CR2: 00000000000059fe
12.04 00:43:02  kernel: ---[ end trace 0000000000000000 ]---
12.04 00:43:02  kernel: RIP: 0010:simple_xattr_get+0x28/0xa0
12.04 00:43:02  kernel: Code: 90 90 41 56 49 89 ce 41 55 41 54 49 89 d4 55 48 89 f5 53 48 89 fb 4c 8d 6f 08 4c 89 ef e8 50 4a 1b 01 48 8b 1b 48 85 db 74 1b <48> 8b 7b 18 48 89 ee e8 bc 31 19 01 85 c0 78 2f 74 33 48 8b 5b 08
12.04 00:43:02  kernel: RSP: 0018:ffffa707477efbe8 EFLAGS: 00010202
12.04 00:43:02  kernel: RAX: 0000000000000000 RBX: 00000000000059e6 RCX: 0000000000000000
12.04 00:43:02  kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
12.04 00:43:02  kernel: RBP: ffffffff8aa21e13 R08: 0000000000000000 R09: 0000000000000000
12.04 00:43:02  kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
12.04 00:43:02  kernel: R13: ffff94e440b59400 R14: 0000000000000000 R15: 0000000000000000
12.04 00:43:02  kernel: FS:  00007f5dc60a4b80(0000) GS:ffff95034fc00000(0000) knlGS:0000000000000000
12.04 00:43:02  kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
12.04 00:43:02  kernel: CR2: 00000000000059fe CR3: 0000000754618002 CR4: 0000000000772ef0
12.04 00:43:02  kernel: PKRU: 55555554
12.04 00:43:02  kernel: note: tlp[29708] exited with irqs disabled
12.04 00:43:02  kernel: note: tlp[29708] exited with preempt_count 1

Seems like cpufreq is not a happy camper for some reason. Will try to upgrade the kernel later & see if the issue goes away.
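
To compare reports like the two in this thread, a small sketch such as the one below can pull the faulting kernel function and the oops counter out of a saved log. The function names oops_rip and oops_count are hypothetical. The sketch assumes the log format shown above, where kernel-mode faults appear as "RIP: 0010:symbol+offset" and the bracketed number after "Oops:" is the kernel's running count of oopses since boot.

```shell
#!/bin/sh
# Hypothetical helpers; both read kernel log text from stdin.

# Print each distinct kernel-mode faulting symbol (code segment 0010).
oops_rip() {
    grep -oE 'RIP: 0010:[A-Za-z0-9_.]+' | sed 's/^RIP: 0010://' | LC_ALL=C sort -u
}

# Print the highest oops counter seen, taken from "Oops: 0000 [#N]".
oops_count() {
    grep -oE 'Oops: [0-9a-f]+ \[#[0-9]+\]' \
        | sed -E 's/.*\[#([0-9]+)\]/\1/' | sort -n | tail -n 1
}

# Usage:
#   journalctl -k -b > kernel.log
#   oops_rip < kernel.log
#   oops_count < kernel.log
```

Applied to the logs in this thread, both reports fault in simple_xattr_get, and the counters (#11 and #40) indicate that the kernel had already oopsed several times since boot before these tlp runs.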

clinche commented 5 months ago

@xoores

Although the kernel (linux and linux-zen, both latest) only crashed with tlp, which made me think tlp was the culprit, at one point systemd triggered the exact same kernel bug (with RIP: simple_xattr_get) when I tried to shut down, and the system froze. I thought it was a kernel issue, but TL;DR: I managed to trace it back to the nvidia driver, which apparently tickles cgroups a little too much, from what I've found on the net (but I can't manage to find the links again).

Installing nvidia-open-dkms instead of nvidia-dkms seems to have done the trick: my system is back to stable, with no kernel bugs for a week, and tlp works fine.
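
For anyone checking whether the same driver is in play, the sketch below lists the modules that carry taint flags in the "Modules linked in:" part of an oops. The function name tainting_modules is hypothetical. It relies only on the flag notation visible in the logs above, where P marks a proprietary module, O an out-of-tree module and E an unsigned module.

```shell
#!/bin/sh
# Hypothetical helper: read oops text from stdin and print every module
# that is followed by taint flags in parentheses, e.g. nvidia_drm(POE).
tainting_modules() {
    grep -oE '[A-Za-z0-9_]+\([A-Z]+\)' | LC_ALL=C sort -u
}

# Usage:
#   dmesg | tainting_modules
```

In both oops reports in this thread the nvidia modules show up with such flags, which fits the observation that changing the nvidia driver made the fault disappear.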

cc @linrunner for info