loongson-community / discussions

Cross-community issue tracker & discussions

[Linux] Random observed unaligned access and RCU stalls #34

Closed xen0n closed 3 months ago

xen0n commented 6 months ago
[36518.528465] Kernel ade access[#1]:
[36518.528474] CPU: 7 PID: 3510 Comm: node_exporter Not tainted 6.6.7-gentoo-dist #1 290a589f1b0484d0913dcf0dcf9f7484e7d1dd51
[36518.528479] Hardware name: Loongson Loongson-3A6000-HV-7A2000-1w-V0.1-EVB/Loongson-3A6000-HV-7A2000-1w-EVB-V1.21, BIOS Loongson-UDK2018-V4.0.05634-stable2
[36518.528481] pc 9000000002ae54c8 ra 9000000002ae6610 tp 900000010adc4000 sp 900000010adc7c00
[36518.528484] a0 900000000bc06140 a1 0000000000000000 a2 ffffffffffffeff8 a3 0000000000000000
[36518.528486] a4 0000000000000000 a5 0000000000000000 a6 0000000000000000 a7 0000000000000083
[36518.528488] t0 000a800d00000000 t1 000a800d00000000 t2 900000087fffc000 t3 0000000000000000
[36518.528490] t4 90000001003a7000 t5 0000000000000000 t6 0000000000000000 t7 0000000000000000
[36518.528492] t8 0000000000000000 u0 ffffffffffffffff s9 000000c0000004e0 s0 900000000bc06140
[36518.528494] s1 00000000000000b0 s2 9000000113d5f500 s3 0000000000003fa8 s4 90000001102ca000
[36518.528496] s5 0000000000000058 s6 9000000003ba0000 s7 0000000000000820 s8 0000000000000000
[36518.528498]    ra: 9000000002ae6610 refill_obj_stock+0x50/0x240
[36518.528504]   ERA: 9000000002ae54c8 drain_obj_stock+0xa8/0x300
[36518.528508]  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
[36518.528515]  PRMD: 00000000 (PPLV0 -PIE -PWE)
[36518.528520]  EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
[36518.528524]  ECFG: 00071c1c (LIE=2-4,10-12 VS=7)
[36518.528528] ESTAT: 00480000 [ADEM] (IS= ECode=8 EsubCode=1)
[36518.528532]  BADV: 000a800d00000890
[36518.528533]  PRID: 0014d000 (Loongson-64bit, Loongson-3A6000-HV)
[36518.528535] Modules linked in: amdgpu amdxcp drm_exec mfd_core gpu_sched drm_buddy drm_suballoc_helper drm_ttm_helper ttm drm_display_helper cec rc_core efi_pstore snd_hda_codec_realtek snd_hda_codec_generic pstore ledtrig_audio snd_hda_codec_hdmi spi_loongson_pci spi_loongson_core snd_hda_intel snd_intel_dspcfg snd_hda_codec nls_cp936 vfat fat snd_hda_core gpio_loongson_64bit snd_hwdep i2c_ls2x gpio_generic snd_pcm snd_timer rtc_loongson snd soundcore evdev wireguard libchacha20poly1305 libcurve25519_generic libchacha libpoly1305 bridge stp llc cfg80211 rfkill sch_fq_codel loop fuse nfnetlink nvme nvme_core xhci_pci igb nvme_common xhci_pci_renesas btrfs xor raid6_pq zlib_deflate dm_mirror dm_region_hash dm_log dm_mod dax efivarfs
[36518.528597] Process node_exporter (pid: 3510, threadinfo=000000002a30d58a, task=000000007ccc7fe1)
[36518.528600] Stack : 900000019543b380 0000000000003fa8 90000001103c9800 00000000000000b0
[36518.528605]         900000000bc06140 9000000002ae6610 0000000000000058 9000000002ae0970
[36518.528609]         0000000000000058 0000000000000058 0000000000000001 0000000000000000
[36518.528614]         90000001103c9800 9000000002aebec4 900000019543b380 0000000000000050
[36518.528618]         90000000027f96b0 9000000003ba0000 90000001103c9800 0000000000000820
[36518.528623]         9000000100017200 9000000002abcd4c 0000000000000000 19fa12c7429ba631
[36518.528627]         900000010adc7d80 0000000000000016 0000000000000016 90000001007b6a00
[36518.528632]         0000000000000017 0000000000000000 0000000000000820 0000000000000000
[36518.528636]         90000001954253c0 90000000027f96b0 900000010adc7e38 900000010adc7e38
[36518.528641]         00000000000000b4 9000000195425b40 0000000000000000 90000001954253c0
[36518.528646]         ...
[36518.528648] Call Trace:
[36518.528649] [<9000000002ae54c8>] drain_obj_stock+0xa8/0x300
[36518.528653] [<9000000002ae6610>] refill_obj_stock+0x50/0x240
[36518.528656] [<9000000002aebec4>] obj_cgroup_charge+0x144/0x2e0
[36518.528660] [<9000000002abcd4c>] kmem_cache_alloc+0xac/0x440
[36518.528663] [<90000000027f96b0>] __sigqueue_alloc+0x70/0x140
[36518.528668] [<90000000027fb648>] __send_signal_locked+0x288/0x440
[36518.528671] [<90000000027fd944>] do_send_sig_info+0xa4/0x1c0
[36518.528675] [<90000000027fdf84>] do_send_specific+0xc4/0x120
[36518.528677] [<90000000027fe078>] do_tkill+0x98/0x100
[36518.528679] [<900000000280174c>] sys_tgkill+0x2c/0x80
[36518.528682] [<90000000036bc928>] do_syscall+0x88/0xc0
[36518.528685] [<90000000027c14fc>] handle_syscall+0xbc/0x158

[36518.528689] Code: 0013b20c  0015358c  002d31ec <26089184> 03400000  28d2808c  5c01cdcc  03400000  03400000 

[36518.528700] ---[ end trace 0000000000000000 ]---
[36519.099450] pstore: backend (efi_pstore) writing error (-5)
[36539.940898] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[36539.940905] rcu:     7-...0: (0 ticks this GP) idle=8794/1/0x4000000000000000 softirq=2324040/2324040 fqs=5235
[36539.940911] rcu:     (detected by 5, t=21002 jiffies, g=3029045, q=10 ncpus=8)
[36539.940915] Sending NMI from CPU 5 to CPUs 7:
[36549.941032] rcu: rcu_sched kthread starved for 9995 jiffies! g3029045 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=0
[36549.941036] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[36549.941039] rcu: RCU grace-period kthread stack dump:
[36549.941041] task:rcu_sched       state:R  running task     stack:0     pid:16    ppid:2      flags:0x00000804
[36549.941048] Stack : 0000000000000000 0000000000000000 00000000000000b0 0000000000000004
[36549.941055]         0000000000000040 0000000000000040 0000000000000000 0000000000000000
[36549.941061]         0000000000000000 0000000000000000 0000000000000000 9000000003c88c98
[36549.941068]         9000000004012880 0000000000000000 9000000003774ac0 0000000000000001
[36549.941074]         000000000000003f 0000000000000001 900000000289e4c0 9000000003b9ff58
[36549.941080]         900000000bc08ac0 0000000000000040 0000000000000000 900000000289fa24
[36549.941086]         9000000003b96000 00000000000000b0 0000000000000004 0000000000000000
[36549.941092]         0000000000071c1d 0000000000001808 00000000000000b4 9000000003ba0288
[36549.941098]         9000000004013880 9000000004012880 0000000000000000 9000000003c88c98
[36549.941104]         9000000004013000 0000000000000000 9000000004013558 90000001004ffd90
[36549.941110]         ...
[36549.941113] Call Trace:
[36549.941114] [<90000000036c68d0>] __schedule+0xab0/0x1620

[36549.941124] rcu: Stack dump where RCU GP kthread last ran:
[36549.941126] Sending NMI from CPU 5 to CPUs 0:
[36549.941154] NMI backtrace for cpu 0
[36549.941160] CPU: 0 PID: 16 Comm: rcu_sched Tainted: G      D            6.6.7-gentoo-dist #1 290a589f1b0484d0913dcf0dcf9f7484e7d1dd51
[36549.941165] Hardware name: Loongson Loongson-3A6000-HV-7A2000-1w-V0.1-EVB/Loongson-3A6000-HV-7A2000-1w-EVB-V1.21, BIOS Loongson-UDK2018-V4.0.05634-stable2
[36549.941167] pc 9000000002ac3dd4 ra 9000000002ac3da4 tp 90000001004fc000 sp 90000001001c7880
[36549.941171] a0 0000000000000000 a1 0000000000000000 a2 0000000000000000 a3 0000000000000000
[36549.941174] a4 0000000000000000 a5 0000000000000000 a6 0000000000000000 a7 0000000000000000
[36549.941177] t0 00000000000000b0 t1 0000000000000004 t2 0000000000000000 t3 0000000000000000
[36549.941180] t4 0000000000000000 t5 0000000000000000 t6 0000000000000000 t7 0000000000000000
[36549.941183] t8 0000000000000000 u0 00000000000000b4 s9 90000006e099d018 s0 90000000022d1e08
[36549.941186] s1 0000000000000000 s2 00000000000000b4 s3 fffffffffe190000 s4 0000000000000000
[36549.941188] s5 9000000003ba0000 s6 0000000000000100 s7 fffffffffe190000 s8 90000006e099d022
[36549.941190]    ra: 9000000002ac3da4 kfence_guarded_free+0x124/0x3a0
[36549.941195]   ERA: 9000000002ac3dd4 kfence_guarded_free+0x154/0x3a0
[36549.941198]  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
[36549.941206]  PRMD: 00000004 (PPLV0 +PIE -PWE)
[36549.941210]  EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
[36549.941214]  ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
[36549.941218] ESTAT: 00001000 [INT] (IS=12 ECode=0 EsubCode=0)
[36549.941222]  PRID: 0014d000 (Loongson-64bit, Loongson-3A6000-HV)
[36549.941224] CPU: 0 PID: 16 Comm: rcu_sched Tainted: G      D            6.6.7-gentoo-dist #1 290a589f1b0484d0913dcf0dcf9f7484e7d1dd51
[36549.941227] Hardware name: Loongson Loongson-3A6000-HV-7A2000-1w-V0.1-EVB/Loongson-3A6000-HV-7A2000-1w-EVB-V1.21, BIOS Loongson-UDK2018-V4.0.05634-stable2
[36549.941229] Stack : 0000000000000004 0000000000000000 90000000027c39e4 90000001004fc000
[36549.941234]         90000001001c73d0 90000001001c73d8 0000000000000000 90000001001c7518
[36549.941239]         90000001001c7510 90000001001c7510 0000000000000000 0000000000000000
[36549.941243]         0000000000000000 90000001001c73d8 19fa12c7429ba631 0000000000000000
[36549.941249]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
[36549.941253]         0000000000000000 0000000000000000 0000000006894000 90000006e099d018
[36549.941257]         0000000000000000 0000000000000000 9000000003a14118 9000000003ba0000
[36549.941262]         9000000003ba8668 90000001001c7740 0000000000000000 0000000000000000
[36549.941267]         90000006e099d022 0000000000000000 90000000027c3a04 000000c000548000
[36549.941271]         00000000000000b0 0000000000000004 0000000000000000 0000000000071c1d
[36549.941276]         ...
[36549.941278] Call Trace:
[36549.941279] [<90000000027c3a04>] show_stack+0x64/0x1c0
[36549.941283] [<90000000036bb288>] dump_stack_lvl+0x78/0xb0
[36549.941286] [<900000000368b908>] nmi_cpu_backtrace+0x188/0x1a0
[36549.941291] [<90000000027c4010>] handle_backtrace+0x10/0x60
[36549.941294] [<90000000028e196c>] __flush_smp_call_function_queue+0x10c/0x360
[36549.941297] [<90000000027d0534>] loongson_ipi_interrupt+0x94/0x100
[36549.941300] [<90000000028801d0>] __handle_irq_event_percpu+0x50/0x160
[36549.941304] [<90000000028802f8>] handle_irq_event_percpu+0x18/0x80
[36549.941308] [<9000000002888f98>] handle_percpu_irq+0x58/0xc0
[36549.941311] [<900000000287eaa8>] generic_handle_domain_irq+0x28/0x60
[36549.941314] [<9000000002e7c2f8>] handle_cpu_irq+0x78/0xc0
[36549.941318] [<90000000036bb6b0>] handle_loongarch_irq+0x30/0x60
[36549.941321] [<90000000036bb7a8>] do_vint+0xc8/0x100
[36549.941325] [<900000000289fa24>] force_qs_rnp+0xc4/0x340
[36549.941329] [<90000000028a18ec>] rcu_gp_fqs_loop+0x34c/0x600
[36549.941332] [<90000000028a3ac4>] rcu_gp_kthread+0x164/0x1a0
[36549.941335] [<900000000281b4a0>] kthread+0x100/0x120
[36549.941338] [<90000000027c1628>] ret_from_kernel_thread+0xc/0xa4

Pending investigation.
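
For anyone cross-checking the register dumps in this thread, here is a minimal sketch that splits an ESTAT value into its fields. It assumes the field layout from the LoongArch reference manual (IS in bits 12:0, Ecode in bits 21:16, EsubCode in bits 30:22). The helper name `decode_estat` is made up for illustration and is not a kernel or distro tool. Note that it prints IS as a raw bit mask, whereas the kernel prints the indices of the set bits.

```shell
#!/bin/sh
# Hypothetical helper: decode a LoongArch ESTAT value given as hex digits
# without the 0x prefix, exactly as it appears in the oops output.
decode_estat() {
    v=$(( 0x$1 ))
    is=$(( v & 0x1fff ))            # IS: pending interrupt bits 12:0
    ecode=$(( (v >> 16) & 0x3f ))   # Ecode: bits 21:16
    esub=$(( (v >> 22) & 0x1ff ))   # EsubCode: bits 30:22
    printf 'IS=0x%04x ECode=%d EsubCode=%d\n' "$is" "$ecode" "$esub"
}

decode_estat 00480000   # ESTAT from the drain_obj_stock oops (ADEM)
decode_estat 00001000   # ESTAT from the NMI backtrace (INT, IS bit 12)
```

The first call yields ECode=8 with EsubCode=1, which matches the `[ADEM]` annotation the kernel printed. The second shows only IS bit 12 set, which matches `IS=12` in the backtrace.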

xen0n commented 6 months ago

The issue seems likely to be related to KFENCE. I'm rebuilding my 3A6000 kernel with KFENCE_SAMPLE_INTERVAL=0 (disabled by default at run time) to see if stability is restored.

xen0n commented 6 months ago

I've got a backtrace on an A2101 board (3A5000 + 7A1000) from the first time the oops occurred:

[51136.914980] CPU 3 Unable to handle kernel paging request at virtual address ffff800002474000, era == 90000000021f2160, ra == 90000000021f2138
[51136.927629] Oops[#1]:
[51136.929882] CPU: 3 PID: 878 Comm: jbd2/nvme0n1p5- Tainted: G           OE      6.6.9-gentoo-dist #1 2d35abbc4d75310e39af2a333fb880e2a8e5939a
[51136.942418] Hardware name: Loongson Loongson-3A5000-7A1000-1w-A2101/Loongson-LS3A5000-7A1000-1w-A2101, BIOS vUDK2018-LoongArch-V4.0.05132-beta10 12/13/202
[51136.956160] pc 90000000021f2160 ra 90000000021f2138 tp 90000001158b4000 sp 90000001001aba40
[51136.964460] a0 0000000000000001 a1 0000000000000000 a2 0000000000000000 a3 0000000000000000
[51136.972763] a4 0000000000000000 a5 0000000000000000 a6 0000000000000000 a7 0000000000000000
[51136.981071] t0 ffff800002473ffe t1 0000000000000000 t2 0000000000000000 t3 0000000002c00063
[51136.989380] t4 00000000000007ff t5 0000000000000000 t6 0000000000000000 t7 0000000000000000
[51136.997682] t8 0000000000000000 u0 0000000000000000 s9 90000001001abca0 s0 9000000003b82990
[51137.005990] s1 9000000003190f60 s2 90000000035b0000 s3 90000001001abac0 s4 90000000035b0000
[51137.014296] s5 9000000003b60000 s6 90000001001abab8 s7 90000001002aba50 s8 ffff8000024c4228
[51137.022597]    ra: 90000000021f2138 unwind_next_frame+0xd8/0x740
[51137.028569]   ERA: 90000000021f2160 unwind_next_frame+0x100/0x740
[51137.034624]  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
[51137.040774]  PRMD: 00000000 (PPLV0 -PIE -PWE)
[51137.045101]  EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
[51137.049861]  ECFG: 00071814 (LIE=2,4,11-12 VS=7)
[51137.054448] ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0)
[51137.059897]  BADV: ffff800002474000
[51137.063357]  PRID: 0014c010 (Loongson-64bit, Loongson-3A5000)
[51137.069064] Modules linked in: la_ow_syscall(OE) snd_seq_dummy snd_hrtimer snd_seq snd_seq_device joydev mousedev hid_multitouch usbhid ch341 tun amdgpu amdxcp drm_exec mfd_core gpu_sched drm_buddy drm_suballoc_helper drm_ttm_helper drm_display_helper cec rc_core spi_loongson_pci spi_loongson_core snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec ipmi_ssif snd_hda_core snd_hwdep acpi_ipmi snd_pcm ipmi_si gpio_loongson_64bit snd_timer gpio_generic i2c_ls2x ipmi_devintf rtc_loongson snd loongson ipmi_msghandler soundcore ttm nls_cp936 vfat fat evdev wireguard libchacha20poly1305 libcurve25519_generic libchacha libpoly1305 cfg80211 rfkill sch_fq_codel loop fuse efi_pstore pstore nfnetlink ext4 mbcache jbd2 nvme nvme_core nvme_common r8169 dwmac_loongson stmmac pcs_xpcs xhci_pci xhci_pci_renesas phylink btrfs xor raid6_pq zlib_deflate dm_mirror dm_region_hash dm_log dm_mod dax pkcs8_key_parser efivarfs
[51137.154539] Process jbd2/nvme0n1p5- (pid: 878, threadinfo=000000003ad9e375, task=00000000e3c0a5e8)
[51137.163444] Stack : 9000000009807d00 0000000000052dba 00000000000529c2 57a60d166999aba8
[51137.171412]         0000000000003000 0000000000000000 90000000022dd364 9000000115a33e40
[51137.179374]         90000001001abb08 90000000035b0000 90000001001abca0 90000000022dd180
[51137.187335]         90000001001abab8 90000000021ef134 00000000000001af 0000000000000001
[51137.195297]         0000000000000002 90000001158b4000 90000001158b8000 0000000000000000
[51137.203258]         9000000115a33e40 0000000000000001 90000001158b7b40 ffff8000024c4228
[51137.211218]         ffff8000024c6778 0000000000000000 0000000000000000 0000000000000000
[51137.219179]         90000001001abca0 0000000000000000 0000000000000000 0000000000000000
[51137.227142]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
[51137.235100]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
[51137.243063]         ...
[51137.245487] Call Trace:
[51137.245488] [<90000000021f2160>] unwind_next_frame+0x100/0x740
[51137.253704] [<90000000021ef134>] arch_stack_walk+0xd4/0x1a0
[51137.259239] [<90000000022dd364>] stack_trace_save+0x44/0xa0
[51137.264774] [<90000000024e25d0>] metadata_update_state+0xf0/0x140
[51137.270828] [<90000000024e39c4>] kfence_guarded_free+0x124/0x3a0
[51137.276795] [<ffff8000025ac7e4>] ext4_end_bio+0x44/0x1c0 [ext4]
[51137.282721] [<9000000002745e64>] blk_mq_end_request_batch+0x3e4/0x720
[51137.289123] [<ffff800002462e10>] nvme_irq+0x90/0xae0 [nvme]
[51137.294664] [<90000000022a02d0>] __handle_irq_event_percpu+0x50/0x160
[51137.301065] [<90000000022a04a4>] handle_irq_event+0x44/0x100
[51137.306687] [<90000000022a8138>] handle_edge_irq+0xf8/0x3a0
[51137.312221] [<900000000229eba8>] generic_handle_domain_irq+0x28/0x60
[51137.318534] [<900000000289c608>] eiointc_irq_dispatch+0xa8/0x1c0
[51137.324502] [<900000000229eba8>] generic_handle_domain_irq+0x28/0x60
[51137.330814] [<900000000289b7b8>] handle_cpu_irq+0x78/0xc0
[51137.336177] [<90000000030d9730>] handle_loongarch_irq+0x30/0x60
[51137.342057] [<90000000030d97ec>] do_vint+0x8c/0x100
[51137.346901] [<ffff8000024c4228>] __kstrtabns_jbd2_journal_put_journal_head+0x529c2/0x52dba [jbd2]
[51137.355737] CPU 3 Unable to handle kernel paging request at virtual address ffff800002474000, era == 90000000021f2160, ra == 90000000021f2138
[51160.505705] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[51160.511606] rcu:     3-...0: (9 ticks this GP) idle=b574/1/0x4000000000000000 softirq=1625364/1625366 fqs=2877
[51160.521291] rcu:     (detected by 1, t=21020 jiffies, g=2940689, q=838 ncpus=4)
[51160.528294] Sending NMI from CPU 1 to CPUs 3:
[51170.532823] rcu: rcu_sched kthread starved for 16905 jiffies! g2940689 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[51170.543110] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[51170.552012] rcu: RCU grace-period kthread stack dump:
[51170.557027] task:rcu_sched       state:R  running task     stack:0     pid:16    ppid:2      flags:0x00000800
[51170.566886] Stack : 0000000000000000 00002e817b633bc2 90000001002eade8 90000000030e54c8
[51170.574850]         0000000000000003 9000000003183d00 9000000003698c90 0000000000000000
[51170.582812]         9000000003a23a98 90000000035b0000 90000000035b0000 000000010307f74a
[51170.590776]         0000000000000001 0000000000000000 9000000003191e20 0000000000000000
[51170.598740]         0000000000000002 57a60d166999aba8 000000010307f74b 57a60d166999aba8
[51170.606700]         9000000003698c90 0000000000000000 9000000003a23a98 90000001003ebd90
[51170.614664]         90000000035a6000 90000001003ebd18 90000000035b0000 90000001002ea780
[51170.622625]         0000000000000003 90000000030e54c8 000000010307f74a 90000000030ed234
[51170.630590]         9000000003a23dc0 0000000000000122 0000000000000000 000000010307f74a
[51170.638549]         90000000022df940 0000000002c00001 90000001002ea780 57a60d166999aba8
[51170.646510]         ...
[51170.648934] Call Trace:
[51170.648935] [<90000000030e48f0>] __schedule+0xab0/0x1620
[51170.656637] [<90000000030e54c8>] schedule+0x68/0xe0
[51170.661482] [<90000000030ed234>] schedule_timeout+0x94/0x160
[51170.667104] [<90000000022c16d4>] rcu_gp_fqs_loop+0x114/0x5c0
[51170.672727] [<90000000022c3a64>] rcu_gp_kthread+0x164/0x1a0
[51170.678262] [<900000000223b320>] kthread+0x100/0x120
[51170.683194] [<90000000021e15e8>] ret_from_kernel_thread+0xc/0xa4
[51170.689160]
[51170.690631] rcu: Stack dump where RCU GP kthread last ran:
[51170.696077] Sending NMI from CPU 1 to CPUs 0:
[51170.700402] NMI backtrace for cpu 0
[51170.703869] CPU: 0 PID: 1716 Comm: node_exporter Tainted: G           OE      6.6.9-gentoo-dist #1 2d35abbc4d75310e39af2a333fb880e2a8e5939a
[51170.716318] Hardware name: Loongson Loongson-3A5000-7A1000-1w-A2101/Loongson-LS3A5000-7A1000-1w-A2101, BIOS vUDK2018-LoongArch-V4.0.05132-beta10 12/13/202
[51170.730059] pc 9000000002301d20 ra 9000000002301eb0 tp 900000011fe90000 sp 900000011fe93a50
[51170.738363] a0 0000000000000000 a1 0000000000000000 a2 0000000000000000 a3 0000000000000000
[51170.746664] a4 0000000000000000 a5 0000000000000000 a6 0000000000000000 a7 0000000000000000
[51170.754966] t0 0000000000000001 t1 900000000980db40 t2 0000000000000000 t3 0000000000000003
[51170.763272] t4 90000000035aff58 t5 0000000000000001 t6 0000000000000040 t7 0000000000000000
[51170.771572] t8 0000000000000000 u0 0000000000000001 s9 9000000008008d40 s0 00000000000000b0
[51170.779878] s1 900000011fe93b00 s2 900000011fe93c28 s3 900000011fe93c50 s4 000000c0004c4000
[51170.788182] s5 0000000000000000 s6 90000000035b0000 s7 9000000110755e80 s8 000000c000800000
[51170.796488]    ra: 9000000002301eb0 smp_call_function_many_cond+0x2d0/0x440
[51170.803407]   ERA: 9000000002301d20 smp_call_function_many_cond+0x140/0x440
[51170.810325]  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
[51170.816473]  PRMD: 00000004 (PPLV0 +PIE -PWE)
[51170.820799]  EUEN: 00000001 (+FPE -SXE -ASXE -BTE)
[51170.825558]  ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
[51170.830317] ESTAT: 00001000 [INT] (IS=12 ECode=0 EsubCode=0)
[51170.835939]  PRID: 0014c010 (Loongson-64bit, Loongson-3A5000)
[51170.841645] CPU: 0 PID: 1716 Comm: node_exporter Tainted: G           OE      6.6.9-gentoo-dist #1 2d35abbc4d75310e39af2a333fb880e2a8e5939a
[51170.854091] Hardware name: Loongson Loongson-3A5000-7A1000-1w-A2101/Loongson-LS3A5000-7A1000-1w-A2101, BIOS vUDK2018-LoongArch-V4.0.05132-beta10 12/13/202
[51170.867832] Stack : 0000000000000004 0000000000000000 90000000021e39a4 900000011fe90000
[51170.875789]         900000010019fcb0 900000010019fcb8 0000000000000000 900000010019fdf8
[51170.883753]         900000010019fdf0 900000010019fdf0 0000000000000000 0000000000000000
[51170.891714]         0000000000000000 900000010019fcb8 57a60d166999aba8 0000000000000000
[51170.899672]         0000000000000000 0000000000000000 0000000000000000 0000000000000000
[51170.907634]         0000000000000000 0000000000000000 0000000004e84000 9000000008008d40
[51170.915597]         0000000000000000 0000000000000000 9000000003424298 90000000035b0000
[51170.923559]         90000000035b8668 900000011fe93910 0000000000000000 0000000000000000
[51170.931518]         000000c000800000 0000000000000000 90000000021e39c4 000000c000508000
[51170.939480]         00000000000000b0 0000000000000004 0000000000000001 0000000000071c1d
[51170.947442]         ...
[51170.949865] Call Trace:
[51170.949866] [<90000000021e39c4>] show_stack+0x64/0x1c0
[51170.957389] [<90000000030d9308>] dump_stack_lvl+0x78/0xb0
[51170.962751] [<90000000030a9948>] nmi_cpu_backtrace+0x188/0x1a0
[51170.968547] [<90000000021e3fd0>] handle_backtrace+0x10/0x60
[51170.974081] [<90000000023016cc>] __flush_smp_call_function_queue+0x10c/0x360
[51170.981083] [<90000000021f0454>] loongson_ipi_interrupt+0x94/0x100
[51170.987225] [<90000000022a02d0>] __handle_irq_event_percpu+0x50/0x160
[51170.993626] [<90000000022a03f8>] handle_irq_event_percpu+0x18/0x80
[51170.999765] [<90000000022a9098>] handle_percpu_irq+0x58/0xc0
[51171.005385] [<900000000229eba8>] generic_handle_domain_irq+0x28/0x60
[51171.011697] [<900000000289b7b8>] handle_cpu_irq+0x78/0xc0
[51171.017062] [<90000000030d9730>] handle_loongarch_irq+0x30/0x60
[51171.022942] [<90000000030d97ec>] do_vint+0x8c/0x100
[51171.027785] [<9000000002301d20>] smp_call_function_many_cond+0x140/0x440
[51171.034443] [<90000000023020bc>] on_each_cpu_cond_mask+0x1c/0x40
[51171.040408] [<90000000021f0e78>] flush_tlb_range+0x78/0x180
[51171.045943] [<9000000002478038>] tlb_flush+0x58/0xc0
[51171.050872] [<90000000024786c8>] tlb_finish_mmu+0xe8/0x160
[51171.056319] [<90000000024639b8>] zap_page_range_single+0x138/0x240
[51171.062460] [<90000000024a1ea4>] madvise_vma_behavior+0x624/0xac0
[51171.068514] [<900000000249f438>] madvise_walk_vmas+0xb8/0x1e0
[51171.074221] [<90000000024a2588>] do_madvise+0x148/0x200
[51171.079410] [<90000000024a2980>] sys_madvise+0x20/0x40

xry111 commented 6 months ago

Hmm, "percpu" reminds me of the extreme code model. Could it be the notorious extreme code model issue?

heiher commented 6 months ago

Is it time to add some validation for extreme relocations in the kernel?

xen0n commented 6 months ago

> Hmm, "percpu" reminds me of the extreme code model. Could it be the notorious extreme code model issue?

Probably not. My 3A6000 machine has been working perfectly for 2 days since I booted with kfence.sample_interval=0 on the kernel command line. Previously it would barely survive an hour, so it's 99% the KFENCE implementation's fault.
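
To confirm that the command-line setting actually took effect, the run-time value can be read back from sysfs. Below is a minimal sketch, assuming KFENCE is built in and exposes its `sample_interval` parameter under `/sys/module/kfence/parameters/`. The function name `kfence_status` is hypothetical, and the optional path argument exists only so the check can be tried against an ordinary file.

```shell
#!/bin/sh
# Hypothetical helper: report whether KFENCE sampling is active.
# A sample interval of 0 means KFENCE is disabled.
kfence_status() {
    param=${1:-/sys/module/kfence/parameters/sample_interval}
    if [ ! -r "$param" ]; then
        # Kernel built without KFENCE, or sysfs not mounted.
        echo "unavailable"
        return 0
    fi
    interval=$(cat "$param")
    if [ "$interval" -eq 0 ]; then
        echo "disabled"
    else
        echo "enabled (sample interval ${interval} ms)"
    fi
}

kfence_status
```

On a machine booted with kfence.sample_interval=0 this should report "disabled".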

xry111 commented 6 months ago

> > Hmm, "percpu" reminds me of the extreme code model. Could it be the notorious extreme code model issue?

> Probably not. My 3A6000 machine has been working perfectly for 2 days since I booted with kfence.sample_interval=0 on the kernel command line. Previously it would barely survive an hour, so it's 99% the KFENCE implementation's fault.

I mean, maybe kfence is using some per-cpu variable and then blows up?

xen0n commented 6 months ago

> > > Hmm, "percpu" reminds me of the extreme code model. Could it be the notorious extreme code model issue?

> > Probably not. My 3A6000 machine has been working perfectly for 2 days since I booted with kfence.sample_interval=0 on the kernel command line. Previously it would barely survive an hour, so it's 99% the KFENCE implementation's fault.

> I mean, maybe kfence is using some per-cpu variable and then blows up?

Hmm, then I'll have to check the asm later...

xry111 commented 3 months ago

> The issue seems likely to be related to KFENCE. I'm rebuilding my 3A6000 kernel with KFENCE_SAMPLE_INTERVAL=0 (disabled by default at run time) to see if stability is restored.

Guenter Roeck identified some issues caused by KFENCE: https://lore.kernel.org/loongarch/c352829b-ed75-4ffd-af6e-0ea754e1bf3d@roeck-us.net/

Not sure if it's exactly the same issue though.

jiegec commented 3 months ago

Regarding percpu, maybe the recent fix is related?

heiher commented 3 months ago

https://lore.kernel.org/loongarch/20240404133642.971583-1-chenhuacai@loongson.cn/

heiher commented 3 months ago

How to reproduce quickly:

Kconfig:

CONFIG_KFENCE=y
CONFIG_KFENCE_SAMPLE_INTERVAL=1
CONFIG_KFENCE_NUM_OBJECTS=65535

WARNING: The following steps will result in data loss.

while true; do
    blkdiscard -f /dev/nvme0n1p1 2>/dev/null
done
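
Before running the destructive loop above, it may help to verify that the running kernel was really built with the three options listed. The following is a non-destructive sketch. It assumes the distro installs the kernel config at `/boot/config-$(uname -r)`, which not every system does; a config extracted from `/proc/config.gz` works equally well. The function name `check_kfence_config` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: check a kernel config file for the reproducer's options.
check_kfence_config() {
    cfg=$1
    for opt in CONFIG_KFENCE=y \
               CONFIG_KFENCE_SAMPLE_INTERVAL=1 \
               CONFIG_KFENCE_NUM_OBJECTS=65535; do
        # -x requires the whole line to match, so ...INTERVAL=100 is not accepted.
        if grep -qx "$opt" "$cfg"; then
            echo "ok: $opt"
        else
            echo "missing: $opt"
        fi
    done
}

# Assumed config location; adjust for your distro.
cfg_path="/boot/config-$(uname -r)"
if [ -r "$cfg_path" ]; then
    check_kfence_config "$cfg_path"
fi
```

Any "missing" line means the kernel differs from the reproducer's configuration, so a failure to trigger the oops would not be conclusive.
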

xen0n commented 3 months ago

Resolved in Linux v6.9-rc4.