loongson / Firmware

Firmware Of LoongArch Machines

CPU lockup on the XA61200 motherboard with an RX550 discrete GPU #83

Open Fearyncess opened 6 months ago

Fearyncess commented 6 months ago

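Trigger condition: with no extra amdgpu-related parameters set, use the VAAPI hardware-decode interface provided by the amdgpu driver from within Firefox and play any high-bitrate H.264 video (within the capability of the RX550's hardware decode unit) continuously for a while (anywhere from 3 to 10 minutes). After forcing the card to use only PCIe 2.0 speed with the amdgpu.pcie_gen_cap=0x00020002 parameter, the problem no longer occurs.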
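
A minimal sketch of how the 0x00020002 value can be read, assuming the bit layout I recall from the kernel's amd_pcie.h (upper 16 bits for the platform side, lower 16 bits for the GPU side, one bit per PCIe generation). The helper name and the table are illustrative; check the header in your own kernel tree before relying on them.

```python
# Bit-per-generation table; an assumption based on amd_pcie.h, not taken
# from this thread. Verify against your kernel source.
GEN_BITS = {1: 0x1, 2: 0x2, 3: 0x4, 4: 0x8}


def decode_pcie_gen_cap(mask):
    """Split a pcie_gen_cap mask into platform-side and GPU-side generations."""
    platform = (mask >> 16) & 0xFFFF  # upper half: speeds allowed on the platform side
    asic = mask & 0xFFFF              # lower half: speeds allowed on the GPU side
    return {
        "platform": [gen for gen, bit in GEN_BITS.items() if platform & bit],
        "asic": [gen for gen, bit in GEN_BITS.items() if asic & bit],
    }


if __name__ == "__main__":
    print(decode_pcie_gen_cap(0x00020002))
```

Under that assumption, the value used above advertises Gen2 only on both sides, which matches the described effect of pinning the link to PCIe 2.0.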

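Symptoms: the graphical interface freezes, one CPU core locks up, the watchdog trips, and neither keyboard nor mouse input gets any response.

Affected firmware: UDK2018_3A6000-7A2000_Desktop_EVB_V4.0.05636-stable202311_support_fastboot_rel.fd

How to reproduce:

journalctl log at the time of the lockup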


Dec 13 00:34:21 Misha kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1061579, emitted seq=1061580
Dec 13 00:34:21 Misha kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 917 thread Xorg:cs0 pid 924
Dec 13 00:34:21 Misha kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Dec 13 00:34:32 Misha kernel: watchdog: Watchdog detected hard LOCKUP on cpu 2
Dec 13 00:34:32 Misha kernel: Modules linked in: qrtr snd_hda_codec_conexant snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep kvm spi_>
Dec 13 00:34:32 Misha kernel: Sending NMI from CPU 1 to CPUs 2:
Dec 13 00:34:32 Misha kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Dec 13 00:34:32 Misha kernel: rcu:         2-...!: (0 ticks this GP) idle=6388/0/0x0 softirq=494920/494920 fqs=0 (false positive?)
Dec 13 00:34:32 Misha kernel: rcu:         3-...!: (1 ticks this GP) idle=bb6c/1/0x4000000000000000 softirq=493654/493655 fqs=0
Dec 13 00:34:32 Misha kernel: rcu:         (detected by 6, t=21025 jiffies, g=1352673, q=13 ncpus=8)
Dec 13 00:34:32 Misha kernel: rcu: rcu_preempt kthread timer wakeup didn't happen for 21032 jiffies! g1352673 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
Dec 13 00:34:32 Misha kernel: rcu:         Possible timer handling issue on cpu=2 timer-softirq=945830
Dec 13 00:34:32 Misha kernel: rcu: rcu_preempt kthread starved for 21055 jiffies! g1352673 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
Dec 13 00:34:32 Misha kernel: rcu:         Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
Dec 13 00:34:32 Misha kernel: rcu: RCU grace-period kthread stack dump:
Dec 13 00:34:32 Misha kernel: task:rcu_preempt     state:I stack:0     pid:18    tgid:18    ppid:2      flags:0x00000800
Dec 13 00:34:32 Misha kernel: Stack : 9000000005966930 0000000000000000 0000000000080000 900000000a801400
Dec 13 00:34:32 Misha kernel:         0000000000000402 900000010065e5f8 9000000004870b14 90000001006c3d08
Dec 13 00:34:32 Misha kernel:         0000000000000000 90000000049ca240 0000000000000000 9000000004879e98
Dec 13 00:34:32 Misha kernel:         90000000049c4398 90000000032c22a0 900000000a801400 900000010065df40
Dec 13 00:34:32 Misha kernel:         00000000000000b0 9000000000000004 90000000049d2008 0000000000000000
Dec 13 00:34:32 Misha kernel:         0000000000000002 86a32dec93ad0052 00000001009ffdda 86a32dec93ad0052
Dec 13 00:34:32 Misha kernel:         0000000000000001 9000000005976798 0000000000000001 90000001006c3d80
Dec 13 00:34:32 Misha kernel:         9000000005076000 9000000005080000 90000001006c3d08 900000010065df40
Dec 13 00:34:32 Misha kernel:         9000000005976000 9000000004870b14 00000001009ffdd9 9000000004878a08
Dec 13 00:34:32 Misha kernel:         0000000000000000 0000000000000000 900000000a801540 00000001009ffdd9
Dec 13 00:34:32 Misha kernel:         ...
Dec 13 00:34:32 Misha kernel: Call Trace:
Dec 13 00:34:32 Misha kernel: [<900000000486f858>] __schedule+0x5f8/0x1880
Dec 13 00:34:32 Misha kernel: [<9000000004870b14>] schedule+0x34/0x140
Dec 13 00:34:32 Misha kernel: [<9000000004878a08>] schedule_timeout+0x88/0x140
Dec 13 00:34:32 Misha kernel: [<900000000334624c>] rcu_gp_fqs_loop+0x14c/0x740
Dec 13 00:34:32 Misha kernel: [<90000000033497d8>] rcu_gp_kthread+0x238/0x280
Dec 13 00:34:32 Misha kernel: [<90000000032aec9c>] kthread+0x11c/0x140
Dec 13 00:34:32 Misha kernel: [<9000000003252208>] ret_from_kernel_thread+0xc/0xa4
Dec 13 00:34:32 Misha kernel:
Dec 13 00:34:52 Misha sshd[33192]: Accepted password for lain from 192.168.1.4 port 50826 ssh2
Dec 13 00:34:52 Misha audit[33192]: SYSCALL arch=c0000102 syscall=64 success=yes exit=4 a0=3 a1=7ffffb82e7c0 a2=4 a3=0 items=0 ppid=891 pid=33192 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgi>
Dec 13 00:34:52 Misha audit: PROCTITLE proctitle=737368643A206C61696E205B707269765D
Dec 13 00:34:52 Misha kernel: audit: type=1006 audit(1702398892.302:235): pid=33192 uid=0 subj=unconfined old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=7 res=1
Dec 13 00:34:52 Misha kernel: audit: type=1300 audit(1702398892.302:235): arch=c0000102 syscall=64 success=yes exit=4 a0=3 a1=7ffffb82e7c0 a2=4 a3=0 items=0 ppid=891 pid=33192 auid=1000 uid=0 gid=0 euid=0 sui>
Dec 13 00:34:52 Misha kernel: audit: type=1327 audit(1702398892.302:235): proctitle=737368643A206C61696E205B707269765D
Dec 13 00:34:52 Misha sshd[33192]: pam_unix(system-remote-login:session): session opened for user lain(uid=1000) by (uid=0)
Dec 13 00:34:52 Misha systemd-logind[842]: New session 7 of user lain.
Dec 13 00:35:14 Misha kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [gdbus:21686]
Dec 13 00:35:14 Misha kernel: Modules linked in: qrtr snd_hda_codec_conexant snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep kvm spi_>
Dec 13 00:35:14 Misha kernel: CPU: 5 PID: 21686 Comm: gdbus Not tainted 6.7.0-aosc-main #1
Dec 13 00:35:14 Misha kernel: Hardware name: Loongson Loongson-3A6000-HV-7A2000-1w-V0.1-EVB/Loongson-3A6000-HV-7A2000-1w-EVB-V1.21, BIOS Loongson-UDK2018-V4.0.05636-stable2
Dec 13 00:35:14 Misha kernel: pc 900000000338c9f4 ra 900000000338cb7c tp 9000000190584000 sp 9000000190587b40
Dec 13 00:35:14 Misha kernel: a0 0000000000000000 a1 0000000000000000 a2 0000000000000000 a3 0000000000000000
Dec 13 00:35:14 Misha kernel: a4 0000000000000000 a5 0000000000000000 a6 0000000000000000 a7 0000000000000000
Dec 13 00:35:14 Misha kernel: t0 0000000000000001 t1 900000000a831320 t2 0000000000000002 t3 0000000000000000
Dec 13 00:35:14 Misha kernel: t4 9000000005080000 t5 0000000000000040 t6 0000000000000001 t7 0000000000000000
Dec 13 00:35:14 Misha kernel: t8 0000000000000000 u0 900000010a440c00 s9 900000000bc31320 s0 0000000000000005
Dec 13 00:35:14 Misha kernel: s1 00000000000000b4 s2 9000000003261580 s3 0000000000000001 s4 900000000507ffd8
Dec 13 00:35:14 Misha kernel: s5 0000000000000004 s6 0000000000000001 s7 0000000000000000 s8 900000000b42b200
Dec 13 00:35:14 Misha kernel:    ra: 900000000338cb7c smp_call_function_many_cond+0x3fc/0x720
Dec 13 00:35:14 Misha kernel:   ERA: 900000000338c9f4 smp_call_function_many_cond+0x274/0x720
Dec 13 00:35:14 Misha kernel:  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
Dec 13 00:35:14 Misha kernel:  PRMD: 00000004 (PPLV0 +PIE -PWE)
Dec 13 00:35:14 Misha kernel:  EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
Dec 13 00:35:14 Misha kernel:  ECFG: 00071c1c (LIE=2-4,10-12 VS=7)
Dec 13 00:35:14 Misha kernel: ESTAT: 00000800 [INT] (IS=11 ECode=0 EsubCode=0)
Dec 13 00:35:14 Misha kernel:  PRID: 0014d000 (Loongson-64bit, Loongson-3A6000-HV)
Dec 13 00:35:14 Misha kernel: CPU: 5 PID: 21686 Comm: gdbus Not tainted 6.7.0-aosc-main #1
Dec 13 00:35:14 Misha kernel: Hardware name: Loongson Loongson-3A6000-HV-7A2000-1w-V0.1-EVB/Loongson-3A6000-HV-7A2000-1w-EVB-V1.21, BIOS Loongson-UDK2018-V4.0.05636-stable2
Dec 13 00:35:14 Misha kernel: Stack : 9000000004e391a0 900000010040bcb8 9000000004864898 9000000190584000
Dec 13 00:35:14 Misha kernel:         900000010040bc00 0000000000000000 900000010040bc08 9000000004e391a0
Dec 13 00:35:14 Misha kernel:         0000000000000000 0000000000000000 0000000000000000 0000000000000000
Dec 13 00:35:14 Misha kernel:         0000000000000000 86a32dec93ad0052 0000000000000000 0000000000000000
Dec 13 00:35:14 Misha kernel:         0000000000000000 0000000000000000 0000000000000000 0000000000000000
Dec 13 00:35:14 Misha kernel:         732d36333635302e 0000000000000000 0000000006a60000 900000010040bdf0
Dec 13 00:35:14 Misha kernel:         9000000005080000 9000000004e391a0 0000000000000000 0000000000000004
Dec 13 00:35:14 Misha kernel:         0000000000000000 0000000000000016 900000000507ffd8 90000000049a4058
Dec 13 00:35:14 Misha kernel:         900000000b403940 9000000005080580 9000000003254520 00007fffdf7fafc8
Dec 13 00:35:14 Misha kernel:         00000000000000b0 0000000000000004 0000000000000000 0000000000071c1c
Dec 13 00:35:14 Misha kernel:         ...
Dec 13 00:35:14 Misha kernel: Call Trace:
Dec 13 00:35:14 Misha kernel: [<9000000003254520>] show_stack+0x40/0x180
Dec 13 00:35:14 Misha kernel: [<9000000004864898>] dump_stack_lvl+0x78/0xc4
Dec 13 00:35:14 Misha kernel: [<90000000033ccf64>] watchdog_timer_fn+0x2c4/0x340
Dec 13 00:35:14 Misha kernel: [<900000000336da7c>] __hrtimer_run_queues+0x15c/0x400
Dec 13 00:35:14 Misha kernel: [<900000000336f088>] hrtimer_interrupt+0x128/0x2e0
Dec 13 00:35:14 Misha kernel: [<90000000032579fc>] constant_timer_interrupt+0x3c/0x60
Dec 13 00:35:14 Misha kernel: [<9000000003320010>] __handle_irq_event_percpu+0xb0/0x300
Dec 13 00:35:14 Misha kernel: [<9000000003320280>] handle_irq_event_percpu+0x20/0xa0
Dec 13 00:35:14 Misha kernel: [<9000000003327ff4>] handle_percpu_irq+0x74/0xc0
Dec 13 00:35:14 Misha kernel: [<900000000331ee50>] generic_handle_domain_irq+0x30/0x60
Dec 13 00:35:14 Misha kernel: [<9000000003e47ff0>] handle_cpu_irq+0x70/0xc0
Dec 13 00:35:14 Misha kernel: [<9000000004864c90>] handle_loongarch_irq+0x30/0x60
Dec 13 00:35:14 Misha kernel: [<9000000004864d60>] do_vint+0xa0/0x100
Dec 13 00:35:14 Misha kernel: [<900000000338c9f4>] smp_call_function_many_cond+0x274/0x720
Dec 13 00:35:14 Misha kernel: [<900000000338cff8>] on_each_cpu_cond_mask+0x58/0xe0
Dec 13 00:35:14 Misha kernel: [<9000000003261880>] flush_tlb_page+0x80/0x1e0
Dec 13 00:35:14 Misha kernel: [<900000000358b3a4>] ptep_set_access_flags+0x84/0xc0
Dec 13 00:35:14 Misha kernel: [<9000000003570fb4>] do_wp_page+0x114/0x1380
Dec 13 00:35:14 Misha kernel: [<900000000357629c>] __handle_mm_fault+0x8dc/0x15c0
Dec 13 00:35:14 Misha kernel: [<9000000003577120>] handle_mm_fault+0x1a0/0x320
Dec 13 00:35:14 Misha kernel: [<900000000487b558>] do_page_fault+0x158/0x3ec
Dec 13 00:35:14 Misha kernel: [<900000000326acb8>] tlb_do_page_fault_1+0x118/0x1b4
Dec 13 00:35:14 Misha kernel:
Dec 13 00:35:42 Misha kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Dec 13 00:35:42 Misha kernel: rcu:         2-...!: (0 ticks this GP) idle=6388/0/0x0 softirq=494920/494920 fqs=0 (false positive?)
Dec 13 00:35:42 Misha kernel: rcu:         3-...!: (1 ticks this GP) idle=bb6c/1/0x4000000000000000 softirq=493654/493655 fqs=0
Dec 13 00:35:42 Misha kernel: rcu:         (detected by 4, t=84105 jiffies, g=1352673, q=376 ncpus=8)
Dec 13 00:35:42 Misha kernel: Sending NMI from CPU 4 to CPUs 2:
Dec 13 00:35:42 Misha kernel: Unable to send backtrace IPI to CPU2 - perhaps it hung?
Dec 13 00:35:42 Misha kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 48s! [gdbus:21686]
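
For anyone comparing logs across machines, here is a small sketch that pulls the watchdog events out of journalctl output like the log above. The regular expressions only cover the two message formats that appear in this log; the function name and the sample list are illustrative.

```python
import re

# Patterns copied from the two watchdog message formats seen in this log.
HARD = re.compile(r"watchdog: Watchdog detected hard LOCKUP on cpu (\d+)")
SOFT = re.compile(r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")


def find_lockups(lines):
    """Return (kind, cpu, seconds) tuples; seconds is None for hard lockups."""
    events = []
    for line in lines:
        m = HARD.search(line)
        if m:
            events.append(("hard", int(m.group(1)), None))
            continue
        m = SOFT.search(line)
        if m:
            events.append(("soft", int(m.group(1)), int(m.group(2))))
    return events


# Three lines taken verbatim from the log in this report.
SAMPLE = [
    "Dec 13 00:34:32 Misha kernel: watchdog: Watchdog detected hard LOCKUP on cpu 2",
    "Dec 13 00:35:14 Misha kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [gdbus:21686]",
    "Dec 13 00:35:42 Misha kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 48s! [gdbus:21686]",
]

if __name__ == "__main__":
    for event in find_lockups(SAMPLE):
        print(event)
```

Feeding it the output of `journalctl -k` should make it easier to see which CPU locks up hard first and which ones then stall behind it.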

KatyushaScarlet commented 6 months ago

I have run into a similar problem as well.

Two logs attached:

3a6000-evb-rx590-aosc-6.7.0-amdgpu-dmesg.log 3a6000-evb-rx590-aosc-6.7.0-amdgpu-dmesg-2.log

phorcys commented 6 months ago

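Reference: https://bbs.loongarch.org/d/327-amdgpu/4

[LiarOnce](https://bbs.loongarch.org/u/451)
    19 days ago
    Edited

I have now updated to the firmware from https://github.com/loongson/Firmware/tree/main/6000Series/PC/XA61200, and with DPM disabled everything runs normally.

Kernel parameters for reference: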

GRUB_CMDLINE_LINUX="radeon.cik_support=0 radeon.si_support=0 amdgpu.cik_support=1 amdgpu.si_support=1 amdgpu.sg_display=0 amdgpu.runpm=0 amdgpu.dpm=0"
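
A minimal sketch of merging extra options into an existing kernel command line without leaving duplicate or conflicting entries. `merge_cmdline` is a hypothetical helper, not part of GRUB or any distribution tool, and it assumes each option should appear only once (which is not true for a few parameters such as `console=`).

```python
def merge_cmdline(existing, extra):
    """Merge space-separated kernel parameters; a later value replaces an earlier one."""
    merged = {}
    for token in (existing + " " + extra).split():
        key = token.partition("=")[0]  # "amdgpu.dpm=0" -> "amdgpu.dpm", "quiet" -> "quiet"
        merged[key] = token            # dict keeps first-seen order, last-seen value
    return " ".join(merged.values())


if __name__ == "__main__":
    current = "quiet amdgpu.dpm=1"
    wanted = "amdgpu.runpm=0 amdgpu.dpm=0"
    print('GRUB_CMDLINE_LINUX="%s"' % merge_cmdline(current, wanted))
```

The resulting line still has to be written to the GRUB defaults file and the GRUB configuration regenerated with whatever tool your distribution provides.
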
Fearyncess commented 5 months ago

@phorcys If DPM is disabled, the card will no longer scale its clocks automatically, which leaves the GPU running at a lower frequency.
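
To check which clock level the GPU is actually sitting at, amdgpu exposes a `pp_dpm_sclk` file in sysfs that lists the available levels and marks the active one with `*`. Below is a small sketch of a parser for that format; the sample values are made up for illustration and were not measured on any card in this thread, and the `card0` path in the usage note is an assumption about device numbering.

```python
import re

# One level per line, e.g. "1: 603Mhz *"; the trailing "*" marks the active level.
LEVEL = re.compile(r"^(\d+):\s*(\d+)\s*MHz\s*(\*?)\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return ([(index, mhz), ...], active_index or None) from pp_dpm_sclk-style text."""
    levels, active = [], None
    for line in text.splitlines():
        m = LEVEL.match(line.strip())
        if not m:
            continue  # skip anything that is not a level line
        index, mhz = int(m.group(1)), int(m.group(2))
        levels.append((index, mhz))
        if m.group(3):
            active = index
    return levels, active


# Illustrative values only.
SAMPLE = "0: 214Mhz\n1: 603Mhz *\n2: 1183Mhz\n"

if __name__ == "__main__":
    print(parse_dpm_levels(SAMPLE))
```

On a real system the text would come from something like `/sys/class/drm/card0/device/pp_dpm_sclk`; whether that file is present at all with `amdgpu.dpm=0` is something to check on your own machine.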

LinuxResearcher commented 5 months ago

After adding this set of parameters, the display feels laggier.

xry111 commented 5 months ago

The faster and higher-end the card, the more likely this problem is to occur; anything that slows the card down lowers the probability.

LiarOnce commented 4 months ago

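For a Polaris-architecture card such as the RX550, I would not really recommend using my kernel parameters. They take effect on GCN 1.0/2.0 architectures, because the card I am using is an R5 340 (GCN 1.0 Oland).

In a few days I will buy an RX560 and continue testing.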

dg1vg4 commented 1 week ago

So this has finally been confirmed.

Fearyncess commented 1 week ago

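Actually, not really. From the findings of my own initial tests through to later test results from users of flagship-class GCN cards, the evidence shows that this problem has been present ever since the 7A bridge chip appeared. Based on some discussions that are not suitable to make public and on the commit message of chenhuacai's patch (see https://github.com/chenhuacai/linux/commit/741913c04d00072229330fe51862730339935fb4 ), the conclusion is that this problem cannot be completely solved on the 7A bridge chip; the only option is mitigation that lowers the probability of it occurring.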
