Open w23 opened 3 days ago
Random failures in well-tested, core kernel code is often a sign of power/voltage problems. How are you powering this Pi 4? And what else is attached?
It's powered using one of the recommended CanaKit power supplies. The issue appears only when performance counters are enabled; otherwise the board is stable. Over the past several weeks I've been running a (reasonably heavy) OpenGL ES app on it for hours, compiling relatively large C++ code bases, etc., all without issues. It's only when I enable v3d perf counters for the very same app that it crashes on the 2nd-3rd run.
Note that, as a reproduction step, even the bare and simple kmscube app results in the same kind of crash.
I've also been monitoring temperature and throttling. The temperature doesn't get higher than 45-47C, and no throttling is reported.
I could reproduce this (on 6.6 kernel).
@jasuarez - I believe you added the v3d perfmon interface. Is it possible there is an overflow when too many performance counters are added?
Interesting... I don't think it is a problem with too many performance counters. In your trace the problem happens when destroying the perfmon, not when creating it.
I managed to reproduce the issue on both the rpi4 and the rpi5 (I need to make sure the GALLIUM_HUD_VISIBLE and GALLIUM_HUD envvars are exported).
The stack trace I get is totally different:
[ 233.383300] CPU: 3 PID: 164 Comm: v3d_bin Tainted: G C 6.6.47+rpt-rpi-v8 #1 Debian 1:6.6.47-1+rpt1
[ 233.383325] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[ 233.383339] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 233.383357] pc : __mutex_lock.constprop.0+0x90/0x608
[ 233.383381] lr : __mutex_lock.constprop.0+0x58/0x608
[ 233.383396] sp : ffffffc08063bcf0
[ 233.383406] x29: ffffffc08063bcf0 x28: 0000000000000000 x27: ffffff8106168a28
[ 233.383428] x26: ffffff8101bad840 x25: ffffff8103283540 x24: ffffffe14af26148
[ 233.383449] x23: ffffffe1b9aa4008 x22: 0000000000000002 x21: ffffffc08063bd38
[ 233.383471] x20: ffffff8106163d80 x19: ffffff8103aae308 x18: 0000000000000000
[ 233.383491] x17: 0000000000000000 x16: ffffffe1b9504d18 x15: 00000055a438da50
[ 233.383512] x14: 01a39a8ac0758558 x13: 0000000000000001 x12: ffffffe1b954cbb0
[ 233.383533] x11: 00000000f5257d14 x10: 0000000000001a40 x9 : ffffffe1b9504d04
[ 233.383554] x8 : ffffff8102851e00 x7 : 0000000000000000 x6 : 00000000032dedc4
[ 233.383574] x5 : 00ffffffffffffff x4 : 0000000000000088 x3 : 0000000000000088
[ 233.383595] x2 : ffffff8106163d80 x1 : 0000000000000021 x0 : 0000000000000088
[ 233.383616] Call trace:
[ 233.383626] __mutex_lock.constprop.0+0x90/0x608
[ 233.383642] __mutex_lock_slowpath+0x1c/0x30
[ 233.383657] mutex_lock+0x50/0x68
[ 233.383669] v3d_perfmon_stop+0x40/0xe0 [v3d]
[ 233.383704] v3d_bin_job_run+0x10c/0x2d8 [v3d]
[ 233.383729] drm_sched_main+0x178/0x3f8 [gpu_sched]
[ 233.383755] kthread+0x11c/0x128
[ 233.383773] ret_from_fork+0x10/0x20
[ 233.383790] Code: f9400260 f1001c1f 54001ea9 927df000 (b9403401)
[ 233.383807] ---[ end trace 0000000000000000 ]---
[ 233.383852] note: v3d_bin[164] exited with preempt_count 1
On a second retry I got a totally different backtrace (too long to paste here).
But I suspect we could be trying to access an invalid memory address. In both cases we get an "Unable to handle kernel paging request at virtual address".
I'm ccing @mairacanal, who also works on the kernel, to see if she has a better idea of what's going on here.
I can reproduce the issue even with just a few counters:

```shell
GALLIUM_HUD=stdout
GALLIUM_HUD+=,fps
GALLIUM_HUD+=,frametime
GALLIUM_HUD+=,cpu
GALLIUM_HUD+=,samples-passed
GALLIUM_HUD+=,primitives-generated
GALLIUM_HUD+=,PTB-primitives-discarded-outside-viewport
```
I again get the same stack trace as the first time. Worth noting that before that stack trace there is an "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000", which again looks like we are trying to access an invalid address.
Also, this time I tested with an older kernel, 6.6.31, and I still get the same issue.
Describe the bug
Using v3d performance counters, e.g. via GALLIUM_HUD, leads to random kernel panics after a few runs.
Steps to reproduce the behaviour
Set GALLIUM_HUD to show some performance counters. Not sure if any specific counter, or combination of them, is causing this. Using just a single counter doesn't seem to crash even after several tries. Enabling 10-20 counters crashes on a second run.
Device(s)
Raspberry Pi 4 Mod. B
System
Logs
Usually the panic messages and stacks are completely unrelated; I've seen call stacks from usb and ext4 code, among others. One that might be relevant:
Additional context
No response