Random failures in well-tested, core kernel code are often a sign of power/voltage problems. How are you powering this Pi 4? And what else is attached?
It's powered by one of the recommended CanaKit power supplies. The issue appears only when enabling performance counters; other than that the board is stable. During the past several weeks I've been running a (reasonably heavy) OpenGL ES app on it for hours, compiling relatively big C++ code bases, etc., all without issues. It's only when I enable v3d perf counters for that very same app that it crashes on the 2nd-3rd run.
Note that, for the reproduction steps, the bare and simple kmscube app results in the same kind of crash.
I've also been monitoring temperature and throttling. Temperature doesn't get higher than 45-47°C, and no throttling is reported.
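For reference, the stock Raspberry Pi OS tools are enough to check this, roughly:

vcgencmd measure_temp      # current SoC temperature
vcgencmd get_throttled     # throttled=0x0 means no under-voltage or throttling flags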
I could reproduce this (on a 6.6 kernel).
@jasuarez - I believe you added the v3d perfmon interface. Is it possible there is an overflow when too many performance counters are added?
Interesting... I don't think it is a problem with too many performance counters. In your trace the problem happens when destroying the perfmon, not when creating it.
I managed to reproduce the issue on both the RPi 4 and the RPi 5 (I need to make sure to export the GALLIUM_HUD_VISIBLE and GALLIUM_HUD envvars).
The stack trace I get is totally different:
[ 233.383300] CPU: 3 PID: 164 Comm: v3d_bin Tainted: G C 6.6.47+rpt-rpi-v8 #1 Debian 1:6.6.47-1+rpt1
[ 233.383325] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[ 233.383339] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 233.383357] pc : __mutex_lock.constprop.0+0x90/0x608
[ 233.383381] lr : __mutex_lock.constprop.0+0x58/0x608
[ 233.383396] sp : ffffffc08063bcf0
[ 233.383406] x29: ffffffc08063bcf0 x28: 0000000000000000 x27: ffffff8106168a28
[ 233.383428] x26: ffffff8101bad840 x25: ffffff8103283540 x24: ffffffe14af26148
[ 233.383449] x23: ffffffe1b9aa4008 x22: 0000000000000002 x21: ffffffc08063bd38
[ 233.383471] x20: ffffff8106163d80 x19: ffffff8103aae308 x18: 0000000000000000
[ 233.383491] x17: 0000000000000000 x16: ffffffe1b9504d18 x15: 00000055a438da50
[ 233.383512] x14: 01a39a8ac0758558 x13: 0000000000000001 x12: ffffffe1b954cbb0
[ 233.383533] x11: 00000000f5257d14 x10: 0000000000001a40 x9 : ffffffe1b9504d04
[ 233.383554] x8 : ffffff8102851e00 x7 : 0000000000000000 x6 : 00000000032dedc4
[ 233.383574] x5 : 00ffffffffffffff x4 : 0000000000000088 x3 : 0000000000000088
[ 233.383595] x2 : ffffff8106163d80 x1 : 0000000000000021 x0 : 0000000000000088
[ 233.383616] Call trace:
[ 233.383626] __mutex_lock.constprop.0+0x90/0x608
[ 233.383642] __mutex_lock_slowpath+0x1c/0x30
[ 233.383657] mutex_lock+0x50/0x68
[ 233.383669] v3d_perfmon_stop+0x40/0xe0 [v3d]
[ 233.383704] v3d_bin_job_run+0x10c/0x2d8 [v3d]
[ 233.383729] drm_sched_main+0x178/0x3f8 [gpu_sched]
[ 233.383755] kthread+0x11c/0x128
[ 233.383773] ret_from_fork+0x10/0x20
[ 233.383790] Code: f9400260 f1001c1f 54001ea9 927df000 (b9403401)
[ 233.383807] ---[ end trace 0000000000000000 ]---
[ 233.383852] note: v3d_bin[164] exited with preempt_count 1
On a second retry I got a totally different backtrace (too long to paste here).
But I suspect we could be trying to access an invalid memory address. In both cases we get an "Unable to handle kernel paging request at virtual address".
I'm cc'ing @mairacanal, who also works on the kernel, to see if she has a better idea of what's going on here.
I can reproduce the issue even with just a few of the counters:
GALLIUM_HUD=stdout
GALLIUM_HUD+=,fps
GALLIUM_HUD+=,frametime
GALLIUM_HUD+=,cpu
GALLIUM_HUD+=,samples-passed
GALLIUM_HUD+=,primitives-generated
GALLIUM_HUD+=,PTB-primitives-discarded-outside-viewport
I again get the same stack trace I got the first time. Worth noting that before that stack trace there is an "Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000", which again looks like we are trying to access an invalid address.
Also, this time I've tested with an older kernel, 6.6.31, and still get the same issue.
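For completeness, the counters above only take effect if the variables are actually exported before launching the client; something along these lines (using kmscube as a simple test app, as mentioned earlier in the thread) should be enough to trigger the crash:

export GALLIUM_HUD                 # export the counter list built up above
export GALLIUM_HUD_VISIBLE=true    # exact value is not the point; just make sure both envvars are exported
kmscube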
I was able to reproduce the issue on both the RPi 4 and RPi 5. After some debugging, I determined that the issue was related to an invalid memory access by V3D.
I wrote a patch fixing the issue, which is available in this branch (it is the first patch). This patch solved the kernel crashes I had on both machines (RPi 4 and 5).
@w23 @jasuarez @popcornmix Could you also test the patch? I want to check if the crashes are also fixed on your side.
Looks good here. It was panicking after typically 3 runs for me; with the update it has got through dozens of runs without a panic (tested on a Pi 5).
I slightly changed the patch, as the previous one had an issue: if the active performance monitor didn't belong to the process being terminated, it would be stopped anyway. I wrote a slightly different version of the patch with the same goal, and it fixes the problem on RPi 4 and 5.
I sent the fix for this issue to the dri-devel mailing list: https://lore.kernel.org/dri-devel/20241004130625.918580-2-mcanal@igalia.com/T/
Also, as vc4's perfmon code is similar, I sent a fix for vc4 as well: https://lore.kernel.org/dri-devel/20241004123817.890016-1-mcanal@igalia.com/T/
@popcornmix let me know if you want me to send a PR with the patches. They will eventually land on 6.6 through the stable branch, but maybe you want to have the patches quickly in the branch. Anyway, let me know.
@mairacanal a PR would be welcome.
I've tested the fix and it indeed fixes the issue.
I'll add an R-b on the mailing list. Thanks!
Can confirm that this patch https://github.com/intel-lab-lkp/linux/commit/074b3b5a2a976528b1262acbf519298fb8c6eac8, applied against the latest release kernel (6.6.51+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.51-1+rpt2 (2024-10-01) aarch64 GNU/Linux, commit 82a50e430ef1), fixes the issue for me. I've been running a profiling suite in a loop for some time now, with almost a hundred start-stop cycles.
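Roughly this kind of loop, with kmscube standing in for the actual suite (and assuming kmscube's --count option so each run exits on its own):

for i in $(seq 1 100); do
    kmscube --count=600 || break    # one start-stop cycle per iteration; bail out if a run fails
done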
Thank you @mairacanal for your work, and for fixing this issue so quickly!
Describe the bug
Using v3d performance counters, e.g. from GALLIUM_HUD, leads to random kernel panics after a few runs.

Steps to reproduce the behaviour
Set GALLIUM_HUD to show some performance counters, e.g. like the set below.

Not sure if any specific counter, or combination of them, is causing this. Using just a single counter doesn't seem to crash even after several tries. Enabling 10-20 counters crashes on a second run.
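An illustrative set (the exact combination doesn't seem to matter):

GALLIUM_HUD=fps,frametime,cpu,samples-passed,primitives-generated,PTB-primitives-discarded-outside-viewport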
Device(s)
Raspberry Pi 4 Mod. B
System
Logs
Usually the panic messages and stacks are completely unrelated; I've seen USB and ext4 call stacks, for example. One that might be relevant:
Additional context
No response