raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11.17k stars 5.01k forks source link

Using v3d performance counters causes kernel panic #6389

Closed w23 closed 1 month ago

w23 commented 1 month ago

Describe the bug

Using v3d performance counters, e.g. from GALLIUM_HUD leads to random kernel panics after a few runs.

Steps to reproduce the behaviour

  1. Get and build https://gitlab.freedesktop.org/mesa/kmscube/
  2. Set GALLIUM_HUD to show some performance counters. E.g like this:
    GALLIUM_HUD_VISIBLE=false
    GALLIUM_HUD=stdout
    GALLIUM_HUD+=,fps
    GALLIUM_HUD+=,frametime
    GALLIUM_HUD+=,cpu
    GALLIUM_HUD+=,samples-passed
    GALLIUM_HUD+=,primitives-generated
    GALLIUM_HUD+=,PTB-primitives-discarded-outside-viewport
    GALLIUM_HUD+=,QPU-total-idle-clk-cycles
    GALLIUM_HUD+=,QPU-total-active-clk-cycles-vertex-coord-shading
    GALLIUM_HUD+=,QPU-total-active-clk-cycles-fragment-shading
    GALLIUM_HUD+=,QPU-total-clk-cycles-executing-valid-instr
    GALLIUM_HUD+=,QPU-total-clk-cycles-waiting-TMU
    GALLIUM_HUD+=,QPU-total-clk-cycles-waiting-varyings
    GALLIUM_HUD+=,QPU-total-instr-cache-hit
    GALLIUM_HUD+=,QPU-total-instr-cache-miss
    GALLIUM_HUD+=,TMU-total-text-quads-access
    GALLIUM_HUD+=,TMU-total-text-cache-miss
    GALLIUM_HUD+=,L2T-total-cache-hit
    GALLIUM_HUD+=,L2T-total-cache-miss
    GALLIUM_HUD+=,QPU-total-clk-cycles-waiting-vertex-coord-shading
    GALLIUM_HUD+=,QPU-total-clk-cycles-waiting-fragment-shading
    GALLIUM_HUD+=,TLB-partial-quads-written-to-color-buffer
    GALLIUM_HUD+=,TMU-active-cycles
    GALLIUM_HUD+=,TMU-stalled-cycles
    GALLIUM_HUD+=,L2T-TMU-reads
    GALLIUM_HUD+=,L2T-TMU-write-miss
    GALLIUM_HUD+=,L2T-TMU-read-miss
    GALLIUM_HUD+=,TMU-MRU-hits
    ./build/kmscube
  3. Run kmscube a few times for a few seconds each.
  4. After 3-4th time kernel will panic at some random and often unrelated location. I've seen stacks from drm+v3d, ext4, usb, etc. It's completely arbitrary.

Not sure if any specific counter, or combination of, is causing this. Using just a single counter doesn't seem to crash even after several tries. Enabling 10-20 counters crashes on a second run.

Device (s)

Raspberry Pi 4 Mod. B

System

Raspberry Pi reference 2024-07-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 0b115f302a8f1e5bd3523614d7f45b9d447434c7, stage2

Aug 30 2024 19:17:39
Copyright (c) 2012 Broadcom
version 2808975b80149bbfe86844655fe45c7de66fc078 (clean) (release) (start)

Linux lgpi00 6.6.47+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.47-1+rpt1 (2024-09-02) aarch64 GNU/Linux

Logs

Usually panic messages and stacks are completely unrelated, I've seen from usb and ext4 call stacks. One that might be relevant:

[  391.090878] ------------[ cut here ]------------
[  391.090891] WARNING: CPU: 0 PID: 1188 at mm/slab_common.c:994 free_large_kmalloc+0x78/0xb8
[  391.090916] Modules linked in: cmac algif_hash aes_arm64 aes_generic algif_skcipher af_alg bnep brcmfmac_wcc brcmfmac vc4 brcmutil cfg80211 binfmt_misc snd_soc_hdmi_codec hci_uart uvcvideo drm_display_helper b
tbcm cec bluetooth uvc rpivid_hevc(C) bcm2835_codec(C) drm_dma_helper v3d drm_kms_helper bcm2835_isp(C) v4l2_mem2mem bcm2835_v4l2(C) bcm2835_mmal_vchiq(C) raspberrypi_hwmon videobuf2_vmalloc videobuf2_dma_contig
gpu_sched snd_soc_core drm_shmem_helper ecdh_generic ecc videobuf2_memops rfkill videobuf2_v4l2 videodev snd_compress snd_pcm_dmaengine libaes raspberrypi_gpiomem snd_bcm2835(C) vc_sm_cma(C) snd_pcm videobuf2_com
mon snd_timer snd mc nvmem_rmem uio_pdrv_genirq uio drm fuse dm_mod drm_panel_orientation_quirks backlight ip_tables x_tables ipv6 i2c_brcmstb
[  391.091067] CPU: 0 PID: 1188 Comm: kmscube Tainted: G         C         6.6.47+rpt-rpi-v8 #1  Debian 1:6.6.47-1+rpt1
[  391.091075] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
[  391.091079] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  391.091085] pc : free_large_kmalloc+0x78/0xb8
[  391.091092] lr : kfree+0x134/0x140
[  391.091098] sp : ffffffc0800fbb60
[  391.091101] x29: ffffffc0800fbb60 x28: ffffffe1ace63b48 x27: ffffffc0800fbcf8
[  391.091111] x26: ffffffc0800fbcf8 x25: ffffff804174b400 x24: 0000000000000049
[  391.091121] x23: ffffffc0800fbcf8 x22: ffffffe1acf050a0 x21: ffffffe1acf04a84
[  391.091130] x20: ffffff8044873820 x19: fffffffe010a8c40 x18: 0000000000000000
[  391.091140] x17: 0000000000000000 x16: ffffffe2126d4760 x15: 00000000ffeaffc0
[  391.091149] x14: 0000000000000004 x13: ffffff8044873808 x12: 0000000000000000
[  391.091158] x11: ffffff804b1b1de8 x10: ffffff804b1b1da8 x9 : ffffffe2126d4894
[  391.091167] x8 : ffffff804b1b1dd0 x7 : 0000000000000000 x6 : 0000000000000228
[  391.091175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : fffffffe010a8c40
[  391.091184] x2 : 0000000000000001 x1 : ffffff8042a31498 x0 : 4000000000000000
[  391.091193] Call trace:
[  391.091196]  free_large_kmalloc+0x78/0xb8
[  391.091203]  kfree+0x134/0x140
[  391.091209]  v3d_perfmon_put.part.0+0x64/0x90 [v3d]
[  391.091237]  v3d_perfmon_destroy_ioctl+0x54/0x80 [v3d]
[  391.091254]  drm_ioctl_kernel+0xd8/0x190 [drm]
[  391.091378]  drm_ioctl+0x220/0x4c0 [drm]
[  391.091462]  drm_compat_ioctl+0x118/0x140 [drm]
[  391.091546]  __arm64_compat_sys_ioctl+0x158/0x180
[  391.091558]  invoke_syscall+0x50/0x128
[  391.091567]  el0_svc_common.constprop.0+0x48/0xf0
[  391.091574]  do_el0_svc_compat+0x24/0x48
[  391.091581]  el0_svc_compat+0x30/0x88
[  391.091591]  el0t_32_sync_handler+0x98/0x140
[  391.091595]  el0t_32_sync+0x194/0x198
[  391.091601] ---[ end trace 0000000000000000 ]---
[  391.091950] object pointer: 0x000000007ee5e213

Additional context

No response

pelwell commented 1 month ago

Random failures in well-tested, core kernel code is often a sign of power/voltage problems. How are you powering this Pi 4? And what else is attached?

w23 commented 1 month ago

Random failures in well-tested, core kernel code is often a sign of power/voltage problems. How are you powering this Pi 4? And what else is attached?

It's powered using one of the recommended CanaKit power supply. The issue appears only when enabling performance counters. Other than that the board is stable. During the past several weeks I've been running a (reasonably heavy) OpenGL ES app on it for hours, compiling relatively big C++ code bases, etc. All without issues. It's only when I'm enabling v3d perf counters for the very same app it will crash on 2nd-3rd run.

Note that for reproduction steps the bare and simple kmscube app results in the same kind of crash.

w23 commented 1 month ago

I've been also monitoring temperature and throttling. Temperature doesn't get higher than 45-47C, and there's no throttling reported.

popcornmix commented 1 month ago

I could reproduce this (on 6.6 kernel).

@jasuarez - I believe you added the v3d perfmon interface. Is it possible there is an overflow when too many performance counters are added?

jasuarez commented 1 month ago

Interesting... I don't think it is a problem with too many performance counters. In your trace the problem happens when destroying the perfmon, not when creating it.

I managed to reproduce the issue both in rpi4 and rpi5 (I need to ensure to export GALLIUM_HUD_VISIBLE and GALLIUM_HUD envvars).

The stack trace I get is totally different:

[  233.383300] CPU: 3 PID: 164 Comm: v3d_bin Tainted: G         C         6.6.47+rpt-rpi-v8 #1  Debian 1:6.6.47-1+rpt1
[  233.383325] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[  233.383339] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  233.383357] pc : __mutex_lock.constprop.0+0x90/0x608
[  233.383381] lr : __mutex_lock.constprop.0+0x58/0x608
[  233.383396] sp : ffffffc08063bcf0
[  233.383406] x29: ffffffc08063bcf0 x28: 0000000000000000 x27: ffffff8106168a28
[  233.383428] x26: ffffff8101bad840 x25: ffffff8103283540 x24: ffffffe14af26148
[  233.383449] x23: ffffffe1b9aa4008 x22: 0000000000000002 x21: ffffffc08063bd38
[  233.383471] x20: ffffff8106163d80 x19: ffffff8103aae308 x18: 0000000000000000
[  233.383491] x17: 0000000000000000 x16: ffffffe1b9504d18 x15: 00000055a438da50
[  233.383512] x14: 01a39a8ac0758558 x13: 0000000000000001 x12: ffffffe1b954cbb0
[  233.383533] x11: 00000000f5257d14 x10: 0000000000001a40 x9 : ffffffe1b9504d04
[  233.383554] x8 : ffffff8102851e00 x7 : 0000000000000000 x6 : 00000000032dedc4
[  233.383574] x5 : 00ffffffffffffff x4 : 0000000000000088 x3 : 0000000000000088
[  233.383595] x2 : ffffff8106163d80 x1 : 0000000000000021 x0 : 0000000000000088
[  233.383616] Call trace:
[  233.383626]  __mutex_lock.constprop.0+0x90/0x608
[  233.383642]  __mutex_lock_slowpath+0x1c/0x30
[  233.383657]  mutex_lock+0x50/0x68
[  233.383669]  v3d_perfmon_stop+0x40/0xe0 [v3d]
[  233.383704]  v3d_bin_job_run+0x10c/0x2d8 [v3d]
[  233.383729]  drm_sched_main+0x178/0x3f8 [gpu_sched]
[  233.383755]  kthread+0x11c/0x128
[  233.383773]  ret_from_fork+0x10/0x20
[  233.383790] Code: f9400260 f1001c1f 54001ea9 927df000 (b9403401) 
[  233.383807] ---[ end trace 0000000000000000 ]---
[  233.383852] note: v3d_bin[164] exited with preempt_count 1

In a second retry I got totally different backtrace (too long to paste here).

But I suspect we could trying to access an invalid memory address. In both cases we get a "Unable to handle kernel paging request at virtual address".

I'm ccing @mairacanal who also works in the kernel, to see if she knows better what's going on here.

jasuarez commented 1 month ago

I can reproduce the issue even with a few of the counters:

GALLIUM_HUD=stdout
GALLIUM_HUD+=,fps
GALLIUM_HUD+=,frametime
GALLIUM_HUD+=,cpu
GALLIUM_HUD+=,samples-passed
GALLIUM_HUD+=,primitives-generated
GALLIUM_HUD+=,PTB-primitives-discarded-outside-viewport

I get again the same stack trace I got the first time. Worth to note that before that stack trace there is an Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000, which looks like we are trying to access an invalid address.

Also, this time I've tested with an older kernel, 6.6.31, and still get the same issue.

mairacanal commented 1 month ago

I was able to reproduce the issue in the RPi 4 and RPi 5. After some debugging, I determined that the issue was related to an invalid memory access by V3D.

I wrote a patch fixing the issue, which is available in this branch (it is the first patch). This patch solved the kernel crashes I had on both machines (RPi 4 and 5).

@w23 @jasuarez @popcornmix Could you also test the patch? I want to check if the crashes are also fixed on your side.

popcornmix commented 1 month ago

Looks good here. Was panicking after typically 3 runs for me. With update has got though dozens of runs without a panic. (tested on Pi5).

mairacanal commented 1 month ago

I slightly changed the patch as the previous one had an issue that if the active performance monitor didn't belong to the process that we were terminating, it would stop it anyway. I wrote a slightly different version of the patch with the same goal and it fixes the problem on RPi 4 and 5.

I sent the fix to this issue to dri-devel mailing list: https://lore.kernel.org/dri-devel/20241004130625.918580-2-mcanal@igalia.com/T/

Also, as vc4's perfmon code is similar, I sent a fix for vc4 as well: https://lore.kernel.org/dri-devel/20241004123817.890016-1-mcanal@igalia.com/T/

@popcornmix let me know if you want me to send a PR with the patches. They will eventually land on 6.6 through the stable branch, but maybe you want to have the patches quickly in the branch. Anyway, let me know.

popcornmix commented 1 month ago

@mairacanal a PR would be welcome.

jasuarez commented 1 month ago

I've tested the fix and indeed fixes the issue.

I'll add a R-b in the mailing list. Thanks!

w23 commented 1 month ago

Can confirm that this patch https://github.com/intel-lab-lkp/linux/commit/074b3b5a2a976528b1262acbf519298fb8c6eac8 applied against the latest release kernel (6.6.51+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.51-1+rpt2 (2024-10-01) aarch64 GNU/Linux, commit 82a50e430ef1) fixes the issue for me. Been running profiling suite in a loop for some time now, with almost a hundred start-stop cycles.

Thank you @mairacanal for your work, and for fixing this issue so quickly!