Open mjevans opened 3 months ago
This is an amdgpu bug, not a mpv one. mpv crashes as a result of your gpu resetting, not the other way around.
[ 1777.234990] [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=168813, emitted seq=168816 [ 1777.236251] [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
6.10.1
This kernel version has known issues, such as https://gitlab.freedesktop.org/drm/amd/-/issues/3142. Look around the issue tracker; your issue may already be reported.
You should try the LTS kernel.
Thank you, linux-lts also avoids this issue (so far). None of my test cases cause the issue.
I'm leaving this open to point anyone else reporting the bug to linux-lts, as that appears to be the intent.
Do you have any example video (link) that causes the error? Even though I'm on AMD (AMD ATI 04:00.0 Lucienne) I don't get these errors.
I'm presently back on stable software and this already ate several hours of my time between last night and today.
I didn't sample a wide range of videos while affected, particularly as the first two I happened to try to play after encountering the issue both readily exhibited the symptoms.
VLC was able to play one of the test cases without any issue. This could also be a weakness in how mpv utilizes the interfaces offered by the kernel. However it might just be that VLC happened to have a use pattern that's more resistant and mpv is still operating within the interface specifications. Utilizing an earlier kernel version as suggested in this bug also fixed the issue and allows all the other software I have installed to fulfill their dependencies. However, I didn't happen to encounter the issue with 6.8.x and 6.9.x kernels; that might just mean at the time mpv+stack happened to, similar to VLC presently, avoid triggering the issue.
I didn't see any remotely similar bugs (here) before, nor after being asked to check again... https://github.com/mpv-player/mpv/issues?q=is%3Aissue+amdgpu ; and an amdgpu specific issue seemed unlikely given VLC worked so I hadn't checked for any threads there.
As I just mentioned in the ArchLinux thread (filed more for awareness), I've been using 6.9.x and earlier kernels all month, so I can't say whether 6.10 worked or whether 6.10.1 introduced the apparent regression. ( https://gitlab.archlinux.org/archlinux/packaging/packages/mpv/-/issues/11 )
After looking at the bugs, I suspect drm:amdgpu_job_timedout (also: ring gfx timeout) are keywords/phrases for related bugs https://gitlab.freedesktop.org/drm/amd/-/issues/?sort=created_date&state=opened&search=amdgpu_job_timedout&first_page_size=20 Though there were also a couple of threads that might not have those in easily searched locations.
I'm not sure whether this AMDGPU issue is related, but the proposed fixes (return to contiguous GPU memory) sound like they might be; that, or ring buffers aren't being allocated in a way that works for both sides of the buffer (which probably requires contiguous buffers). https://gitlab.freedesktop.org/drm/amd/-/issues/3501
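For anyone checking whether their own kernel log shows the same symptom, here is a minimal sketch that filters log text for the two phrases mentioned above. The function name `find_timeout_lines` and the `KEYWORDS` tuple are hypothetical names chosen for illustration, and the sample lines are copied from the log in this report; other amdgpu hangs may log different text.

```python
# Phrases taken from the kernel log in this report (assumption: related
# reports contain at least one of them; some may not).
KEYWORDS = ("amdgpu_job_timedout", "ring gfx timeout")


def find_timeout_lines(log_text):
    """Return the log lines that contain any of the keywords."""
    return [
        line
        for line in log_text.splitlines()
        if any(keyword in line for keyword in KEYWORDS)
    ]


# Three lines copied from the dmesg output in this report.
sample = "\n".join([
    "[ 1766.321299] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)",
    "[ 1777.234990] [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=168813, emitted seq=168816",
    "[ 1777.236251] [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007",
])

matches = find_timeout_lines(sample)
for line in matches:
    print(line)
```

The same function could be fed saved `dmesg` or journal output read from a file; it only does substring matching, so it won't catch reports that describe the hang in other words.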
I re-tested this evening since ArchLinux had an updated 6.10.2 kernel.
Linux hostname 6.10.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 27 Jul 2024 16:49:55 +0000 x86_64 GNU/Linux
With the current rolling release environment the problematic videos now play back with some visual glitches similar to a lost keyframe and then behave correctly. Those issues do not happen with VLC (same kernel, etc). I was unable to reproduce the issues at all when using the updated system and a custom built 6.10.2 kernel with the patch https://lore.kernel.org/all/20240725080750.183176-1-christian.koenig@amd.com/
(clearly unrelated packages culled from the list)
19 core/libtool 2.5.0+14+g9a4a0261-2 -> 2.5.1-1
18 core/linux 6.10.1.arch1-1 -> 6.10.2.arch1-1
17 core/linux-headers 6.10.1.arch1-1 -> 6.10.2.arch1-1
11 extra/libva 2.21.0-1 -> 2.22.0-1
10 extra/libvpx 1.14.0-1 -> 1.14.1-1
7 extra/netstandard-targeting-pack 8.0.6.sdk106-1 -> 8.0.7.sdk107-1
6 extra/python-numpy 2.0.0-1 -> 2.0.1-1
5 extra/python-trio 0.25.1-1 -> 0.26.0-1
4 extra/python-zipp 3.18.1-2 -> 3.19.2-1
3 extra/svt-av1 2.1.0-1 -> 2.1.2-1
1 multilib/lib32-libvpx 1.14.0-1 -> 1.14.1-1
mpv Information
Other Information
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
Linux control 6.10.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 24 Jul 2024 22:25:43 +0000 x86_64 GNU/Linux
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] [1002:6939]
*-display
    description: VGA compatible controller
    product: Tonga PRO [Radeon R9 285/380] [1002:6939]
    vendor: Advanced Micro Devices, Inc. [AMD/ATI] [1002]
    physical id: 0
    bus info: pci@0000:01:00.0
    logical name: /dev/fb0
    version: 00
    width: 64 bits
    clock: 33MHz
    capabilities: pm pciexpress msi vga_controller bus_master cap_list rom fb
    configuration: depth=32 driver=amdgpu latency=0 resolution=1920,1080
    resources: irq:51 memory:e0000000-efffffff memory:f0000000-f01fffff ioport:e000(size=256) memory:f7e00000-f7e3ffff memory:c0000-dffff
OpenGL version string: 4.6 (Compatibility Profile) Mesa 24.1.4-arch1.2
driverInfo = Mesa 24.1.4-arch1.2
driverInfo = Mesa 24.1.4-arch1.2
plasma-desktop 6.1.3-1
ArchLinux current stable builds
pacman -U mpv-1\:0.38.0-4-x86_64.pkg.tar.zst libplacebo-6.338.2-6-x86_64.pkg.tar.zst ffmpeg-2\:6.1.1-7-x86_64.pkg.tar.zst x265-3.5-3-x86_64.pkg.tar.zst ffmpeg4.4-4.4.4-5-x86_64.pkg.tar.zst (angry list of other packages, so -d to ignore them as they just won't work, not crash my desktop)
WORKING
mpv v0.38.0-dirty Copyright © 2000-2024 mpv/MPlayer/mplayer2 projects
built on May 23 2024 10:29:03
libplacebo version: v6.338.2
FFmpeg version: n6.1.1
FFmpeg library versions:
libavutil 58.29.100
libavcodec 60.31.102
libavformat 60.16.100
libswscale 7.5.100
libavfilter 9.12.100
libswresample 4.12.100

BROKEN
mpv v0.38.0-dirty Copyright © 2000-2024 mpv/MPlayer/mplayer2 projects
built on Jul 3 2024 05:59:22
libplacebo version: v7.349.0
FFmpeg version: n7.0.1
FFmpeg library versions:
libavutil 59.8.100
libavcodec 61.3.100
libavformat 61.1.100
libswscale 8.1.100
libavfilter 10.1.100
libswresample 5.1.100
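To make the difference between the two builds easier to see, here is a small sketch that compares the version strings listed above and reports which components moved to a new major version. The dictionaries are transcribed from the WORKING and BROKEN outputs; the helper names `major` and `major_bumps` are hypothetical. This only summarizes what changed between the two builds; it does not show which change, if any, triggers the GPU fault.

```python
import re

# Version strings transcribed from the WORKING and BROKEN builds above.
working = {
    "libplacebo": "v6.338.2",
    "FFmpeg": "n6.1.1",
    "libavutil": "58.29.100",
    "libavcodec": "60.31.102",
    "libavformat": "60.16.100",
    "libswscale": "7.5.100",
    "libavfilter": "9.12.100",
    "libswresample": "4.12.100",
}
broken = {
    "libplacebo": "v7.349.0",
    "FFmpeg": "n7.0.1",
    "libavutil": "59.8.100",
    "libavcodec": "61.3.100",
    "libavformat": "61.1.100",
    "libswscale": "8.1.100",
    "libavfilter": "10.1.100",
    "libswresample": "5.1.100",
}


def major(version):
    """Return the first integer in a version string such as 'v6.338.2'."""
    return int(re.search(r"\d+", version).group())


def major_bumps(old, new):
    """Map each shared component whose major version increased to (old, new)."""
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and major(new[name]) > major(old[name])
    }


bumps = major_bumps(working, broken)
for name, (old_version, new_version) in bumps.items():
    print(f"{name}: {old_version} -> {new_version}")
```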
Reproduction Steps
Appears to have a chance of happening ANY time the playback buffer is seeked, including file open.
Expected Behavior
Normal video playback / seek and playback.
Actual Behavior
GPU resets, which TERMINATES Xorg and all running GUI session programs.
Log File
Sorry, I won't be able to collect the --gpu-debug log file, as it'll very likely crash my GPU and kill the xorg session before any useful data is collected. If there is another way of obtaining equally useful data without crashing the GPU and thus killing my entire desktop every time I try to collect the data please let me know.
-- mpv -vo gpu-next crashed plasmashell during these events
[ 1766.321165] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0a22c802
[ 1766.321171] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321172] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101F44
[ 1766.321174] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0C8002
[ 1766.321175] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1056580, write from 'TC3' (0x54433300) (200)
[ 1766.321237] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07f2a002
[ 1766.321238] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321239] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010120C
[ 1766.321240] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002
[ 1766.321241] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053196, write from 'CB2' (0x43423200) (32)
[ 1766.321244] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b29002
[ 1766.321245] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321247] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101237
[ 1766.321247] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B010002
[ 1766.321248] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053239, write from 'CB3' (0x43423300) (16)
[ 1766.321255] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772e002
[ 1766.321256] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321257] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101200
[ 1766.321258] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002
[ 1766.321258] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053184, write from 'CB4' (0x43423400) (160)
[ 1766.321262] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772d002
[ 1766.321263] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321264] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101232
[ 1766.321264] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002
[ 1766.321265] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053234, write from 'CB4' (0x43423400) (160)
[ 1766.321268] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07729002
[ 1766.321269] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321271] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010123A
[ 1766.321271] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002
[ 1766.321272] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053242, write from 'CB1' (0x43423100) (80)
[ 1766.321275] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732d002
[ 1766.321276] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321277] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001012AB
[ 1766.321278] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002
[ 1766.321279] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053355, write from 'CB2' (0x43423200) (32)
[ 1766.321282] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07126002
[ 1766.321283] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321284] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010124C
[ 1766.321285] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0E0002
[ 1766.321286] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053260, write from 'CB6' (0x43423600) (224)
[ 1766.321289] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b21002
[ 1766.321290] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321291] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101223
[ 1766.321292] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002
[ 1766.321293] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80)
[ 1766.321296] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002
[ 1766.321297] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321298] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101277
[ 1766.321298] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002
[ 1766.321299] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)
[ 1777.234990] [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=168813, emitted seq=168816
[ 1777.236251] [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
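The VM fault lines above can be picked apart mechanically. The sketch below is a hedged illustration, not a description of the driver's documented output format: the regular expression is fitted to the lines in this log only, and `parse_vm_fault` is a hypothetical helper name. It relies on two observations from this log: the 32-bit word in parentheses spells the quoted client name in ASCII (for example 0x54433300 is "TC3" plus a NUL byte), and the decimal page number equals the preceding VM_CONTEXT1_PROTECTION_FAULT_ADDR value when written in hex.

```python
import re

# Pattern fitted to the "VM fault" lines in this report (an assumption;
# other kernel versions or GPU generations may format the line differently).
FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+), pasid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)' \((0x[0-9a-fA-F]+)\) \((\d+)\)"
)


def parse_vm_fault(line):
    """Extract the fields of one VM fault line, or return None if it doesn't match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    page = int(match.group(4))
    client_word = int(match.group(7), 16)
    # Decode the 32-bit word as big-endian ASCII and drop the NUL padding.
    client_from_hex = client_word.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")
    return {
        "vmid": int(match.group(2)),
        "pasid": int(match.group(3)),
        "page": page,
        "page_hex": hex(page),
        "access": match.group(5),
        "client": match.group(6),
        "client_from_hex": client_from_hex,
    }


# First VM fault line from the log above.
line = (
    "[ 1766.321175] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, "
    "pasid 32772) at page 1056580, write from 'TC3' (0x54433300) (200)"
)
fault = parse_vm_fault(line)
print(fault)
```

For this line, `page_hex` is `0x101f44`, which matches the VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101F44 logged just before it, and the decoded client is `TC3`.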
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80)
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101277
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)
Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=168813, emitted seq=168816
Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:21 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Jul 25 22:09:21 HOSTNAME kernel: amdgpu: cp is busy, skip halt cp
Jul 25 22:09:22 HOSTNAME kernel: amdgpu: rlc is busy, skip halt rlc
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Jul 25 22:09:22 HOSTNAME kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400800000).
Jul 25 22:09:22 HOSTNAME kernel: [drm] VRAM is lost due to GPU reset!
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring comp_1.2.0 test failed (-110)
Jul 25 22:09:22 HOSTNAME kernel: [drm] UVD initialized successfully.
Jul 25 22:09:22 HOSTNAME kernel: [drm] VCE initialized successfully.
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
Jul 25 22:09:22 HOSTNAME mpv[5307]: amdgpu: The CS has cancelled because the context is lost. This context is innocent.
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) succeeded!
Jul 25 22:09:22 HOSTNAME systemd-coredump[5681]: Process 5307 (mpv) of user 1000 terminated abnormally with signal 6/ABRT, processing...
Jul 25 22:09:22 HOSTNAME systemd[1]: Created slice Slice /system/drkonqi-coredump-processor.
-- Subject: A start job for unit system-drkonqi\x2dcoredump\x2dprocessor.slice has finished successfully
I was experimenting with various -vo options last night; IIRC this might have been -vo gpu rather than -vo gpu-next.
Jul 25 22:09:28 HOSTNAME systemd-coredump[5683]: [LNK] Process 5307 (mpv) of user 1000 dumped core.
-- Subject: Process 5307 (mpv) dumped core
Jul 25 22:09:28 HOSTNAME drkonqi-coredump-processor[5684]: "/usr/bin/mpv" 5307 "/var/lib/systemd/coredump/core.mpv.1000.a7022ac2080f4d4c858a881ab115c1d2.5307.1721970562000000.zst"
Sample Files
No response