riastradh / netbsd-src

Automatic conversion of the NetBSD src CVS module, use with care
https://www.NetBSD.org

amdgpu: reset doesn't work #28

Open mrgtwentythree opened 3 years ago

mrgtwentythree commented 3 years ago

Radeon RX 550 in a Ryzen 3600.

when amdgpu hangs and resets itself, the reset does not actually work: the display remains in a damaged state (looks like static), and the reset logs extra errors beyond the messages seen at boot time.

[ 1782.774072] ERROR ring gfx timeout, signaled seq=50, emitted seq=53
[ 1782.784072] ERROR Process information: process mpv pid 3895 thread pid 3836
[ 1782.794072] amdgpu0: info: GPU reset begin!
[ 1783.454066] cp is busy, skip halt cp
[ 1783.684065] rlc is busy, skip halt rlc
[ 1783.694064] amdgpu0: info: GPU BACO reset
[ 1783.844063] amdgpu0: info: GPU reset succeeded, trying to resume
[ 1783.854063] kern info: [drm] PCIE GART of 256M enabled (table at 0x000000F400900000).
[ 1783.864063] kern info: [drm] VRAM is lost due to GPU reset!
[ 1783.934062] message 308 was not supported
[ 1783.944063] No valid PCIE lane width reported
[ 1783.954062] last message was not supported
[ 1784.014061] kern info: [drm] UVD and UVD ENC initialized successfully.
[ 1784.164060] kern info: [drm] VCE initialized successfully.
[ 1784.174060] kern info: [drm] recover vram bo from shadow start
[ 1784.174060] kern info: [drm] recover vram bo from shadow done
[ 1784.184060] kern info: [drm] Skip scheduling IBs!
[ 1784.184060] kern info: [drm] Skip scheduling IBs!
[ 1784.194060] ERROR Failed to initialize parser -87!
[ 1784.204060] amdgpu0: info: GPU reset(2) succeeded!
[ 1784.204060] ERROR Failed to initialize parser -87!
[ 1784.214060] ERROR Failed to initialize parser -87!
[ 1784.224059] ERROR Failed to initialize parser -87!
[ 1794.183969] ERROR ring gfx timeout, signaled seq=54, emitted seq=54
[ 1794.193969] ERROR Process information: process mpv pid 3895 thread pid 3836
[ 1794.213969] amdgpu0: info: GPU reset begin!
[ 1794.693965] amdgpu0: info: GPU BACO reset
[ 1794.853963] amdgpu0: info: GPU reset succeeded, trying to resume
[ 1794.863963] kern info: [drm] PCIE GART of 256M enabled (table at 0x000000F400900000).
[ 1794.863963] kern info: [drm] VRAM is lost due to GPU reset!
[ 1794.943964] message 308 was not supported
[ 1794.953963] No valid PCIE lane width reported
[ 1794.953963] No valid PCIE lane width reported
[ 1794.953963] No valid PCIE lane width reported
[ 1794.963963] last message was not supported
[ 1795.023962] kern info: [drm] UVD and UVD ENC initialized successfully.
[ 1795.173961] kern info: [drm] VCE initialized successfully.
[ 1795.173961] kern info: [drm] recover vram bo from shadow start
[ 1795.183961] kern info: [drm] recover vram bo from shadow done
[ 1795.203960] amdgpu0: info: GPU reset(3) succeeded!
[ 2613.976577] ERROR Failed to initialize parser -87!
[ 2613.986577] ERROR Failed to initialize parser -87!
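for anyone comparing reset attempts across logs like the one above, here is a rough Python sketch that splits a pasted dmesg excerpt into (timestamp, message) pairs and counts messages of interest. `parse_entries` and `count_matching` are hypothetical helper names made up for this example, not part of any NetBSD tool, and the regex assumes every entry starts with a `[ seconds.micros]` stamp as in the excerpt.

```python
import re

# One entry = "[ 1782.774072] message". The lookahead stops the message at the
# next timestamp (or end of input), so entries may be one per line or run
# together on a single line. "[drm]" is not mistaken for a timestamp because
# the lookahead requires digits inside the brackets.
ENTRY_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s*(.*?)(?=\s*\[\s*\d+\.\d+\]|\s*$)", re.S
)


def parse_entries(text):
    """Return a list of (timestamp_seconds, message) tuples, in log order."""
    return [(float(ts), msg.strip()) for ts, msg in ENTRY_RE.findall(text)]


def count_matching(entries, needle):
    """Count entries whose message contains the substring `needle`."""
    return sum(1 for _, msg in entries if needle in msg)


if __name__ == "__main__":
    sample = (
        "[ 1782.794072] amdgpu0: info: GPU reset begin! "
        "[ 1783.864063] kern info: [drm] VRAM is lost due to GPU reset! "
        "[ 1784.194060] ERROR Failed to initialize parser -87!"
    )
    entries = parse_entries(sample)
    print("resets started:", count_matching(entries, "GPU reset begin!"))
    print("parser errors:", count_matching(entries, "Failed to initialize parser"))
```

running the same counts over the full excerpt would show how many resets began versus how many parser errors followed each one.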

this last message actually repeated about 8 times, and was linked to me pressing ^C against 'mpv'. the -87 (-ECANCELED) at the 1784s timestamp was not caused by me pressing anything.

mrgtwentythree commented 3 years ago

sometimes failed resets were fixable by restarting X, but not always. i killed the X server as well, and now both mpv and X have these stack traces (first mpv, then X). they're both stuck, apparently waiting forever, while trying to exit().

crash> bt/a ffffe0562e6061c0
trace: pid 3836 lid 771 at 0xffffb484bc353a40
sleepq_block() at sleepq_block+0x12c
cv_wait() at cv_wait+0x42
linux_dma_fence_default_wait() at linux_dma_fence_default_wait+0x163
linux_dma_fence_wait_timeout() at linux_dma_fence_wait_timeout+0xde
linux_dma_fence_wait() at linux_dma_fence_wait+0x52
amdgpu_vm_fini() at amdgpu_vm_fini+0xa9
amdgpu_driver_postclose_kms() at amdgpu_driver_postclose_kms+0x133
drm_file_free() at drm_file_free+0x1a8
drm_close() at drm_close+0x60
closef() at closef+0x60
fd_free() at fd_free+0x1e4
exit1() at exit1+0x13e
sigexit() at sigexit+0x1da
sendsig() at sendsig
lwp_userret() at lwp_userret+0x1c3
mi_userret() at mi_userret+0x249
syscall() at syscall+0x116
--- syscall (number 4) ---
syscall+0x116:
crash> bt/a ffffe05620d7eb80
trace: pid 3519 lid 3519 at 0xffffb484bc160b90
sleepq_block() at sleepq_block+0x12c
cv_wait() at cv_wait+0x42
linux_dma_fence_default_wait() at linux_dma_fence_default_wait+0x163
linux_dma_fence_wait_timeout() at linux_dma_fence_wait_timeout+0xde
linux_dma_fence_wait() at linux_dma_fence_wait+0x52
amdgpu_vm_fini() at amdgpu_vm_fini+0xa9
amdgpu_driver_postclose_kms() at amdgpu_driver_postclose_kms+0x133
drm_file_free() at drm_file_free+0x1a8
drm_close() at drm_close+0x60
closef() at closef+0x60
fd_free() at fd_free+0x1e4
exit1() at exit1+0x13e
sys_exit() at sys_exit+0x39
syscall() at syscall+0x196
--- syscall (number 1) ---
syscall+0x196:
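
to check that two backtraces like the ones above are blocked on the same path, a small Python sketch can pull out the function names and compare them from the innermost frame outward. `frames` and `common_prefix` are hypothetical helper names for this example only, and the regex assumes the `name() at name+0xoff` frame format shown in the traces.

```python
import re

# A frame looks like "cv_wait() at cv_wait+0x42"; the "+0x..." offset is
# optional because some frames (e.g. "sendsig() at sendsig") omit it.
FRAME_RE = re.compile(r"(\w+)\(\) at \w+(?:\+0x[0-9a-f]+)?")


def frames(trace):
    """Return the function names in a backtrace, innermost frame first."""
    return FRAME_RE.findall(trace)


def common_prefix(a, b):
    """Return the leading run of frames that two backtraces share."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


if __name__ == "__main__":
    mpv = "cv_wait() at cv_wait+0x42 exit1() at exit1+0x13e sigexit() at sigexit+0x1da"
    xorg = "cv_wait() at cv_wait+0x42 exit1() at exit1+0x13e sys_exit() at sys_exit+0x39"
    print(common_prefix(frames(mpv), frames(xorg)))
```

in the two traces above, the frames from sleepq_block() through exit1() are identical, so both processes are waiting on a fence inside amdgpu_vm_fini() during close; they only differ in how they entered exit1().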