ffmpeg HW acceleration crashes GPU on ADL

jvrobert commented 2 years ago

System information

model name : 12th Gen Intel(R) Core(TM) i7-12700K
00:02.0 VGA compatible controller [0300]: Intel Corporation AlderLake-S GT1 [8086:4680] (rev 0c)
No display; render only, in ffmpeg.

Issue behavior

Describe the current behavior

When using the latest compiled media driver and ffmpeg 5 (also happens on 4.x) with the latest drm-tip kernel and linux-firmware binaries (also happens on the Ubuntu 20.04 HW kernel), ffmpeg (running under Frigate NVR) runs with hardware acceleration, using either QSV or VAAPI decode, for somewhere between 10 and 30 minutes (usually; sometimes longer). After that, it crashes the GPU with this error:
[ 4009.472554] i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
[ 4009.474067] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [27844]
[ 4020.835642] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [27844]
[ 4020.836679] i915 0000:00:02.0: [drm] Resetting vcs1 for stopped heartbeat on vcs1
[ 4020.837224] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs1
[ 4020.939613] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[ 4021.028683] i915 0000:00:02.0: [drm] ERROR Failed to reset chip
[ 4021.028762] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x25b/0x2d0 [i915]
[ 4021.131605] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[ 4021.133494] i915 0000:00:02.0: [drm] ffmpeg[27844] context reset due to GPU hang
[ 4023.672616] ffmpeg[27894]: segfault at 0 ip 0000000000000000 sp 00007fff30a1add8 error 14 in ffmpeg[556214dda000+b000]
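For anyone who wants to watch for this failure automatically (for example to restart a container or alert), the hang entries in the kernel log above can be picked out with a small script. This is only a sketch: the regular expression is written against the exact line format quoted in this issue, and the function name `find_gpu_hangs` is made up for illustration. Other kernel versions may format the message differently.

```python
import re

# Matches i915 hang lines in the form quoted in this issue, e.g.
# "[ 4009.474067] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [27844]"
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i915 (?P<pci>[0-9a-f:.]+): \[drm\] GPU HANG: "
    r"ecode (?P<ecode>[0-9a-f:]+), in (?P<proc>\S+) \[(?P<pid>\d+)\]"
)


def find_gpu_hangs(dmesg_text):
    """Return one dict per 'GPU HANG' entry found in the given dmesg text."""
    return [
        {
            "timestamp": float(m["ts"]),
            "pci": m["pci"],
            "ecode": m["ecode"],
            "process": m["proc"],
            "pid": int(m["pid"]),
        }
        for m in HANG_RE.finditer(dmesg_text)
    ]


# Sample lines copied from the log in this report.
SAMPLE = """\
[ 4009.472554] i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
[ 4009.474067] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [27844]
[ 4020.835642] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [27844]
[ 4020.836679] i915 0000:00:02.0: [drm] Resetting vcs1 for stopped heartbeat on vcs1
"""

hangs = find_gpu_hangs(SAMPLE)
for hang in hangs:
    print(hang)
```

The pattern only handles absolute `[seconds.micros]` timestamps; logs captured with relative or wall-clock timestamps would need a different prefix.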

ffmpeg settings: -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format yuv420p

Describe the expected behavior

Not crash.

Debug information

What's libva/libva-utils/gmmlib/media-driver version?
root@6d859362545b:/opt/frigate# ls /usr/lib/x86_64-linux-gnu/mfx
/usr/lib/x86_64-linux-gnu/libmfx.so.1
/usr/lib/x86_64-linux-gnu/libmfxhw64.so.1
/usr/lib/x86_64-linux-gnu/libmfx.so.1.35
/usr/lib/x86_64-linux-gnu/libmfxhw64.so.1.35

Note re: vainfo, I also tried a new container with ffmpeg and compiled latest version of vainfo, media driver, gmm, everything - same issue.

root@6d859362545b:/opt/frigate# vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 1.12.0
libva info: User environment variable requested driver 'iHD'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_12
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.12 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 21.3.3 (6fdf88c)
vainfo: Supported profile and entrypoints
VAProfileNone : VAEntrypointVideoProc
VAProfileNone : VAEntrypointStats
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Simple : VAEntrypointEncSlice
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointFEI
VAProfileH264Main : VAEntrypointEncSliceLP
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileH264High : VAEntrypointFEI
VAProfileH264High : VAEntrypointEncSliceLP
VAProfileVC1Simple : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointEncPicture
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
VAProfileH264ConstrainedBaseline: VAEntrypointFEI
VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
VAProfileHEVCMain : VAEntrypointVLD
VAProfileHEVCMain : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointFEI
VAProfileHEVCMain : VAEntrypointEncSliceLP
VAProfileHEVCMain10 : VAEntrypointVLD
VAProfileHEVCMain10 : VAEntrypointEncSlice
VAProfileHEVCMain10 : VAEntrypointEncSliceLP
VAProfileVP9Profile0 : VAEntrypointVLD
VAProfileVP9Profile0 : VAEntrypointEncSliceLP
VAProfileVP9Profile1 : VAEntrypointVLD
VAProfileVP9Profile1 : VAEntrypointEncSliceLP
VAProfileVP9Profile2 : VAEntrypointVLD
VAProfileVP9Profile2 : VAEntrypointEncSliceLP
VAProfileVP9Profile3 : VAEntrypointVLD
VAProfileVP9Profile3 : VAEntrypointEncSliceLP
VAProfileHEVCMain12 : VAEntrypointVLD
VAProfileHEVCMain12 : VAEntrypointEncSlice
VAProfileHEVCMain422_10 : VAEntrypointVLD
VAProfileHEVCMain422_10 : VAEntrypointEncSlice
VAProfileHEVCMain422_12 : VAEntrypointVLD
VAProfileHEVCMain422_12 : VAEntrypointEncSlice
VAProfileHEVCMain444 : VAEntrypointVLD
VAProfileHEVCMain444 : VAEntrypointEncSliceLP
VAProfileHEVCMain444_10 : VAEntrypointVLD
VAProfileHEVCMain444_10 : VAEntrypointEncSliceLP
VAProfileHEVCMain444_12 : VAEntrypointVLD
VAProfileHEVCSccMain : VAEntrypointVLD
VAProfileHEVCSccMain : VAEntrypointEncSliceLP
VAProfileHEVCSccMain10 : VAEntrypointVLD
VAProfileHEVCSccMain10 : VAEntrypointEncSliceLP
VAProfileHEVCSccMain444 : VAEntrypointVLD
VAProfileHEVCSccMain444 : VAEntrypointEncSliceLP
VAProfileAV1Profile0 : VAEntrypointVLD
VAProfileHEVCSccMain444_10 : VAEntrypointVLD
VAProfileHEVCSccMain444_10 : VAEntrypointEncSliceLP

Could you provide libva trace log if possible? Run cmd export LIBVA_TRACE=/tmp/libva_trace.log first then execute the case.

Only useful logs from libva:

/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273421][ctx 0x10000000]==========va_TraceEndPicture
/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273422][ctx 0x10000000] context = 0x10000000
/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273422][ctx 0x10000000] render_targets = 0x0000001c
/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273504][ctx none]=========vaEndPicture ret = VA_STATUS_ERROR_DECODING_ERROR, internal decoding error
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000]==========va_TraceBeginPicture
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000] context = 0x10000000
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000] render_targets = 0x00000019
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245549][ctx 0x10000000] frame_count = #7
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245558][ctx 0x10000000]==========va_TraceRenderPicture
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245558][ctx 0x10000000] context = 0x10000000
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245558][ctx 0x10000000] num_buffers = 2
/tmp/libva_trace.log.184412.thd-0x0000098f:[53500.245559][ctx 0x10000000] --------------
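Since the trace files get large, it can help to pull out only the calls that returned an error status, like the `vaEndPicture` line above. The following is a rough sketch, assuming the `[timestamp][ctx ...]=====<function> ret = <status>` layout seen in this excerpt; `failed_va_calls` is an illustrative name, not part of libva.

```python
import re

# Matches trace lines that report a return status, in the layout seen above:
# "...[54444.273504][ctx none]=========vaEndPicture ret = VA_STATUS_ERROR_DECODING_ERROR, ..."
STATUS_RE = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\[ctx (?P<ctx>\w+)\]=+(?P<func>va\w+) ret = (?P<status>VA_STATUS_\w+)"
)


def failed_va_calls(trace_text):
    """Return (timestamp, function, status) for every non-success status line."""
    failures = []
    for line in trace_text.splitlines():
        m = STATUS_RE.search(line)
        # Skip success lines defensively, in case the trace contains them.
        if m and m["status"] != "VA_STATUS_SUCCESS":
            failures.append((float(m["ts"]), m["func"], m["status"]))
    return failures


# Two lines copied from the trace excerpt in this report.
SAMPLE = (
    "/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273422][ctx 0x10000000] render_targets = 0x0000001c\n"
    "/tmp/libva_trace.log.184412.thd-0x0000098e:[54444.273504][ctx none]=========vaEndPicture "
    "ret = VA_STATUS_ERROR_DECODING_ERROR, internal decoding error\n"
)

failures = failed_va_calls(SAMPLE)
print(failures)
```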

Could you attach dmesg log if it's GPU hang by dmesg >dmesg.log 2>&1?
[155523.319847] i915 0000:00:02.0: [drm:i915_gem_context_create_ioctl [i915]] HW context 16 created
[155534.199385] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [102504]
[155534.200411] i915 0000:00:02.0: [drm] Resetting vcs0 for stopped heartbeat on vcs0
[155534.200945] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs0
[155534.302952] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[155534.394325] i915 0000:00:02.0: [drm] ERROR Failed to reset chip
[155534.394347] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x258/0x2d0 [i915]
[155534.497281] [drm:uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[155534.499244] i915 0000:00:02.0: [drm] ffmpeg[102504] context reset due to GPU hang
[155534.520720] intel_gt_invalidate_tlbs: 36 callbacks suppressed
[155534.520734] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.525130] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms!
[155534.531383] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.536543] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms!
[155534.540749] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.546000] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms!
[155534.551252] i915 0000:00:02.0: [drm] ERROR rcs0 TLB invalidation did not complete in 4ms!
[155534.556511] i915 0000:00:02.0: [drm] ERROR bcs0 TLB invalidation did not complete in 4ms

Do you want to contribute a patch to fix the issue? (yes/no):

Jexu commented 2 years ago

It looks like a GPU hang occurs. May I know which codec the content you are decoding uses? It would be much better if you could share the content with us so we can take a look locally.

FCLC commented 2 years ago

I've run into the same issue on my end. On initial boot vainfo seems fine.

After attempting to execute: ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i hdr_source.mp4 -vf tonemap_vaapi -c:v hevc_vaapi sdr_out.mp4 -y

it runs for a few moments before the application crashes.

Once it crashes, the VAAPI interface seems inoperable. I'll post the before and after below, as well as the relevant excerpts from dmesg and the ffmpeg logs.

The source file in question is a Sony HDR10 demo file called "Sony Swordsmith HDR UHD 4K Demo.mp4". It can be found in various places online, but the relevant characteristics are: Video: hevc (Main 10) (hvc1 / 0x31637668), yuv420p10le(tv, bt2020nc/bt2020/smpte2084), 3840x2160 [SAR 1:1 DAR 16:9], 71382 kb/s, 59.94 fps, 59.94 tbr, 60k tbn (default)

First run of ffmpeg where it initially hangs:

ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i sony.mp4 -vf tonemap_vaapi -c:v hevc_vaapi out.mp4 -y
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'sony.mp4':
  Metadata:
    major_brand : isom
    minor_version : 1
    compatible_brands: isom
    creation_time : 2016-10-24T05:33:14.000000Z
  Duration: 00:01:26.10, start: 0.000000, bitrate: 71567 kb/s
  Stream #0:0[0x1](und): Video: hevc (Main 10), 1 reference frame (hvc1 / 0x31637668), yuv420p10le(tv, bt2020nc/bt2020/smpte2084, topleft), 3840x2160 [SAR 1:1 DAR 16:9], 71382 kb/s, 59.94 fps, 59.94 tbr, 60k tbn (default)
    Metadata:
      creation_time : 2016-10-24T06:29:51.000000Z
      handler_name : Video Media Handler
      vendor_id : [0][0][0][0]
      encoder : HEVC Coding
  Stream #0:1[0x2](eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 192 kb/s (default)
    Metadata:
      creation_time : 2016-10-24T06:29:51.000000Z
      handler_name : Sound Media Handler
      vendor_id : [0][0][0][0]
[AVHWDeviceContext @ 0x2127380] libva: VA-API version 1.15.0
[AVHWDeviceContext @ 0x2127380] libva: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
[AVHWDeviceContext @ 0x2127380] libva: Found init function __vaDriverInit_1_13
[AVHWDeviceContext @ 0x2127380] libva: va_openDriver() returns 0
[AVHWDeviceContext @ 0x2127380] Initialised VAAPI connection: version 1.15
[AVHWDeviceContext @ 0x2127380] VAAPI driver: Intel iHD driver for Intel(R) Gen Graphics - 21.4.1 (be92568).
[AVHWDeviceContext @ 0x2127380] Driver not found in known nonstandard list, using standard behaviour.
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (native) -> hevc (hevc_vaapi))
  Stream #0:1 -> #0:1 (aac (native) -> aac (native))
Press [q] to stop, [?] for help
[Parsed_tonemap_vaapi_0 @ 0x674b280] Output format not set, use default format NV12
[graph 0 input from stream 0:0 @ 0x674b5c0] w:3840 h:2160 pixfmt:vaapi tb:1/60000 fr:60000/1001 sar:1/1
[hevc_vaapi @ 0x212a200] Using input frames context (format vaapi) with hevc_vaapi encoder.
[hevc_vaapi @ 0x212a200] Input surface format is nv12.
[hevc_vaapi @ 0x212a200] Using VAAPI profile VAProfileHEVCMain (17).
[hevc_vaapi @ 0x212a200] Using VAAPI entrypoint VAEntrypointEncSlice (6).
[hevc_vaapi @ 0x212a200] Using VAAPI render target format YUV420 (0x1).
[hevc_vaapi @ 0x212a200] No quality level set; using default (25).
[hevc_vaapi @ 0x212a200] RC mode: ICQ.
[hevc_vaapi @ 0x212a200] RC quality: 25.
[hevc_vaapi @ 0x212a200] RC framerate: 60000/1001 (59.94 fps).
[hevc_vaapi @ 0x212a200] Using intra, P- and B-frames (supported references: 4 / 4).
[hevc_vaapi @ 0x212a200] All wanted packed headers available (wanted 0xd, found 0x1f).
[hevc_vaapi @ 0x212a200] Using level 5.
[graph_1_in_0_1 @ 0xa62ea80] tb:1/48000 samplefmt:fltp samplerate:48000 chlayout:0x3
Output #0, mp4, to 'out.mp4':
  Metadata:
    major_brand : isom
    minor_version : 1
    compatible_brands: isom
    encoder : Lavf59.16.100
  Stream #0:0(und): Video: hevc (Main), 1 reference frame (hev1 / 0x31766568), vaapi(tv, bt2020nc/bt2020/bt709, progressive, topleft), 3840x2160 (0x0) [SAR 1:1 DAR 16:9], q=2-31, 59.94 fps, 60k tbn (default)
    Metadata:
      creation_time : 2016-10-24T06:29:51.000000Z
      handler_name : Video Media Handler
      vendor_id : [0][0][0][0]
      encoder : Lavc59.18.100 hevc_vaapi
  Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, delay 1024, 128 kb/s (default)
    Metadata:
      creation_time : 2016-10-24T06:29:51.000000Z
      handler_name : Sound Media Handler
      vendor_id : [0][0][0][0]
      encoder : Lavc59.18.100 aac
^C^C^CReceived > 3 system signals, hard exiting=00:00:19.58 bitrate=18739.9kbits/s speed=0.786x

After the initial crash:

ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i sony.mp4 -vf "format=nv12:t=bt709" -c:v hevc_vaapi out.mp4 -y
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'sony.mp4':
  Metadata:
    major_brand : isom
    minor_version : 1
    compatible_brands: isom
    creation_time : 2016-10-24T05:33:14.000000Z
  Duration: 00:01:26.10, start: 0.000000, bitrate: 71567 kb/s
  Stream #0:0[0x1](und): Video: hevc (Main 10), 1 reference frame (hvc1 / 0x31637668), yuv420p10le(tv, bt2020nc/bt2020/smpte2084, topleft), 3840x2160 [SAR 1:1 DAR 16:9], 71382 kb/s, 59.94 fps, 59.94 tbr, 60k tbn (default)
    Metadata:
      creation_time : 2016-10-24T06:29:51.000000Z
      handler_name : Video Media Handler
      vendor_id : [0][0][0][0]
      encoder : HEVC Coding
  Stream #0:1[0x2](eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 192 kb/s (default)
    Metadata:
      creation_time : 2016-10-24T06:29:51.000000Z
      handler_name : Sound Media Handler
      vendor_id : [0][0][0][0]
[AVHWDeviceContext @ 0x210c380] libva: VA-API version 1.15.0
[AVHWDeviceContext @ 0x210c380] libva: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
[AVHWDeviceContext @ 0x210c380] libva: Found init function __vaDriverInit_1_13
Segmentation fault (core dumped)

dmesg

[Apr 5 21:30] i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
[ +0.000713] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.003786] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [28335]
[ +11.366505] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [28335]
[ +0.001094] i915 0000:00:02.0: [drm] Resetting vcs1 for stopped heartbeat on vcs1
[ +0.000708] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.000120] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs1
[ +0.102497] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.000730] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.001224] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.018029] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.000729] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.001223] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.026018] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.000729] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.001225] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.038074] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.000729] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.001225] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.000509] i915 0000:00:02.0: [drm] *ERROR* Failed to reset chip
[ +0.000004] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x2b9/0x300 [i915]
[ +0.103959] [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110
[ +0.001036] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.000723] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.001222] i915 0000:00:02.0: [drm] *ERROR* vcs1 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ +0.000579] i915 0000:00:02.0: [drm] ffmpeg[28335] context reset due to GPU hang
[ +4.946989] Fence expiration time out i915-0000:00:02.0:ffmpeg[28335]:914!
[Apr 5 21:37] ffmpeg[28623]: segfault at 0 ip 0000000000000000 sp 00007ffd44a8b2c8 error 14 in ffmpeg[400000+16000]
[ +0.000005] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[Apr 5 21:38] ffmpeg[28649]: segfault at 0 ip 0000000000000000 sp 00007ffe6395d708 error 14 in ffmpeg[400000+16000]
[ +0.000006] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[Apr 5 21:40] ffmpeg[28895]: segfault at 0 ip 0000000000000000 sp 00007fff368b1688 error 14 in ffmpeg[400000+16000]
[ +0.000006] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[Apr 5 21:41] vainfo[28963]: segfault at 0 ip 0000000000000000 sp 00007fff18a56d48 error 14 in vainfo[55868b236000+2000]
[ +0.000006] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[Apr 5 21:42] usb 1-5.1.1.3: USB disconnect, device number 12
[Apr 5 21:47] vainfo[29690]: segfault at 0 ip 0000000000000000 sp 00007ffcf1df7888 error 14 in vainfo[558af141b000+2000]
[ +0.000005] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[Apr 5 21:48] ffmpeg[29816]: segfault at 0 ip 0000000000000000 sp 00007ffe096e2a88 error 14 in ffmpeg[400000+16000]
[ +0.000006] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.

vainfo before crash:

vainfo --display drm --device /dev/dri/renderD128
libva info: VA-API version 1.15.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_13
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.15 (libva 2.13.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 21.4.1 (be92568)
vainfo: Supported profile and entrypoints
VAProfileNone : VAEntrypointVideoProc
VAProfileNone : VAEntrypointStats
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Simple : VAEntrypointEncSlice
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointFEI
VAProfileH264Main : VAEntrypointEncSliceLP
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileH264High : VAEntrypointFEI
VAProfileH264High : VAEntrypointEncSliceLP
VAProfileVC1Simple : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointEncPicture
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
VAProfileH264ConstrainedBaseline: VAEntrypointFEI
VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
VAProfileHEVCMain : VAEntrypointVLD
VAProfileHEVCMain : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointFEI
VAProfileHEVCMain : VAEntrypointEncSliceLP
VAProfileHEVCMain10 : VAEntrypointVLD
VAProfileHEVCMain10 : VAEntrypointEncSlice
VAProfileHEVCMain10 : VAEntrypointEncSliceLP
VAProfileVP9Profile0 : VAEntrypointVLD
VAProfileVP9Profile0 : VAEntrypointEncSliceLP
VAProfileVP9Profile1 : VAEntrypointVLD
VAProfileVP9Profile1 : VAEntrypointEncSliceLP
VAProfileVP9Profile2 : VAEntrypointVLD
VAProfileVP9Profile2 : VAEntrypointEncSliceLP
VAProfileVP9Profile3 : VAEntrypointVLD
VAProfileVP9Profile3 : VAEntrypointEncSliceLP
VAProfileHEVCMain12 : VAEntrypointVLD
VAProfileHEVCMain12 : VAEntrypointEncSlice
VAProfileHEVCMain422_10 : VAEntrypointVLD
VAProfileHEVCMain422_10 : VAEntrypointEncSlice
VAProfileHEVCMain422_12 : VAEntrypointVLD
VAProfileHEVCMain422_12 : VAEntrypointEncSlice
VAProfileHEVCMain444 : VAEntrypointVLD
VAProfileHEVCMain444 : VAEntrypointEncSliceLP
VAProfileHEVCMain444_10 : VAEntrypointVLD
VAProfileHEVCMain444_10 : VAEntrypointEncSliceLP
VAProfileHEVCMain444_12 : VAEntrypointVLD
VAProfileHEVCSccMain : VAEntrypointVLD
VAProfileHEVCSccMain : VAEntrypointEncSliceLP
VAProfileHEVCSccMain10 : VAEntrypointVLD
VAProfileHEVCSccMain10 : VAEntrypointEncSliceLP
VAProfileHEVCSccMain444 : VAEntrypointVLD
VAProfileHEVCSccMain444 : VAEntrypointEncSliceLP
VAProfileAV1Profile0 : VAEntrypointVLD
VAProfileHEVCSccMain444_10 : VAEntrypointVLD
VAProfileHEVCSccMain444_10 : VAEntrypointEncSliceLP

vainfo after crash

vainfo --display drm --device /dev/dri/renderD128
libva info: VA-API version 1.15.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_13
Segmentation fault (core dumped)

FCLC commented 2 years ago

Brief side note: to eliminate variables related to latency, core scheduler weirdness etc. the source file was copied to /tmp and then always referred to via soft link, as was the output file.

E-cores were disabled, and the kernel is the default upstream 5.17.1 in a Debian/Ubuntu environment.

The primary display is connected to a dGPU; however, the iGPU does have a monitor connected to it. This monitor continued to function normally under Wayland.

eero-t commented 2 years ago

The source file in question is a Sony HDR10 demo file called "Sony Swordsmith HDR UHD 4K Demo.mp4"

@FCLC Please provide a link, to make sure it's the same version you are using. GuC and HuC FW versions would also be good to know. (I'm not a media developer, but these have been needed for bugs I've reported myself.)

FCLC commented 2 years ago

The source file in question is a Sony HDR10 demo file called "Sony Swordsmith HDR UHD 4K Demo.mp4"

@FCLC Please provide a link, to make sure it's the same version you are using. GuC and HuC FW versions would also be good to know. (I'm not a media developer, but these have been needed for bugs I've reported myself.)

Sure, file can be found here: https://4kmedia.org/sony-swordsmith-hdr-uhd-4k-demo/

MD5 of the file: a4dcfe93ab98d7e582b2554e7a8008c9 Sony Swordsmith HDR UHD 4K Demo.mp4

SHA1 if preferred: 292eb58f69ae17aabd576aff781953ca6bae9051 Sony Swordsmith HDR UHD 4K Demo.mp4

or SHA 512: e0e3a7f2402d154eb7d2c1f0a11c7915f98e1ea2ff8c6eb8293b14864958bcc6d392d264d647d4658e3f1e6bc193849faea1e13c3efa326522e7835e2cffe779 Sony Swordsmith HDR UHD 4K Demo.mp4
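For anyone downloading the clip to reproduce this, a quick way to confirm it matches the checksums above is sketched below. The digests in `POSTED` are copied from this comment; the helper names `file_digests` and `check_against_posted` are made up for illustration, and the file path is whatever you saved the download as.

```python
import hashlib

# Digests posted in this comment for "Sony Swordsmith HDR UHD 4K Demo.mp4".
POSTED = {
    "md5": "a4dcfe93ab98d7e582b2554e7a8008c9",
    "sha1": "292eb58f69ae17aabd576aff781953ca6bae9051",
    "sha512": (
        "e0e3a7f2402d154eb7d2c1f0a11c7915f98e1ea2ff8c6eb8293b14864958bcc6"
        "d392d264d647d4658e3f1e6bc193849faea1e13c3efa326522e7835e2cffe779"
    ),
}


def file_digests(path, algorithms, chunk_size=1 << 20):
    """Hash the file once, feeding every chunk to all requested algorithms."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def check_against_posted(path):
    """Return {algorithm: True/False} depending on whether each digest matches."""
    actual = file_digests(path, POSTED)
    return {name: actual[name] == POSTED[name] for name in POSTED}
```

Usage would be something like `check_against_posted("Sony Swordsmith HDR UHD 4K Demo.mp4")`; every value should be `True` if the download is the same file.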

Firmware versions:

GuC

GuC firmware: i915/tgl_guc_62.0.0.bin
    status: LOADABLE
    version: wanted 62.0, found 62.0
    uCode: 325632 bytes
    RSA: 256 bytes

GuC status 0x00000001:
    Bootrom status = 0x0
    uKernel status = 0x0
    MIA Core status = 0x0

Scratch registers:
     0:     0x0
     1:     0x0
     2:     0x0
     3:     0x0
     4:     0x0
     5:     0x0
     6:     0x0
     7:     0x0
     8:     0x0
     9:     0x0
    10:     0x0
    11:     0x0
    12:     0x0
    13:     0x0
    14:     0x0
    15:     0x0

GuC log relay not created

HuC

HuC firmware: i915/tgl_huc_7.9.3.bin
    status: LOADABLE
    version: wanted 7.9, found 7.9
    uCode: 589504 bytes
    RSA: 256 bytes
HuC status: 0x00090001
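When comparing several machines or kernels, the interesting fields in the dumps above are the blob path and the wanted/found versions. A small sketch for extracting them from text in this layout follows; `parse_uc_info` is an illustrative name, and the sample is trimmed from the GuC dump above. It does not read anything from the system itself, you would paste or pipe the dump in.

```python
import re

# Patterns for the two fields of interest in the GuC/HuC dumps shown above.
FW_RE = re.compile(r"^(?P<kind>GuC|HuC) firmware: (?P<path>\S+)", re.MULTILINE)
VER_RE = re.compile(r"version: wanted (?P<wanted>\d+\.\d+), found (?P<found>\d+\.\d+)")


def parse_uc_info(text):
    """Return kind, blob path and wanted/found versions, or None if not present."""
    fw = FW_RE.search(text)
    ver = VER_RE.search(text)
    if fw is None or ver is None:
        return None
    return {
        "kind": fw["kind"],
        "path": fw["path"],
        "wanted": ver["wanted"],
        "found": ver["found"],
        # True when the driver asked for one version but the blob reports another.
        "mismatch": ver["wanted"] != ver["found"],
    }


# Trimmed copy of the GuC dump from this comment.
GUC_SAMPLE = """\
GuC firmware: i915/tgl_guc_62.0.0.bin
    status: LOADABLE
    version: wanted 62.0, found 62.0
    uCode: 325632 bytes
"""

info = parse_uc_info(GUC_SAMPLE)
print(info)
```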

FCLC commented 2 years ago

Something I'm noticing now is that the i915 driver seems to be using the Tiger Lake firmware versions, and that the upstream firmware git at https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915 does not have any GuC or HuC firmware for ADL-S, only for ADL-P.

The reason this is off-putting is that, though the VAAPI driver page (https://github.com/intel/media-driver#supported-platforms) lists ADL-S under TGLx, it also lists ADL-P. That leads me to believe that perhaps a bug was found when using the TGLx driver on ADL-P, and that the same or a similar bug may also be present on ADL-S but has yet to be diagnosed or upstreamed?

FCLC commented 2 years ago

At the same time, the updated firmware git has newer versions than were available in mainline 5.17.1: GuC there was 62.0, while git has tgl_guc version 69.0.3, yet the driver is defaulting to 62.x instead.

Why this is the case is unclear.

FCLC commented 2 years ago

Here are two straces, one of a functional AMD vainfo dump and one of the failing vainfo attempt on Alder Lake-S: log_amd.txt log_intel.txt

FCLC commented 2 years ago

More debugging: after rebuilding the kernel and rebooting, it continues to load version 62 by default.

I was able to use the tonemapping filter. However, the issue seems to be recovering the device after a failed attempt. While attempting to change the parameters for a second test, I killed the active ffmpeg command (Ctrl+C) and the iGPU segfaulted.

Now back to square one.

This may be related to recovering the chip after an error/buffer overflow?

FCLC commented 2 years ago

editing drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c manually to use version tgl 69.0.3 instead of tgl 62.0.0 may be a possible way forward.

Jexu commented 2 years ago

Hi @FCLC, what is your platform, ADL-P or ADL-S? There are some differences between the two.

eero-t commented 2 years ago

editing drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c manually to use version tgl 69.0.3 instead of tgl 62.0.0 may be a possible way forward.

@FCLC Besides fixes, GuC also has API changes now and then; that's why a specific i915 version loads a specific GuC version. Therefore changing the GuC version in the i915 sources is not a good idea, unless you somehow know which versions are compatible with each other.

softworkz commented 2 years ago

This seems to have a much wider scope than is already being discussed.

We see this with a different device:

'Device 32902:39497' Id:39497 (Driver: Intel iHD driver for Intel(R) Gen Graphics - 21.2.2 (1dd7d7f), Vendor: 32902)

The device id resolves to: TigerLake-LP GT2 [Iris Xe Graphics]

It is not specific to a certain video, nor to HDR, HEVC, or 4K. It happens with a simple Full HD H.264 video.

Details can be found here: https://emby.media/community/index.php?/topic/107064-quicksync-works-once-per-boot-then-stops-working-and-uses-software/&do=findComment&comment=1130529

Let me know which information you might need. I can also instruct our tester to try certain things.

Thanks, softworkz

FCLC commented 2 years ago

Hi @FCLC, what is your platform, ADL-P or ADL-S? There are some differences between the two.

Hi @Jexu, the platform is ADL-S, specifically a 12700K.

FCLC commented 2 years ago

editing drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c manually to use version tgl 69.0.3 instead of tgl 62.0.0 may be a possible way forward.

@FCLC Besides fixes, GuC also has API changes now and then; that's why a specific i915 version loads a specific GuC version. Therefore changing the GuC version in the i915 sources is not a good idea, unless you somehow know which versions are compatible with each other.

Part of the thinking is that upstream git (Linux/next 2022-04-03 for the 5.18 cycle) already has TGL 69.0.3 marked as the expected version for ADL-S, so in theory it should be loadable.

I ran into build errors, typical of upstream builds, especially pre-RC1, so I wasn't able to experiment personally.

A different error that may or may not be related is HEVC encoding:

Any and all HEVC encoded content output by ADL-S seems to be coming out either a green or pink mess. Using the same source file above, I can change which colour is dominant by changing the time start parameter, so perhaps something to do with the inter frame data being written to the file? H264 does not experience this issue.

FCLC commented 2 years ago

'Device 32902:39497' Id:39497 (Driver: Intel iHD driver for Intel(R) Gen Graphics - 21.2.2 (1dd7d7f), Vendor: 32902)

The device id resolves to: TigerLake-LP GT2 [Iris Xe Graphics] [...] softworkz

Both of these devices (tgl and ADL-S) load the same subversion of the HuC and GuC firmware per the docs and the driver source, so I'd presume that the point of overlap is there.

It seems to me that this may be related to resetting the state of the device? The kernel source has options for the heartbeat, hang detection and so on, but in this case I'm not seeing that detection being asserted/sent to kernel logs.

FCLC commented 2 years ago

I'm attempting with 5.18-rc1 instead of next; I will report back when I can.

Jexu commented 2 years ago

To summarize the issues you saw:

1. The GPU hang occurs with ffmpeg transcode on TGL/ADL-S. (Please provide the log in /sys/class/drm/card0/error, to check whether the HEVC clip has real tile.)
2. The GPU is crashed after the first hang occurs and needs a reboot to recover. (Please check ll /dev/dri after the first hang; the i915 driver/GuC fails to reset the GPU, which normally should not happen.)
3. Did you try ffmpeg decode only, without encode (transcode)?

FCLC commented 2 years ago

1. The GPU hang occurs with ffmpeg transcode on TGL/ADL-S. (Please provide the log in /sys/class/drm/card0/error, to check whether the HEVC clip has real tile.)

I can't speak for TGL. @softworkz, could you perhaps ping your tester and have them run more exhaustive testing, in case the previous testing doesn't answer the above?

As for ADL-S:

$ sudo cat /sys/class/drm/renderD128/device/drm/card0/error 
cat: /sys/class/drm/renderD128/device/drm/card0/error: No such device
$ vainfo --display drm --device /dev/dri/renderD128 
libva info: VA-API version 1.15.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_13
Segmentation fault (core dumped)
$ sudo cat /sys/class/drm/renderD128/device/drm/card0/error 
cat: /sys/class/drm/renderD128/device/drm/card0/error: No such device
$ sudo cat /sys/class/drm/card0/error 
cat: /sys/class/drm/card0/error: No such device

2. The GPU is crashed after the first hang occurs and needs a reboot to recover. (Please check ll /dev/dri after the first hang; the i915 driver/GuC fails to reset the GPU, which normally should not happen.)

Not certain what you mean here; if you mean whether the device is still present in /dev, it is:

ls /dev/dri/
by-path  card0  card1  renderD128  renderD129

3. Did you try ffmpeg decode only, without encode (transcode)?

Decode is fine prior to the initial crash.

Testing with the standard ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -i source.mp4 -f null - as a way to decode and send the output directly to null, the decoder operates fine prior to the crash.

After the first crash, decode and encode are impossible.

After the test with 5.18-rc1, I will try ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -i input.mp4 -c:v libx264 -crf 20 output.mp4
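To repeat the two tests above in a loop (decode only, then hardware decode with software encode), something along these lines could be used. It is only a sketch: the ffmpeg flags are exactly the ones from the commands in this comment, while the function names and the `run` helper are made up for illustration, and the file names are placeholders.

```python
import subprocess


def decode_only_cmd(source, device="/dev/dri/renderD128"):
    """VAAPI decode, frames discarded through the null muxer."""
    return [
        "ffmpeg", "-hwaccel", "vaapi", "-hwaccel_device", device,
        "-i", source, "-f", "null", "-",
    ]


def decode_sw_encode_cmd(source, output, device="/dev/dri/renderD128", crf=20):
    """VAAPI decode followed by a software libx264 encode."""
    return [
        "ffmpeg", "-hwaccel", "vaapi", "-hwaccel_device", device,
        "-i", source, "-c:v", "libx264", "-crf", str(crf), output,
    ]


def run(cmd, timeout_s=600):
    """Run a command; return (exit code, last few stderr lines).

    Raises subprocess.TimeoutExpired if ffmpeg hangs longer than timeout_s.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return proc.returncode, proc.stderr.splitlines()[-5:]
```

Calling `run(decode_only_cmd("source.mp4"))` several times in a row and checking for a non-zero exit code gives a crude pass/fail signal; the timeout matters here because a hung GPU can leave ffmpeg blocked rather than exiting.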

Jexu commented 2 years ago

Per your test and log:

1. I will check whether the HEVC clip tested above has real tile. As far as I know, there is a known issue in the i915 driver with bond submission, and real tile decode + encode may trigger it. Disabling scalability in the media driver may help.
2. Even if a hang happens, the i915 driver and GuC should recover the GPU successfully instead of failing; this may be a kernel issue.
[155534.302952] [drm:__uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110
[155534.394325] i915 0000:00:02.0: [drm] ERROR Failed to reset chip
[155534.394347] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x258/0x2d0 [i915]
[155534.497281] [drm:__uc_sanitize [i915]] ERROR Failed to reset GuC, ret = -110

FCLC commented 2 years ago

Just booted up into 5.18-rc1.

ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i ~/Videos/hdr_source.mp4 -f null -

was able to complete multiple times in a row without error.

Full output from the command is here:

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/home/felix/Videos/hdr_source.mp4':
  Metadata:
    major_brand : isom
    minor_version : 1
    compatible_brands: isom
    creation_time : 2016-10-24T05:33:14.000000Z
  Duration: 00:01:26.10, start: 0.000000, bitrate: 71567 kb/s
  Stream #0:0[0x1](und): Video: hevc (Main 10), 1 reference frame (hvc1 / 0x31637668), yuv420p10le(tv, bt2020nc/bt2020/smpte2084, topleft), 3840x2160 [SAR 1:1 DAR 16:9], 71382 kb/s, 59.94 fps, 59.94 tbr, 60k tbn (default)
   Metadata:
    creation_time : 2016-10-24T06:29:51.000000Z
    handler_name : Video Media Handler
    vendor_id : [0][0][0][0]
    encoder : HEVC Coding
  Stream #0:1[0x2](eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 192 kb/s (default)
   Metadata:
    creation_time : 2016-10-24T06:29:51.000000Z
    handler_name : Sound Media Handler
    vendor_id : [0][0][0][0]
[AVHWDeviceContext @ 0x23c0640] libva: VA-API version 1.15.0
[AVHWDeviceContext @ 0x23c0640] libva: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
[AVHWDeviceContext @ 0x23c0640] libva: Found init function __vaDriverInit_1_13
[AVHWDeviceContext @ 0x23c0640] libva: va_openDriver() returns 0
[AVHWDeviceContext @ 0x23c0640] Initialised VAAPI connection: version 1.15
[AVHWDeviceContext @ 0x23c0640] VAAPI driver: Intel iHD driver for Intel(R) Gen Graphics - 21.4.1 (be92568).
[AVHWDeviceContext @ 0x23c0640] Driver not found in known nonstandard list, using standard behaviour.
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (native) -> wrapped_avframe (native))
  Stream #0:1 -> #0:1 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
[graph 0 input from stream 0:0 @ 0x6915200] w:3840 h:2160 pixfmt:vaapi tb:1/60000 fr:60000/1001 sar:1/1
[graph_1_in_0_1 @ 0x69223c0] tb:1/48000 samplefmt:fltp samplerate:48000 chlayout:0x3
[format_out_0_1 @ 0x39914c0] auto-inserting filter 'auto_aresample_0' between the filter 'Parsed_anull_0' and the filter 'format_out_0_1'
[auto_aresample_0 @ 0x482fb40] ch:2 chl:stereo fmt:fltp r:48000Hz -> ch:2 chl:stereo fmt:s16 r:48000Hz
Output #0, null, to 'pipe:':
  Metadata:
    major_brand : isom
    minor_version : 1
    compatible_brands: isom
    encoder : Lavf59.16.100
  Stream #0:0(und): Video: wrapped_avframe, 1 reference frame, vaapi(tv, bt2020nc/bt2020/smpte2084, progressive, topleft), 3840x2160 (0x0) [SAR 1:1 DAR 16:9], q=2-31, 200 kb/s, 59.94 fps, 59.94 tbn (default)
   Metadata:
    creation_time : 2016-10-24T06:29:51.000000Z
    handler_name : Video Media Handler
    vendor_id : [0][0][0][0]
    encoder : Lavc59.18.100 wrapped_avframe
  Stream #0:1(eng): Audio: pcm_s16le, 48000 Hz, stereo, s16, 1536 kb/s (default)
   Metadata:
    creation_time : 2016-10-24T06:29:51.000000Z
    handler_name : Sound Media Handler
    vendor_id : [0][0][0][0]
    encoder : Lavc59.18.100 pcm_s16le
No more output streams to write to, finishing.:23.73 bitrate=N/A speed=6.05x
frame= 5160 fps=363 q=-0.0 Lsize=N/A time=00:01:26.10 bitrate=N/A speed=6.05x
video:2258kB audio:16144kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Input file #0 (/home/felix/Videos/hdr_source.mp4):
  Input stream #0:0 (video): 5160 packets read (768127816 bytes); 5160 frames decoded;
  Input stream #0:1 (audio): 4037 packets read (2066944 bytes); 4036 frames decoded (4132864 samples);
  Total: 9197 packets (770194760 bytes) demuxed
Output file #0 (pipe:):
  Output stream #0:0 (video): 5160 frames encoded; 5160 packets muxed (2311680 bytes);
  Output stream #0:1 (audio): 4036 frames encoded (4132864 samples); 4036 packets muxed (16531456 bytes);
  Total: 9196 packets (18843136 bytes) muxed
[AVIOContext @ 0x238dd80] Statistics: 770260444 bytes read, 2 seeks

Subsequently running `ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -i hdr_source.mp4 -c:v libx264 -crf 20 sdr_out.mp4 -y`

seems to be running fine (currently awaiting the file to complete; encoding UHD H264 420 10bit@60fps isn't a small task even on the latest of chips).

The file is fine: (gnome screen cap of ffplay output image as POC)

I'm now testing the same with output to libx265, and will follow up afterwards with encoding using vaapi.

Jexu commented 2 years ago

Good to know it works with 5.18-rc1.

FCLC commented 2 years ago

Now attempting to run

`ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i hdr_source.mp4 -vf tonemap_vaapi -c:v h264_vaapi sdr_out.mp4 -y`

and

`ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i hdr_source.mp4 -vf tonemap_vaapi -c:v hevc_vaapi sdr_out.mp4 -y`

Encoding with vaapi -i hdr_source.mp4 -vf tonemap_vaapi -c:v h264_vaapi

encoding with "vaapi -i hdr_source.mp4 -vf tonemap_vaapi -c:v h264_vaapi"

encoding with vaapi -i hdr_source.mp4 -vf tonemap_vaapi -c:v hevc_vaapi

Unfortunately HEVC continues to be completely broken

log from ffplay

ffplay-20220407-103119.log

FCLC commented 2 years ago

As a sanity check, I've now run HEVC p010 and nv12 ffmpeg -init_hw_device vaapi=decdev:/dev/dri/renderD128 -init_hw_device vaapi=encdev:/dev/dri/renderD129 -hwaccel vaapi -hwaccel_device decdev -hwaccel_output_format vaapi -i hdr_source.mp4 -filter_hw_device encdev -vf 'tonemap_vaapi,hwdownload,format=nv12,hwupload' -c:v hevc_vaapi -b:v 5M sdr_out.mp4

H264 nv12 ffmpeg -init_hw_device vaapi=decdev:/dev/dri/renderD128 -init_hw_device vaapi=encdev:/dev/dri/renderD129 -hwaccel vaapi -hwaccel_device decdev -hwaccel_output_format vaapi -i hdr_source.mp4 -filter_hw_device encdev -vf 'tonemap_vaapi,hwdownload,format=p010,hwupload' -c:v hevc_vaapi -b:v 5M sdr_out.mp4

They both use the Intel iGPU and VAAPI to decode the HEVC 10-bit HDR file and tonemap it to SDR Rec. 709. They then pass the result on to a known-good, fully functional AMD GPU VAAPI instance at dev/dri/rend129 for render at either

p010 hevc

nv12 hevc

nv12 h264

interesting result:

running ffmpeg -init_hw_device vaapi=decdev:/dev/dri/renderD128 -init_hw_device vaapi=encdev:/dev/dri/renderD129 -hwaccel vaapi -hwaccel_device decdev -hwaccel_output_format vaapi -i hdr_source.mp4 -filter_hw_device encdev -vf 'tonemap_vaapi=format=p010,hwdownload,hwupload' -c:v hevc_vaapi -b:v 30M sdr_out.mp4 -y led to

This was caused by omitting the format=p010 between hwdownload and hwupload.

Re-adding the format fixes the issue, which makes me wonder if that may be responsible for the hevc_vaapi issues above?
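To avoid dropping the explicit format by accident when switching between p010 and nv12, the filter chain can be generated rather than typed. This is only a minimal sketch: the helper name `vaapi_roundtrip_filter` is made up for illustration, and it emits the same chain shape used in the sanity-check commands earlier in this comment thread (tonemap, download, pin the software format, upload).

```sh
# Hypothetical helper (not part of ffmpeg): prints a -vf chain that always
# carries an explicit format= between hwdownload and hwupload.
vaapi_roundtrip_filter() {
    # $1 = software pixel format, e.g. p010 or nv12
    fmt=${1:-}
    if [ -z "$fmt" ]; then
        echo "usage: vaapi_roundtrip_filter <pix_fmt>" >&2
        return 1
    fi
    printf 'tonemap_vaapi,hwdownload,format=%s,hwupload\n' "$fmt"
}

# Example use (not executed here):
#   ffmpeg ... -vf "$(vaapi_roundtrip_filter p010)" -c:v hevc_vaapi -b:v 5M sdr_out.mp4
```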

FCLC commented 2 years ago

Testing only the hevc encoder with ffmpeg -hide_banner -loglevel verbose -vaapi_device /dev/dri/renderD128 -i hdr_source.mp4 -vf 'format=p010,hwupload' -c:v hevc_vaapi -b:v 15M -profile:v 2 sdr_out.mp4

Yields:

attempting hevc_qsv using: ffmpeg -hide_banner -loglevel verbose -init_hw_device qsv=hw -filter_hw_device hw -i hdr_source.mp4 -vf hwupload=extra_hw_frames=64,format=qsv -c:v hevc_qsv -b:v 30M sdr_out.mp4

tracking down where the problem is:

As of now there are a few issues:

1. TGLx has issues with re-initializing the GPU if the device is killed during an active session.
2. hevc_vaapi encode has an issue whereby, diverging from the QSV code path, it produces garbage data.
3. Interactions between QSV and vaapi seem broken when mapping between internal surfaces. Running `ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_output_format vaapi -vaapi_device /dev/dri/renderD128 -i hdr_source.mp4 -vf 'scale_vaapi=1920:1080,hwmap=derive_device=qsv,format=qsv' -c:v hevc_qsv -b:v 30M sdr_out.mp4 -y` results in
```
Error while filtering: Cannot allocate memory
Failed to inject frame into filter network: Cannot allocate memory
Error while processing the decoded data for stream #0:0
[AVIOContext @ 0x22ec040] Statistics: 0 bytes written, 0 seeks, 0 writeouts
^C^C^CReceived > 3 system signals, hard exiting
```


running 
`ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_output_format vaapi -vaapi_device /dev/dri/renderD128 -i hdr_source.mp4 -vf 'hwmap=derive_device=qsv,format=qsv' -c:v hevc_qsv -b:v 30M sdr_out.mp4 -y`

[hevc_qsv @ 0x310c440] Using input frames context (format qsv) with hevc_qsv encoder.
[hevc_qsv @ 0x310c440] Encoder: input is video memory surface
corrupted double-linked list
Aborted (core dumped)

FCLC commented 2 years ago

Finally, a piece of good news:

The issue where quitting an active session crashes the vaapi device into a completely unrecoverable state does seem to be solved in 5.18-rc1.

FCLC commented 2 years ago

I've created a small bash script to begin testing speed and options more exhaustively.

one thing I'm noticing is that the qsv and vaapi encoders seem to have different performance characteristics.

running

$ cat benchmark.sh 
#!/bin/bash 

echo "vaapi render 128- intel"

ffmpeg -loglevel quiet -stats -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i hdr_source.mp4 -f null -

echo "QSV render 128- intel" 

ffmpeg -loglevel quiet -stats -hwaccel qsv -c:v hevc_qsv -i hdr_source.mp4 -f null -

results in:

vaapi render 128- intel
frame= 5160 fps=368 q=-0.0 Lsize=N/A time=00:01:26.10 bitrate=N/A speed=6.14x    
QSV render 128- intel
frame= 5160 fps=344 q=-0.0 Lsize=N/A time=00:01:26.10 bitrate=N/A speed=5.75x

Which means that the vaapi path is outperforming the QSV code path in this decode-only test, by roughly 7% (368 vs. 344 fps).
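For comparing more runs than these two, the fps figure can be pulled out of the final `-stats` line and turned into a percentage. A rough sketch, assuming the summary line keeps the `fps=NNN` layout shown above; the function names `stats_fps` and `percent_faster` are made up for this example.

```sh
# Extract the number after "fps=" from an ffmpeg -stats summary line on stdin.
stats_fps() {
    sed -n 's/.*fps= *\([0-9.][0-9.]*\).*/\1/p'
}

# Print how much faster $1 is than $2, in percent, with one decimal place.
percent_faster() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / b * 100 }'
}

# Example use (not executed here); ffmpeg writes its stats to stderr and
# separates progress updates with carriage returns:
#   fps=$(ffmpeg -loglevel quiet -stats ... -f null - 2>&1 | tr '\r' '\n' | tail -n 1 | stats_fps)
```

With the two result lines above, `percent_faster 368 344` prints `7.0`.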

Jexu commented 2 years ago

@jvrobert Please try 5.18 rc-1 as @FCLC said to check if it helps for your issue and help close this if solved.

eero-t commented 2 years ago

one thing I'm noticing is that the qsv and vaapi encoders seem to have different performance characteristics.

@FCLC perf is out of scope for this ticket.

FYI: These ffmpeg backends have some differences in how they do threading, syncing, allocation etc, and this can change from FFmpeg version to another. See for example these old tickets of mine:

mem: https://trac.ffmpeg.org/ticket/7943
perf: https://trac.ffmpeg.org/ticket/7706
perf: https://trac.ffmpeg.org/ticket/7690

If you would use Gstreamer, or "sample_multi_transcode" tool from MediaSDK / OneVPL to do the same thing, you'd likely see some differences to FFmpeg perf with these APIs too, for same reasons. But yes, VA-API has been faster than QSV in about everything since FFmpeg git HEAD fixed 3 year old VA-API perf regression a bit over month ago.

softworkz commented 2 years ago

But yes, VA-API has been faster than QSV in about everything since FFmpeg git HEAD fixed 3 year old VA-API perf regression a bit over month ago.

I can't confirm this. From our experience on many different user systems, QSV has shown better performance in almost every case with H.264 encoding and various decoders and processing filters.

Which encoding codec are you talking about?

softworkz commented 2 years ago

@jvrobert Please try 5.18 rc-1 as @FCLC said to check if it helps for your issue and help close this if solved.

Do I understand correctly that the only way to fix this is to change the kernel?

eero-t commented 2 years ago

Which encoding codec are you talking about?

@softworkz See e.g. the above FFmpeg 7706 ticket for QSV & VA-API command lines and commit ID you need (there's no FFmpeg release yet that would include fix to that large 3 years old VA-API perf regression, you need to build Git master yourself).

Do I understand correctly that the only way to fix this is to change the kernel?

While GPU hang could be due either kernel or user-space driver issue, if hang recovery fails, that's always a kernel bug (separate from the hang itself).
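The distinction between "a hang happened" and "hang recovery failed" can be checked mechanically against the kernel log. The following is a minimal sketch that only looks for the message strings seen in the dmesg excerpts in this thread ("GPU HANG", "Failed to reset chip", "Failed to reset GuC"); other kernel versions may word these differently, and the function name `classify_i915_log` is hypothetical.

```sh
# Reads kernel log text on stdin and prints one of:
#   ok            no hang message seen
#   hang          a GPU hang was logged, but no failed reset
#   reset-failed  a chip/GuC reset failure was logged (recovery did not work)
classify_i915_log() {
    awk '
        /GPU HANG/                                 { hang = 1 }
        /Failed to reset chip|Failed to reset GuC/ { failed = 1 }
        END {
            if (failed)    print "reset-failed"
            else if (hang) print "hang"
            else           print "ok"
        }
    '
}

# Example use (not executed here):
#   sudo dmesg | classify_i915_log
```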

softworkz commented 2 years ago

@eero-t - you wrote:

But yes, VA-API has been faster than QSV in about everything

while the ticket 7706 says:

VAAPI H264 transcode performance dropped 20-30%

softworkz commented 2 years ago

While GPU hang could be due either kernel or user-space driver issue, if hang recovery fails, that's always a kernel bug (separate from the hang itself).

I don't understand. What does this mean? In the past years there didn't exist any case where the only way to use that hardware would have been to change the kernel version. Many of our users can't, don't want, aren't allowed or aren't able to make such change.

eero-t commented 2 years ago

@softworkz While I'm answering you once more, none of this is relevant to the media-driver / GPU hang discussion here. Please ask your questions in the appropriate place for them. Kernel bugs & updates belong to kernel driver projects (in this case, i915) and/or upstream kernel. FFmpeg performance belongs to FFmpeg project (e.g. tickets I linked).

I don't understand. What does this mean? In the past years there didn't exist any case where the only way to use that hardware would have been to change the kernel version.

Only kernel can do device hang recovery. If that is buggy, you need a fix. Fix is in kernel. Besides bug fixes, you often need kernel updates also to get support for new HW devices.

Many of our users can't, don't want, aren't allowed or aren't able to make such change.

If you cannot get / build newer kernel now, then you obviously need to wait.

E.g. Ubuntu LTS releases get new HWE (hardware enabling) kernels every few months: https://wiki.ubuntu.com/Kernel/LTSEnablementStack

And enterprise distros occasionally backport fixes from latest upstream to their ancient kernel versions. If you want to expedite that process, let your ISV know the importance (and existence) of given kernel fix.

But yes, VA-API has been faster than QSV in about everything

while the ticket 7706 says:

VAAPI H264 transcode performance dropped 20-30%

That (3 year old) regression was fixed in FFmpeg master over month ago. Before that regression, and after its fix, VA-API backend in FFmpeg is faster than QSV one in my tests. During the 3 years while that perf regression was in effect, doing single transcode with FFmpeg was faster with QSV than VA-API in those tests, but that was FFmpeg bug.

(Note: this was on Ubuntu i.e. using powersave / ondemand governor, with latest drm-tip Git kernel in addition to latest media stack from Git. Doing many parallel transcode operations in parallel, was still in general faster with VA-API during that period in my tests though, QSV was faster only when doing single transcode instance.)

softworkz commented 2 years ago

While I'm answering you once more

This is very generous of you.

Before that regression, and after its fix, VA-API backend in FFmpeg is faster than QSV one in my tests. During the 3 years while that perf regression was in effect, doing single transcode with FFmpeg was faster with QSV than VA-API in those tests, but that was FFmpeg bug.

Thanks for the explanation, the timeline wasn't obvious.

FFmpeg performance belongs to FFmpeg project (e.g. tickets I linked).

I'm afraid, but I responded to your comment. I hadn't brought up that subject.

none of this is relevant to the media-driver / GPU hang discussion here

Could you kindly let me know whether the symptoms I had referenced (https://emby.media/community/index.php?/topic/107064-quicksync-works-once-per-boot-then-stops-working-and-uses-software/page/2/#comment-1130529) are relevant to the

media-driver / GPU hang discussion here

or solely a matter of that kernel update?

Thanks, sw

FCLC commented 2 years ago

one thing I'm noticing is that the qsv and vaapi encoders seem to have different performance characteristics.

@FCLC perf is out of scope for this ticket.

FYI: These ffmpeg backends have some differences in how they do threading, syncing, allocation etc, and this can change from FFmpeg version to another. See for example these old tickets of mine:
* mem: https://trac.ffmpeg.org/ticket/7943

* perf: https://trac.ffmpeg.org/ticket/7706

* perf: https://trac.ffmpeg.org/ticket/7690
If you would use Gstreamer, or "sample_multi_transcode" tool from MediaSDK / OneVPL to do the same thing, you'd likely see some differences to FFmpeg perf with these APIs too, for same reasons. But yes, VA-API has been faster than QSV in about everything since FFmpeg git HEAD fixed 3 year old VA-API perf regression a bit over month ago.

Sounds good, was more so a minor observation as a side effect of different testing scenarios.

FCLC commented 2 years ago

@jvrobert Please try 5.18 rc-1 as @FCLC said to check if it helps for your issue and help close this if solved.

Do I understand correctly that the only way to fix this is to change the kernel?

more so this may be a way forward that I've found.

Once we bisect which change is responsible for the fix, it can be backported to the LTS kernels used by distributions (examples on the Ubuntu side: 18.04, 20.04), such as 5.4.x, 5.10.x, etc.

FCLC commented 2 years ago

for those not experienced in self building kernels:

The following is a known good kernel config for adl-s on a 12700k running pop-os (basically modified version of ubuntu that is rolling release):

config.txt

You'll have to rename it to .config; `mv config.txt .config` will do it.

Then run `make menuconfig` to double check that everything seems fine.

NB: if trying to build 5.18 RC series kernels, Linus has insisted on re-enabling the -werror parameter for gcc in kernel builds, meaning that RC-1 failed until I went and disabled certain complaining modules around gvtg and kvm as well as werror checking for certain areas.

A unique setup of my environment is that I'm running GCC-12 for developing avx512-fp16 BLAS kernels, normal builds don't need this and should instead use gcc-11.2 mainline

FCLC commented 2 years ago

@eero-t @Jexu regarding the HEVC encoder issues seen in https://github.com/intel/media-driver/issues/1342#issuecomment-1091817370 and https://github.com/intel/media-driver/issues/1342#issuecomment-1091913451

in reference to the second issue listed below, it has not been documented in https://github.com/intel/media-driver#known-issues-and-limitations

As of now there are a few issues:

1. TGLx has issues with re-initializing the GPU if the device is killed during an active session.

2. hevc_vaapi encode has an issue whereby, diverging from the QSV code path, it produces garbage data.

3. Interactions between QSV and vaapi seem broken when mapping between internal surfaces.
   running `ffmpeg -loglevel verbose -hide_banner -hwaccel vaapi -hwaccel_output_format vaapi -vaapi_device /dev/dri/renderD128 -i hdr_source.mp4 -vf 'scale_vaapi=1920:1080,hwmap=derive_device=qsv,format=qsv' -c:v hevc_qsv -b:v 30M sdr_out.mp4 -y `
   results in

Error while filtering: Cannot allocate memory
Failed to inject frame into filter network: Cannot allocate memory
Error while processing the decoded data for stream #0:0
[AVIOContext @ 0x22ec040] Statistics: 0 bytes written, 0 seeks, 0 writeouts
^C^C^CReceived > 3 system signals, hard exiting

and I also don't see anything regarding issue 3.

Issue 2 seems to me as a media-driver related issue and should be solved here.

Issue 3 may be a combination of ffmpeg hardware surface mappings as well as Intel libva driver issues. However, the above command works fine on previous-generation chips, so I'm leaning towards the issue being on the graphics stack side of things.

Should we be opening new issues for these?

eero-t commented 2 years ago

or solely a matter of that kernel update?

@softworkz Anything where reboot is needed to restore GPU to a working state, is a kernel (or FW) bug. Looking at your dmesg, issue could be also on FW side, but that will typically also need kernel update, as specific kernel versions load only specific FW versions (that have compatible API/ABI).

softworkz commented 2 years ago

or solely a matter of that kernel update?

@softworkz Anything where reboot is needed to restore GPU to a working state, is a kernel (or FW) bug. Looking at your dmesg, issue could be also on FW side, but that will typically also need kernel update, as specific kernel versions load only specific FW versions (that have compatible API/ABI).

Thanks a lot, that makes the situation more clear, but also even more unpleasant (or almost impossible) to deliver as part of an installation package (essentially a whole range of installation packages for multiple distros and platforms). I choose the blue pill... :-)

eero-t commented 2 years ago

For an update, you do not necessarily need to replace whole kernel or do reboot (unless GPU is already in unrecoverable state). Just modprobing updated i915 module (after installing compatible FW) can be enough, but that is not necessarily easier. It still needs to be built for that particular kernel version, and to modprobe new version, you need to rmmod old version first (which can be hard if you do not know what is blocking that).

softworkz commented 2 years ago

Thanks a lot. I'll see what our packaging expert will say, but it doesn't really sound feasible.

Trying to look at it from a different angle: what do you think how long it might take until this turns into a rare issue, only affecting a very small percentage of Linux installations (of all flavors)?

FCLC commented 2 years ago

Trying to look at it from a different angle: what do you think how long it might take until this turns into a rare issue, only affecting a very small percentage of Linux installations (of all flavors)?

will depend very much on the distributions that emby has in their LTS pipe.

For now something that you may want to consider is that on installation, check

if (platform == adl-s || platform == adl-n || platform == adl-p || platform == tgl || platform == rkl || platform == rpl-s) {

if (kernel version < 5.18) { print "platform has known issues with VAAPI using intel iGPU's. Disabling VAAPI and falling back on QSV and software filters"

} }

perform this check on updates? I'd assume emby has a working mechanism checking for available mechanisms, so this shouldn't be too hard to add in as another condition.

the more precise way to check would be via the huc and guc firmware versions, both of which can be checked via sys
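For an installer written as a shell script, the kernel-version half of that check could look roughly like the sketch below. The 5.18 threshold is simply the version suggested above, the function name `kernel_at_least` is made up, and platform detection (adl-s, tgl, etc.) is deliberately left out since how to do it depends on the installer.

```sh
# Succeeds if the kernel release in $1 (e.g. the output of `uname -r`)
# is at least the "major.minor" version given in $2.
kernel_at_least() (
    have=${1%%-*}                        # "5.18.0-rc1" -> "5.18.0"
    have_major=${have%%.*}
    rest=${have#*.}
    have_minor=${rest%%.*}
    have_minor=${have_minor%%[!0-9]*}    # drop suffixes such as "+"
    want_major=${2%%.*}
    want_minor=${2#*.}
    if [ "$have_major" -ne "$want_major" ]; then
        [ "$have_major" -gt "$want_major" ]
    else
        [ "$have_minor" -ge "$want_minor" ]
    fi
)

# Example use (not executed here):
#   if ! kernel_at_least "$(uname -r)" 5.18; then
#       echo "platform has known issues with VAAPI; disabling it"
#   fi
```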

softworkz commented 2 years ago

I'd assume emby has a working mechanism checking for available mechanisms, so this shouldn't be too hard to add in as another condition.

Yes, we have a detection calling libva directly.

the more precise way to check would be via the huc and guc firmware versions,

I guess you mean similar to this?

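struct drm_i915_getparam gp = {0};
int value = 0;
int fd = open("/dev/dri/renderD128", O_RDWR);

gp.param = I915_PARAM_HUC_STATUS;
gp.value = &value; /* the kernel writes the result through this pointer */

if (drmCommandWriteRead(fd, DRM_I915_GETPARAM, &gp, sizeof(gp)) == 0) {
    /* value now holds the HuC status reported by i915 */
}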

softworkz commented 2 years ago

print "platform has known issues with VAAPI using intel iGPU's. Disabling VAAPI and falling back on QSV and software filters"

Yup, that's similar to the plan I already made for JSL/EHL, which I previously thought would be the one and only painpoint..

FCLC commented 2 years ago

I guess you mean similar to this?

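struct drm_i915_getparam gp = {0};
int value = 0;
int fd = open("/dev/dri/renderD128", O_RDWR);

gp.param = I915_PARAM_HUC_STATUS;
gp.value = &value; /* the kernel writes the result through this pointer */

if (drmCommandWriteRead(fd, DRM_I915_GETPARAM, &gp, sizeof(gp)) == 0) {
    /* value now holds the HuC status reported by i915 */
}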

That should be workable. Otherwise if you're using something like a bash script for configure/install, you could cat /sys/kernel/debug/dri/0/gt/uc/guc_info and then also cat /sys/kernel/debug/dri/0/gt/uc/huc_info

edit:

example output on 5.18-rc1:

~/Videos$ sudo cat /sys/kernel/debug/dri/0/gt/uc/guc_info
GuC firmware: i915/tgl_guc_69.0.3.bin
    status: RUNNING
    version: wanted 69.0, found 69.0
    uCode: 342912 bytes
    RSA: 256 bytes

GuC status 0x8003f0ec:
    Bootrom status = 0x76
    uKernel status = 0xf0
    MIA Core status = 0x3

Scratch registers:
     0:     0x0
     1:     0x163fdf
     2:     0x40000
     3:     0x4000
     4:     0x40
     5:     0x2ec8
     6:     0x4680000c
     7:     0x0
     8:     0x0
     9:     0x0
    10:     0x0
    11:     0x0
    12:     0x0
    13:     0x0
    14:     0x0
    15:     0x0

GuC log relay not created

~/Videos$ sudo cat /sys/kernel/debug/dri/0/gt/uc/huc_info
HuC firmware: i915/tgl_huc_7.9.3.bin
    status: RUNNING
    version: wanted 7.9, found 7.9
    uCode: 589504 bytes
    RSA: 256 bytes
HuC status: 0x00090001
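A script could turn that output into a pass/fail check along these lines. This is a sketch only: it assumes the `status:` and `version: wanted X, found Y` lines keep the layout shown in the example output above (debugfs output is not a stable interface), and the function name `uc_fw_ok` is made up.

```sh
# Reads guc_info or huc_info text on stdin. Succeeds only if the firmware
# status is RUNNING and the "wanted" version equals the "found" version.
uc_fw_ok() {
    awk '
        $1 == "status:"  { status = $2 }
        $1 == "version:" { wanted = $3; sub(/,$/, "", wanted); found = $5 }
        END { exit !(status == "RUNNING" && wanted != "" && wanted == found) }
    '
}

# Example use (not executed here; debugfs needs root):
#   sudo cat /sys/kernel/debug/dri/0/gt/uc/guc_info | uc_fw_ok || echo "GuC firmware problem"
#   sudo cat /sys/kernel/debug/dri/0/gt/uc/huc_info | uc_fw_ok || echo "HuC firmware problem"
```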

FCLC commented 2 years ago

Yup, that's similar to the plan I already made for JSL/EHL, which I previously thought would be the one and only painpoint..

for what it's worth, I'm already having to do similar things to get around the lack of HDR support in the AMDGPU vaapi driver stack.

(Thankfully we're well past the dark days of OpenCL 1.0)

softworkz commented 2 years ago

get around the lack of HDR support in the AMDGPU vaapi driver

They are doing so little to get better AMD support into ffmpeg...we just have minimal support for these..

falling back on QSV and software filters

Why "falling back on QSV"?
